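This post presents the latest papers collected daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

A total of 593 papers were updated today, including:

Natural Language Processing: 94 papers
Information Retrieval: 17 papers
Computer Vision: 130 papers

Natural Language Processing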

1. 【2606.24841】Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models

Link: https://arxiv.org/abs/2606.24841

Authors: Ahmad Pouramini, Hesham Faili

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords:

Comments:

Click to view abstract

Abstract not available.

2. 【2606.24828】Less is More: Quality-Aware Training Data Selection for Scientific Summarization

Link: https://arxiv.org/abs/2606.24828

Authors: Maria Nefeli Paraskevopoulou, Tatiana Passali, Grigorios Tsoumakas

Categories: Computation and Language (cs.CL)

Keywords: datasets commonly treat, gold reference summaries, commonly treat author-written, summarization datasets commonly, scientific summarization datasets

Comments:

Click to view abstract

Abstract:Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific summarization datasets remain limited in scale and structure for modern long-context models. In this work, we address both challenges by a) constructing and releasing one of the largest biomedical and life science datasets for long-document summarization, containing 1.88 million PMC articles, and b) analyzing the reference quality of author-written abstracts with source-grounded and model-based metrics. We show that author-written abstracts vary in their alignment with the full article and that these quality signals can guide training-data selection. Training on selected high-quality subsets outperforms random sampling at matched training sizes and can match or exceed larger random subsets on factuality-oriented metrics. Our findings suggest that reference quality is an important factor in scientific summarization and that quality-aware data selection can improve training efficiency.

3. 【2606.24825】L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models

Link: https://arxiv.org/abs/2606.24825

Authors: Hariom Ingle, Ronit Ghode, Ishwari Gondkar, Jidnyasa Harad, Raviraj Joshi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: underpinning machine translation, task underpinning machine, foundational NLP task, NLP task underpinning, information extraction

Comments:

Click to view abstract

Abstract:Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most spoken languages worldwide, it remains severely under-resourced in annotated corpora and standardised evaluation benchmarks. Marathi presents unique challenges for computational modelling owing to its rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English. We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. We benchmark the dataset across six model families spanning HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific transformer MahaBERT-v2. The best system achieves 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. We release the dataset, annotation guidelines, and trained model checkpoints to foster further research in Marathi NLP.

4. 【2606.24820】SHERLOC: Structured Diagnostic Localization for Code Repair Agents

Link: https://arxiv.org/abs/2606.24820

Authors: Hovhannes Tamoyan, Sean Narenthiran, Erik Arakelyan, Mira Mezini, Boris Ginsburg

Categories: Computation and Language (cs.CL)

Keywords: solve repository-level coding, repository-level coding tasks, agents solve repository-level, LLM agents solve, Structured Hypothesis-driven Exploration

Comments:

Click to view abstract

Abstract:LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.

5. 【2606.24783】Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce

Link: https://arxiv.org/abs/2606.24783

Authors: Filippos Ventirozos, Matthew Shardlow

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Commercial NLP treats, Commercial NLP, conversion tool, treats the shopping, shopping chatbot

Comments: 8 pages, 1 figure. Vision paper, under review

Click to view abstract

Abstract:Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhaustively, the bottleneck is no longer matching products but acquiring trustworthy, decision-relevant information about them. We envision agentic e-commerce as a micro-transaction market for verified information: buyer agents spend fractions of a cent to progressively unlock seller- and reviewer-supplied data -- service histories, third-party test reports, bills of materials, audited sales and support metrics -- paid for a la carte under a freemium model, with reviewer trust scored reputationally. We sketch the architecture of such a market and argue that it rewards genuine product quality and yields truer competition than ranking-based storefronts. We then translate the vision into concrete NLP problems -- cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling -- and argue that these, not chat fluency, deserve the field's attention.

6. 【2606.24775】Are We Ready For An Agent-Native Memory System?

Link: https://arxiv.org/abs/2606.24775

Authors: Wei Zhou, Xuanhe Zhou, Shaokun Han, Hongming Xu, Guoliang Li, Zhiyu Li, Feiyu Xiong, Fan Wu

Categories: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: large language model, simple retrieval-augmented mechanisms, supports persistent information, dynamic lifecycle governance, persistent information storage

Comments: Paper list available at: [this https URL](https://github.com/OpenDataBox/awesome-agent-memory). Source code available at: [this https URL](https://github.com/OpenDataBox/MemoryData)

Click to view abstract

Abstract:Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at this https URL.

7. 【2606.24773】Posterior Refinement: Fast Language Generation via Any-Order Flow Maps

Link: https://arxiv.org/abs/2606.24773

Authors: Manan Agarwal, Sheel Shah, Chanhyuk Lee, Jaehoon Yoo, Jerry Huang, Seunghoon Hong, Aditi Raghunathan, Jinwoo Kim, Nicholas M. Boffi

Categories: Computation and Language (cs.CL)

Keywords: regenerate arbitrary subsets, Non-autoregressive generation offers, recursively critique, erase and regenerate, existing non-autoregressive models

Comments: 24 pages, 23 figures

Click to view abstract

Abstract:Non-autoregressive generation offers a powerful paradigm for iterative refinement, allowing models to recursively critique, erase and regenerate arbitrary subsets of tokens. However, existing non-autoregressive models fail to realize this potential. Masked Diffusion Models (MDMs) suffer from factorization error, causing sample quality to collapse when generating multiple tokens simultaneously. Flow Map Language Models (FMLMs) circumvent this bottleneck via joint sequence transport for excellent few-step generation, but sacrifice the inference-time flexibility of MDMs. We introduce FMLM+, a framework that bridges this gap by equipping FMLM with masking-style noise schedules. While generating the full sequence in a single step, FMLM+ simultaneously scores the global consistency of each token a posteriori. We leverage this to introduce Posterior Refinement, a novel inference-time refinement strategy that enables the model to adaptively self-correct its outputs, matching the performance of discrete baselines with 32x fewer NFEs. Across diverse benchmarks, we demonstrate that FMLM+ with Posterior Refinement improves the speed--quality tradeoff over both MDM and FMLM families, providing a scalable foundation for high-fidelity language modeling.

8. 【2606.24758】CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder

Link: https://arxiv.org/abs/2606.24758

Authors: Faris Alasmary, Taif Nono, Orjuwan Zaafarani, Kholood Al Tabash, Ahmad Ghannam, Anas Salamah, Shouq Sadah, Lahouari Ghouti

Categories: Computation and Language (cs.CL)

Keywords: Handling repeated characters, informal character elongation, Handling repeated, social media posts, Connectionist Temporal Classification

Comments:

Click to view abstract

Abstract:Handling repeated characters in text can be tricky, since they can represent either the correct spelling of a word or informal character elongation often seen in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers. At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as 5.37% and consistently outperforms a classification-based baseline by a large margin. To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, achieving a 3× depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to 12.8% across a diverse set of Arabic LLM tokenizers, directly lowering inference costs and improving context window utilization. We release all code and models publicly to support reproducibility and advance future research (this https URL).
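
As a rough illustration of the tokenizer-fertility figure quoted in this abstract, the sketch below assumes that fertility means the average number of tokens per whitespace-separated word, which is a common definition but is not confirmed by the abstract. The `toy_tokenize` function is a hypothetical stand-in for a real subword tokenizer, and the English strings are made-up examples rather than data from the paper.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def relative_reduction(before, after):
    """Relative reduction of `after` with respect to `before`."""
    return (before - after) / before


def toy_tokenize(text, width=3):
    """Hypothetical tokenizer: cuts each word into chunks of at most `width` characters."""
    return [w[i:i + width] for w in text.split() for i in range(0, len(w), width)]


# Made-up example: elongated text versus its deduplicated form.
noisy = ["sooooo goooood"]
clean = ["so good"]

f_noisy = fertility(noisy, toy_tokenize)
f_clean = fertility(clean, toy_tokenize)
reduction = relative_reduction(f_noisy, f_clean)
print(f_noisy, f_clean, reduction)
```

With a real tokenizer, only the `tokenize` argument would change; the ratio itself is computed the same way.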

9. 【2606.24734】Task Decomposition for Efficient Annotation

Link: https://arxiv.org/abs/2606.24734

Authors: Nupoor Gandhi, Emma Strubell

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: High-quality annotations, annotation, large corpora, collect over large, inferential load

Comments:

Click to view abstract

Abstract:High-quality annotations of structured representations are expensive to collect over large corpora. Manual annotation of structure is laborious, and model-based annotation, although cheaper to generate, requires expensive validation and potentially significant supervision to ensure that the annotation quality is strong enough to be useful downstream. In traditional annotation workflows, annotation of each complete example is performed end-to-end by a single annotator. However, structured annotation is complex, and each aspect of the task represents a unique challenge with an associated inferential load for a given annotator. Modern annotation projects can incorporate heterogeneous groups of annotators, including both models and human annotators with varying domain and linguistic expertise. It remains unclear, however, how to redesign annotation tasks in this setting, where efforts are discriminately allocated across heterogeneous annotators with respect to distinct annotation challenges. We propose to decompose annotation tasks into sub-tasks in order to reduce the aggregate inferential load of annotation projects. Inspired by the notion of centers from centering theory, we introduce a formal model of inferential load based on the degrees of freedom in the space of valid annotations. Using this model, we show that identifying these centers (i.e. salient anchor entities realized by annotation sub-tasks) constrains the output space complexity, and decompositions which isolate and advance center identification reduce the aggregate inferential load. We provide guidelines for decomposing complex structured annotation tasks, supported by examples demonstrating improved cost-efficiency from our prior work. Finally, we present a procedure for allocating sub-tasks across annotators to maximize quality under a fixed budget.

10. 【2606.24714】CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

Link: https://arxiv.org/abs/2606.24714

Authors: Shijun Luo

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: English abbreviations, unit symbols, hyphenated model, dense written forms, English

Comments: 5 pages, 1 figure, 8 tables. ICASSP-style preprint

Click to view abstract

Abstract:Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening workflows, and a text-to-speech (TTS) system can preserve the written string while changing the spoken meaning. We introduce CN-NewsTTS Bench v0.1, an open target-level benchmark for evaluating whether Chinese news TTS products pronounce such targets correctly from raw text, without user-side rules, LLM rewriting, SSML hints, or manual edits. The release contains a 200-record development set, an 800-record public test set, 992 public auto-evaluable targets, fixed transcripts from a three-ASR ensemble, an automatic target scorer, and initial results for seven product TTS systems. We additionally report ASR-route diagnostics, ASR-subset ablations, category-level results, confidence intervals, and provider configuration metadata. The best system reaches 0.879 strict accuracy, while several systems remain below 0.60.

11. 【2606.24667】DREAM: Dense Retrieval Embeddings via Autoregressive Modeling

Link: https://arxiv.org/abs/2606.24667

Authors: Yixuan Tang, Yi Yang

Categories: Computation and Language (cs.CL)

Keywords: Dense retrieval, retrieval-based AI systems, Dense retrieval embedding, Dense Retrieval Embeddings, fundamental component

Comments:

Click to view abstract

Abstract:Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative document pairs that are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can provide supervision for dense retrieval. The intuition is simple: if a document contains information relevant to a query, conditioning on that document should make the target output easier for the LLM to predict. A key challenge is that the next-token prediction loss is computed inside the LLM, while the retriever is a separate embedding model. To address this challenge, we propose DREAM (Dense Retrieval Embeddings via Autoregressive Modeling), which injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM. During training, these scores determine how much attention each candidate document receives while the LLM predicts the target output. The resulting prediction loss provides gradients for retriever training through the attention mechanism. We evaluate DREAM on retrieval benchmarks BEIR and RTEB using embedding backbones ranging from 0.5B to 3B parameters. DREAM consistently outperforms existing baselines across different model scales. These results demonstrate that DREAM provides a promising approach for training dense retrievers through autoregressive modeling.

12. 【2606.24655】AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach

Link: https://arxiv.org/abs/2606.24655

Authors: Murilo Gazzola, Hugo Gobato Souto, Samuel Silva, Júlia Schubert Peixoto, Felipe Siqueira, André Luis Pedroso de Morais, Caio Gomes

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)

Keywords: structured information extraction, landscape demand robust, e-commerce landscape demand, dynamic Brazilian e-commerce, Brazilian e-commerce landscape

Comments:

Click to view abstract

Abstract:The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVE-Br, a specialized system engineered with Large Language Models (LLMs) to perform high-accuracy PAVE specifically for Brazilian e-commerce catalogs. Second, to facilitate reproducible research and provide a definitive benchmark, we introduce and share the Golden Set, a new, meticulously curated, and manually annotated dataset for PAVE in Portuguese. We detail the creation process and structure (Entity, Category, Subcategories) of this high-quality reference set. Our experiments conclusively show that AI-PAVE-Br, leveraging targeted prompt engineering, dramatically outperforms conventional Named Entity Recognition (NER) baselines. This work not only delivers a superior, scalable solution for a major non-English market but also enriches the NLP community with a valuable, publicly available resource for future PAVE research.

13. 【2606.24650】Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

Link: https://arxiv.org/abs/2606.24650

Authors: Petr Nyoma

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: hierarchical state space, state space model, present Harmonic, SSM, hierarchical state

Comments: 12 pages, 8 figures. NeurIPS 2024 format

Click to view abstract

Abstract:We present Harmonic, a hierarchical state space model (SSM) for language modeling. The architecture stacks three recurrent levels at progressively slower timescales; each level receives the prediction error of the level below as input, rather than its raw hidden state. On enwiki8 with equal token budgets, Harmonic outperforms a comparable Transformer (28M params) by +1.4% at 1K tokens, +6.7% at 8K tokens, and +11.4% at 32K tokens (bpt, lower is better). It also outperforms Mamba at every tested length by 0.7--1.8%. At 64K tokens, both Mamba and Transformer run out of memory on an 80GB H100; Harmonic trains successfully, reaching 6.169 bpt. Results replicate on WikiText-103 (H-TF gap +1.7% to +7.2% across 1K--32K). At 1B parameter scale, replacing all attention layers in TinyLlama 1.1B with HarmonicBlock eliminates the RoPE positional encoding limit: the resulting Hallamonic model maintains stable loss across sequence lengths 1K--8K on two independent clean benchmarks (Lambada and fineweb-edu held-out), while TinyLlama degrades catastrophically past its 2K-token RoPE limit (gap: +9.4 bpt at seq=8K on Lambada). Compute is O(L) per forward pass vs. O(L^2) for attention. Logs: this https URL.

14. 【2606.24648】ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

Link: https://arxiv.org/abs/2606.24648

Authors: Jisu Jeon, Seungyeon Jwa, Joosung Lee, Jinhyeon Kim, Woojin Chung, Hwiyeol Jo, Jeonghoon Kim, Jonghyun Choi, Soyoon Kim

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Large Audio-Language Models, Large Audio-Language, Audio-Language Models, Models, judge models

Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPairAudioBench, a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions: Style, Rate, Emphasis, Age, and Gender. Our experiments show that current LALM judges still lag behind human judgments by 32%p on average and exhibit severe calibration failures, particularly in Tie cases where the correct decision is to abstain. To further analyze lexical versus acoustic reliance, the benchmark includes both same-transcript and cross-transcript conditions. ParaPairAudioBench enables multi-dimensional, calibration-aware assessment of the reliability of LALM-as-a-Judge for paralinguistic speech evaluation.

15. 【2606.24644】Measuring User's Mental Models of Speech Translation in Human-AI Collaboration

Link: https://arxiv.org/abs/2606.24644

Authors: HyoJung Han, Nishant Balepur, Jordan Boyd-Graber, Marine Carpuat

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Millions of people, tools daily, people use machine, mental models, Millions

Comments: ACL 2026

Click to view abstract

Abstract:Millions of people use machine translation (MT) tools daily, yet little is known about their perception of what systems can and cannot do. This paper studies users' mental models of speech translation systems through a new framework based on cross-lingual question answering, where users either accept MT output or request professional re-translation to answer questions based on the information presented in a foreign language. By analyzing user behavior and accuracy trends across varying translation qualities, we examine to what extent they can predict where the system is likely to be wrong, and how this mental model evolves. Users develop stronger mental models with practice, especially when they have some knowledge of the source language, primarily by relying on surface-level error cues. Moreover, providing speech transcriptions can help users develop better mental models. Our results show the promise of cross-lingual question answering as a downstream task for studying MT mental models and advancing our understanding of human-AI collaboration.

16. 【2606.24627】The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

Link: https://arxiv.org/abs/2606.24627

Authors: Arka Ujjal Dey, John Collomosse

Categories: Computation and Language (cs.CL)

Keywords: Fact-checking systems built, LLMs achieve high, achieve high verdict, output Supports labels, routinely output Supports

Comments:

Click to view abstract

Abstract:Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect those warrants, but rigid extraction protocols strip the full-claim context that facets need. We introduce SIFT -- claim-conditioned re-scoring of extracted evidence spans against the full claim -- paired with WSP (Warranted Supports Proportion), an automatic NLI check that the cited warrant entails the claim. We evaluate on FEVER, SciFact, 5PILS, and DP across four open-source backbones. SIFT recovers accuracy on cells where naive decomposition costs up to 27.6 points, while raising WSP above direct prompting; WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.
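
One plausible reading of the WSP metric named in this abstract is the share of Supports verdicts whose cited warrant is judged by an NLI model to entail the claim. The sketch below follows that reading only; the exact definition in the paper may differ. The record fields `verdict` and `warrant_entails_claim` are hypothetical names, and the entailment flags are supplied by hand here where a real pipeline would take them from an NLI model.

```python
def warranted_supports_proportion(predictions):
    """Fraction of 'Supports' verdicts whose warrant entails the claim (assumed reading of WSP)."""
    supports = [p for p in predictions if p["verdict"] == "Supports"]
    if not supports:
        # No Supports verdicts: report 0.0 by convention in this sketch.
        return 0.0
    warranted = sum(1 for p in supports if p["warrant_entails_claim"])
    return warranted / len(supports)


# Made-up records; in practice the boolean would be an NLI entailment decision.
records = [
    {"verdict": "Supports", "warrant_entails_claim": True},
    {"verdict": "Supports", "warrant_entails_claim": False},
    {"verdict": "Refutes", "warrant_entails_claim": False},
    {"verdict": "Supports", "warrant_entails_claim": True},
]

wsp = warranted_supports_proportion(records)
print(round(wsp, 3))
```

Under this reading, a system can score high on verdict accuracy while scoring low on WSP, which is the gap the abstract describes.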

17. 【2606.24623】Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity

Link: https://arxiv.org/abs/2606.24623

Authors: Yuanhe Zhao, Tianyu Zhang, Huafei Xing, Derek F. Wong, Jianbin Li, Tao Fang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-Augmented Generation enhances, incorporating external knowledge, Generation enhances large, Retrieval-Augmented Generation, sensitive scenarios risks

Comments: This full manuscript contains 23 pages and has been formally accepted for publication in Information Processing & Management (Elsevier IPM). Tao Fang is the corresponding author

Click to view abstract

Abstract:Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. We evaluate the framework on the ChatDoctor and Wiki-PII datasets across six large language models. Experimental results demonstrate a significant reduction in privacy leakage under targeted attacks. For instance, we reduced targeted information exposure in LLaMA-3-8B from 144 instances in the baseline to just 1. Furthermore, we maintain strong contextual fidelity with a BLEU-1 score of 0.122, outperforming the existing SAGE method's 0.117. Finally, the framework operates as an asynchronous preprocessing module, introducing no additional latency to online inference, as all rewriting is executed as a one-time offline preprocessing step. To promote reproducibility, the source code of this work is publicly available at this https URL.

18. 【2606.24610】Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

Link: https://arxiv.org/abs/2606.24610

Authors: Jory Alshaalan, Haya Albaker, Abeer Aldayel, Aljawharah Alabdullatif, Rehab Alahmadi

Categories: Computation and Language (cs.CL)

Keywords: multiple cultures convey, complex when multiple, multiple cultures, cultures convey, cultural

Comments: This paper is under review

Click to view abstract

Abstract:The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework that integrates a cross-linguistic collection of 414 proverbs spanning 15 languages and uses four LLMs to generate 13k narratives. By employing semantically equivalent proverbs as culturally grounded prompts, the analysis assesses whether models preserve meaning across languages, how cross-lingual conditioning influences narrative realization, and whether different model families converge on similar interpretations. Results indicate that cross-lingual prompting largely preserves proverb-level semantic meaning while systematically redistributing agency, social positioning, and narrative structure. Additionally, strong inter-model convergence is observed in both monolingual and cross-lingual settings, suggesting that multilingual LLMs rely on shared semantic abstractions despite architectural and linguistic differences. These findings shed light on the need for more comprehensive evaluations of cultural grounding. Relying exclusively on semantic similarity in multilingual narrative assessments may overestimate cultural preservation by neglecting culturally meaningful variations in narrative expression.

19. 【2606.24597】Qwen-AgentWorld: Language World Models for General Agents

Link: https://arxiv.org/abs/2606.24597

Authors: Yuxin Zuo, Zikai Xiao, Li Sheng, Fei Huang, Jianhong Tu, Yuxuan Liu, Tianyi Tang, Xiaomeng Hu, Yang Su, Qingfeng Lan, Yantao Liu, Qin Zhu, Yinger Zhang, Bowen Yu, Haiquan Zhao, Haiyang Xu, Jianxin Yang, Jiayang Cheng, Junyang Wang, Lianghao Deng, Mingfeng Xue, Tianyi Bai, Yang Fan, Yubo Ma, Yucheng Li, Zeyu Cui, Zhihai Wang, Zhihui Xie, Zhuorui Ye, An Yang, Dayiheng Liu, Jingren Zhou, Ning Ding

Categories: Computation and Language (cs.CL)

Keywords: core cognitive mechanism, observations and actions, current observations, core cognitive, cognitive mechanism

Comments:

Click to view abstract

Abstract:A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: this https URL

20. 【2606.24596】To Compare, or Not to Compare: On Methodological Practices in Evaluating Social Bias

Link: https://arxiv.org/abs/2606.24596

Authors: Federico Marcuzzi,Xuefei Ning,Roy Schwartz,Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, critical applications, Language Models, increasingly deployed

Comments:

Click to view abstract

Abstract:As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social biases is paramount. However, the current literature suffers from widespread methodological fragmentation, which yields contradictory conclusions. This stems largely from ignoring the structural framing of benchmark-level evaluations. To resolve this, we introduce a unified and controllable framework that standardizes heterogeneous benchmarks to systematically contrast isolated demographic assessments with forced-choice comparative settings. Crucially, this allows us to disentangle the confounding effects of Chain-of-Thought reasoning, neutral fallback options, and other structural artifacts in social bias evaluations. Our evaluation across multiple model families reveals a massive, systematic paradigm gap: while isolated assessments limit prejudice activation, comparative settings act as aggressive catalysts for latent discrimination, a shift primarily driven by underspecified contexts. Alarmingly, CoT reasoning exacerbates social biases under comparative settings, and this systemic bias persists as a deterministic prejudice even when models are provided neutral fallback options or claim to answer randomly. Finally, we demonstrate that this comparative prejudice is a generalized phenomenon that scales positively with model size. Ultimately, we offer a crucial methodological guideline: while researchers must leverage comparative settings to robustly audit hidden biases, practitioners cannot safely rely on comparative deployments in ambiguous real-world tasks.

21. 【2606.24595】MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

Link: https://arxiv.org/abs/2606.24595

Authors: Enze Ma,Yufan Zhou,Wei-Chieh Huang,Jie Yang,Huanhuan Ma,Zixuan Wang,Chengze Li,Chunyu Miao,Philip S. Yu,Zhen Wang

Categories: Computation and Language (cs.CL)

Keywords: promises LLM agents, memory promises LLM, promises LLM, Long-term memory promises, LLM agents

Comments:

Click to view abstract

Abstract:Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which tests that understanding only indirectly and leaves the memory artifact itself largely unaudited. We argue that long-term memory should instead be evaluated as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? We instantiate this view in MEMPROBE, a benchmark in which a memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks, after which that bank is reconstructed from the agent's resulting memory under both full-store and top-k access. Built on synthetic ground truth for efficient, scalable measurement, MEMPROBE spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets) and tests 5 representative memory systems. Testing state-of-the-art memory agents, we find that successful assistance and recoverable memory behave as distinct capabilities. Task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays moderate (about 0.6) and drops further under top-k retrieval. MEMPROBE is the first benchmark to study memory recovery directly, reconstructing the user state a system retains and scoring it against ground truth. We see recovery as a concrete objective for future memory agents to optimize, and MEMPROBE as a step toward an environment where agents are trained to remember their users, growing more faithful the longer they know them.

22. 【2606.24589】AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

Link: https://arxiv.org/abs/2606.24589

Authors: Khanak Khandelwal(Indian Institute of Technology Jodhpur)

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: generating hard inputs, Scaling adversarial evaluation, large language models, language models requires, Scaling adversarial

Comments: 10 pages, 4 figures, 5 tables. Code and data at [this https URL](https://github.com/khanak0509/AdversaBench)

Click to view abstract

Abstract:Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen's kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at this https URL .
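
The third finding above (high pairwise agreement alongside near-zero Cohen's kappa under label skew) can be illustrated with a small worked example. This is only a sketch: the two verdict lists are invented for illustration and are not data from the paper, and `agreement_and_kappa` is a hypothetical helper name.

```python
def agreement_and_kappa(a, b):
    """Return (raw agreement, Cohen's kappa) for two binary label lists."""
    n = len(a)
    # Observed agreement: fraction of items where both judges give the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Marginal rate of label 1 for each judge.
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    # Agreement expected by chance given those marginals.
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_o, kappa


# Hypothetical verdicts (1 = failure confirmed); labels are heavily skewed toward 1.
judge_a = [1] * 17 + [0, 0, 1]
judge_b = [1] * 17 + [1, 1, 0]

p_o, kappa = agreement_and_kappa(judge_a, judge_b)
print(f"agreement={p_o:.2f}, kappa={kappa:.3f}")
```

Because almost every item is labeled 1 by both judges, chance agreement is already about as high as the observed agreement, so kappa lands near zero even though the judges agree on 85% of items.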

23. 【2606.24579】Cross-Lingual Exploration for Parametric Knowledge

Link: https://arxiv.org/abs/2606.24579

Authors: Elisha Diskind,Itamar Trainin,Uri Shaham,Leshem Choshen,Idan Szpektor,Omri Abend

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, equally accessible, Parametric knowledge

Comments: 29 pages, 5 figures, preprint

Click to view abstract

Abstract:Parametric knowledge in Large Language Models is not equally accessible across languages. As a result, standard inference techniques often struggle to surface localized facts, leading to failures in cross-lingual knowledge transfer and consistency. In this work, we investigate techniques for accessing hidden factual knowledge by exploring cross-lingual prompting strategies. We identify four inherent dimensions of cross-lingual exploration that directly govern parametric knowledge retrieval and evaluate them on multilingual factual benchmarks covering 17 typologically diverse languages. Our results demonstrate that cross-lingual exploration significantly improves knowledge transfer and factual recall, representing a more efficient compute Pareto frontier than native-language scaling. Furthermore, we observe corresponding improvements in cross-lingual consistency, exceeding what can be explained by accuracy gains alone. Overall, our work establishes multilingual prompt exploration as a highly effective inference-time strategy for unlocking latent parametric knowledge.

24. 【2606.24530】NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Link: https://arxiv.org/abs/2606.24530

Authors: Yuru Wang,Lejun Cheng,Yuxin Zuo,Sihang Zeng,Bingxiang He,Che Jiang,Junlin Yang,Yuchong Wang,Kaikai Zhao,Weifeng Huang,Kai Tian,Zhenzhao Yuan,Jincheng Zhong,Weizhi Wang,Ning Ding,Bowen Zhou,Kaiyan Zhang

Categories: Computation and Language (cs.CL)

Keywords: peer-reviewed Nature-family publications, Nature-family publications, peer-reviewed Nature-family, designed to evaluate, distilled from peer-reviewed

Comments:

Click to view abstract

Abstract:We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: this https URL

25. 【2606.24526】AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Link: https://arxiv.org/abs/2606.24526

Authors: Honglin Guo,Qi Zhang,Yu Zhang,Weijie Li,Rui Zheng,Zhikai Lei,Qiyuan Peng,Zhiheng Xi,Tao Gui,Qi Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, parametric knowledge, increasingly deployed, Large language, reconciling inconsistent terminology

Comments:

Click to view abstract

Abstract:Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model's context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.

26. 【2606.24523】Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

Link: https://arxiv.org/abs/2606.24523

Authors: Arda Eren,Micheal Cheung,Youqian Zhang,Grace Ngai,Eugene Yujun Fu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: vulnerable communities worldwide, phone calls exploit, calls exploit vulnerable, exploit vulnerable communities, Scam phone calls

Comments: Poster paper accepted at 47th IEEE Security and Privacy 2026

Click to view abstract

Abstract:Scam phone calls exploit vulnerable communities worldwide, yet research on detection has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially difficult, as annotated data is scarce and technological defenses remain limited. This research investigates how large language models (LLMs) can support scam detection in Turkish by introducing the first public multi-modal dataset of 100 aligned audio-transcript pairs of scam and benign conversations. We evaluate seven LLMs spanning three model families: Gemini 2.5 (Flash, Flash-Lite, Pro), GPT-4o, and Qwen (Max, Plus, Turbo), under three input conditions: raw audio, automatic speech-to-text transcripts, and transcripts refined by a native speaker. Our results suggest that transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. By centering a low-resource language and real world threat, this work highlights the urgent need for culturally and linguistically inclusive AI safety research and more robust multi-modal systems for fraud prevention.

27. 【2606.24510】A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

Link: https://arxiv.org/abs/2606.24510

Authors: Haichao Chen,Songchi Zhou,Zhengyun Zhao,Shikai Hu,Xianghong Jin,Hongwei Ji,Li He,Shuli Li,Yiming Qin,Xin Tan,Runfeng Shi,Yih Chung Tham,Jiaye Zhu,Ye Li,Ye Jin,Longhao Cao,Dawei Li,Honghan Wu,Hongqiu Gu,Guanqiao Li,Tudor Groza,Chunying Li,Dian Zeng,Weihong Yu,Gareth Baynam,Saumya Shekhar Jamuar,Min Shen,Shuyang Zhang,Bin Sheng,Sheng Yu,Tien Yin Wong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: health challenge due, diseases affect millions, Rare diseases affect, rare disease diagnosis, specialized clinical expertise

Comments: 36 pages, 5 figures

Click to view abstract

Abstract:Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for rare disease diagnosis. RaDaR was trained with 49,170 publicly available free-text cases and 104,666 synthetic cases with reasoning-enhanced training. RaDaR showed the strongest performance among evaluated open-source models, including the 671B DeepSeek-R1, across public benchmarks and four external validation centers. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06 percent of cases, corresponding to a potential lead time of 1.87 months and 50.18 percent of the within-center interval. In a randomized physician-assistance trial, RaDaR assistance improved physicians' rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone. Synthetic-data ablations suggested that phenotype-anchored narratives provide useful training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range. Together, RaDaR and its development and validation framework provide a deployable rare-disease reasoning model and a reproducible development framework for diagnostic AI under data scarcity.

28. 【2606.24501】UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction

Link: https://arxiv.org/abs/2606.24501

Authors: Nouran Khallaf,Serge Sharoff

Categories: Computation and Language (cs.CL)

Keywords: paper describes UOL, vocabulary difficulty prediction, describes UOL, IDEM closed-track submission, IDEM closed-track

Comments: Published at BEA2026, 21st Workshop on Innovative Use of NLP for Building Educational Applications, at ACL, July 2026, San Diego

Click to view abstract

Abstract:This paper describes UOL@IDEM's closed-track submission to the BEA 2026 shared task on L1-aware vocabulary difficulty prediction. We model the task as regression and train separate systems for Spanish, German, and Mandarin Chinese (referred to below as Chinese for brevity). Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration for the easiest items, which are often overpredicted. See this https URL

29. 【2606.24460】The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

Link: https://arxiv.org/abs/2606.24460

Authors: Olaoye Anthony Somide

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Commercial large language, Commercial large, language models bill, African languages, large language models

Comments: 40 pages, 5 figures, 25 tables

Click to view abstract

Abstract:Commercial large language models bill, scale latency, and budget context per token. Yet tokenizers assign more subword tokens to the same meaning in some languages than in others, so speakers of languages with high token-fertility pay a structural penalty before a model is ever invoked. This penalty is documented for multilingual settings in general, but it has not been measured systematically for African languages at the level of enterprise deployment economics and cognitive context capacity. We measure it across 20 African languages spanning five language families and three scripts (Latin, Ge'ez/Ethiopic, N'Ko; 19 appear in the primary FLORES-200+ corpus, with Nigerian Pidgin measured via MAFAND-MT only), using parallel corpora so that the language effect is isolated from content. Across 11 frontier and open tokenizers on FLORES-200+, every African language carries a tokenization premium above English (median 1.88x on GPT-5 / o200k_base, up to 8.92x for N'Ko); the penalty is largest for Ethiopic and N'Ko scripts (reaching 7-9x) and is near-invariant across corpora (FLORES vs SIB-200 Pearson r = 0.9998). Translated into deployment terms, this results in up to 8.9x inference cost and an equivalent generation-latency multiplier (N'Ko vs English on GPT-5; 7.4x for Amharic), and as little as 11% of English's effective context window. The best currently available tokenizer for African languages, Gemma 4, reduces the mean premium from 3.31x (cl100k_base) to 2.38x, but no tokenizer eliminates the penalty. We release an open measurement tool (afri-fertility), a public leaderboard, a results dataset, and mitigation guidance for African builders. The penalty falls hardest on the languages whose speakers can least afford it, a digital divide encoded directly into the subword vocabulary.
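
The link between the tokenization premium and the deployment figures quoted above can be checked with simple arithmetic. The sketch below assumes cost scales linearly with token count and that effective context is the window divided by the premium; the 128,000-token window, the 1.00 USD baseline, and the `deployment_impact` helper are hypothetical and not taken from the paper. Only the 8.92x premium comes from the abstract.

```python
def deployment_impact(premium, english_cost_usd, context_tokens):
    """Translate a token premium (tokens relative to English) into deployment terms.

    Assumes billing is linear in tokens and the context window is a fixed token budget.
    """
    return {
        # Same content costs `premium` times as much when billed per token.
        "cost_usd": english_cost_usd * premium,
        # Fraction of English's effective context that remains.
        "context_fraction": 1.0 / premium,
        # How many English-equivalent tokens of content fit in the window.
        "english_equivalent_tokens": int(context_tokens / premium),
    }


# 8.92x is the N'Ko premium quoted in the abstract; the other inputs are illustrative.
impact = deployment_impact(premium=8.92, english_cost_usd=1.00, context_tokens=128_000)
print(impact)
```

Under these assumptions an 8.92x premium leaves roughly 11% of the English-equivalent context, which matches the figure reported in the abstract.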

30. 【2606.24459】An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data

Link: https://arxiv.org/abs/2606.24459

Authors: Jinghan Wang,Feng Cheng,Wentao Wu,Hang Li,Gaoliang Peng,Tianchen Liu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: operating condition variations, Bearing fault diagnosis, diagnosis faces critical, faces critical challenges, data occur simultaneously

Comments: Accepted as a conference article of AIM 2026

Click to view abstract

Abstract:Bearing fault diagnosis faces critical challenges when dataset heterogeneity, operating condition variations, and limited labeled data occur simultaneously in industrial environments. Existing approaches address these issues in isolation and rely on implicit feature alignment, limiting effectiveness under concurrent challenges. This paper proposes a knowledge-guided two-stage transfer learning framework that employs a lightweight GPT-2-style Transformer with causal self-attention for hierarchical feature extraction from vibration signals, establishing explicit pathways where pre-trained encoder weights and fault prototype embeddings serve as knowledge carriers from multi-source pre-training to target adaptation. The framework addresses the dual-shift challenge through multi-source learning for generalizable representations, prototype-based knowledge modulation for target adaptation, and taxonomy-adaptive classification for seamless transfer across heterogeneous fault categories. Experimental validation on four real-world datasets demonstrates 92.61% average accuracy with only 10% labeled target data, outperforming state-of-the-art methods by 17.24 percentage points, establishing a practical pathway toward cost-effective predictive maintenance in Industry 4.0 applications.

31. 【2606.24453】Bayesian control for coding agents

Link: https://arxiv.org/abs/2606.24453

Authors: Theodore Papamarkou,Vladislav Smirnov,Viktor Mazanov,Artem Vazhentsev,Preslav Nakov,Timothy Baldwin,Artem Shelmanov

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: agents pair LLM, including cheap diagnostics, Modern coding agents, pair LLM generators, pair LLM

Comments:

Click to view abstract

Abstract:Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and nine coding benchmarks, Bayesian control proves to be most valuable when verification is costly and critics are informative but imperfect. Beyond control, the belief state yields an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.
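
The idea of a controller that keeps a belief over candidate correctness and acts on it can be sketched as a Bayes update followed by a threshold policy. This is a minimal illustration, not the paper's controller: the critic's pass rates (`tpr`, `fpr`), the fixed thresholds, the action names, and the assumption that critic signals are conditionally independent are all hypothetical. A cost-sensitive controller would derive its thresholds from tool and error costs rather than hard-coding them.

```python
def update_belief(prior, critic_passed, tpr, fpr):
    """Bayes update of P(candidate is correct) after one critic signal.

    tpr = P(critic passes | correct), fpr = P(critic passes | incorrect).
    """
    likelihood_correct = tpr if critic_passed else 1.0 - tpr
    likelihood_wrong = fpr if critic_passed else 1.0 - fpr
    numerator = likelihood_correct * prior
    return numerator / (numerator + likelihood_wrong * (1.0 - prior))


def choose_action(belief, accept_at=0.95, refine_at=0.20):
    """Illustrative threshold policy standing in for a cost-sensitive decision rule."""
    if belief >= accept_at:
        return "stop"
    if belief <= refine_at:
        return "refine"
    return "gather_evidence"


belief = 0.5  # uninformative prior over correctness
trace = []
for passed in [True, True, True]:  # three passing signals from an imperfect critic
    belief = update_belief(belief, passed, tpr=0.9, fpr=0.3)
    trace.append((round(belief, 3), choose_action(belief)))
print(trace)
```

With an informative but imperfect critic, a single pass is not enough to stop; the belief only crosses the acceptance threshold after repeated consistent evidence.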

32. 【2606.24428】Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

Link: https://arxiv.org/abs/2606.24428

Authors: Shiding Zhu,Yudi Qi,Yajie Wang,Jiaze Li,Chao Song,Yaorui Shi,Yibo Miao,Hanqi Gao,Kai Zhang

Categories: Computation and Language (cs.CL)

Keywords: large language model, language model, open-world interaction, Experience-driven self-evolution, critical for large

Comments: 28 pages, 11 figures

Click to view abstract

Abstract:Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent executes tasks, summarizes outcomes, and determines memory content. This setup makes agents vulnerable to the Self-Confirmation Trap: wrong-but-self-consistent trajectories are misidentified as successful experience, leading to cumulative errors during retrieval and reuse. To address this issue, we propose EDV, an Execute-Distill-Verify framework for reliable experience learning. In the Execute stage, multiple heterogeneous agents explore the same task space in parallel to generate diverse candidate trajectories. In the Distill stage, a dedicated third-party agent comparatively analyzes these trajectories to produce candidate experiences, reducing executor-centric summarization bias. In the Verify stage, the execution group validates candidates via a consensus mechanism, and only approved experiences are written into shared or private memory. By decoupling the three stages, EDV transforms experience learning from isolated self-reflection into collaborative construction, filtering erroneous and noisy content before memory insertion. We evaluate EDV on three challenging long-horizon benchmarks: tau2-bench, Mind2Web and MMTB. Results show EDV consistently outperforms strong baselines, validating that reliable experience construction is essential for robust agent self-evolution. Our code is available at this https URL.

33. 【2606.24420】Beyond Logprobs: A Multi-Signal Confidence Engine for LLM-Based Document Field Extraction

Link: https://arxiv.org/abs/2606.24420

Authors: Nitesh Kumar

Categories: Computation and Language (cs.CL)

Keywords: including financial reconciliation, document processing pipelines, high-stakes document processing, compliance verification, LLM extraction

Comments: Extended version of a paper accepted (Oral) at the RobustifAI Workshop, IJCAI-ECAI 2026, Bremen, Germany. 9 pages, 5 figures, 2 tables

Click to view abstract

Abstract:In high-stakes document processing pipelines, including financial reconciliation, compliance verification, and procurement automation, an LLM extraction that is silently wrong is more dangerous than one that is visibly absent. The central challenge is not extraction accuracy alone but reliable confidence estimation: knowing, field by field, whether an extraction can be trusted for automation or deferred to human review. Token-level log-probabilities, verbalized confidence, and multi-sample self-consistency all collapse toward all-positive behaviour at practical thresholds, offering no reliable separation between trustworthy and untrustworthy extractions. We present ExtractConf, a cross-domain, field-agnostic confidence engine that grounds confidence estimation in two structurally different readings of the same document. A field-guided Hunter call extracts each field under schema-slot completion pressure; a document-guided Mapper call scans holistically and surfaces values grounded in document content. This asymmetry yields different failure modes: Hunter hallucinates values for absent fields, while Mapper misses visually non-salient ones. Their disagreement is independently informative. ExtractConf fuses cross-call disagreement, LLM-internal uncertainty, OCR, image quality, and spatial layout into a classifier requiring no domain-specific rules or retraining. On DocILE (55-field invoices, 26% failure rate), it achieves 0.928 ROC AUC and reduces selective prediction risk by 70% over logprob-mean. At 80% coverage, accuracy reaches 99.1%, enabling a practical human-in-the-loop workflow. Zero-shot transfer to CORD receipts achieves 0.858 AUC; lightweight Lasso recalibration reduces ECE by 89% and Brier by 43%, confirming the signals generalise across document domains.

34. 【2606.24391】Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

Link: https://arxiv.org/abs/2606.24391

Authors: Arnaud Ricci

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)

Keywords: introduce Age, grid to destroy, enemy base, destroy the enemy, strict JSON schema

Comments: 25 pages including appendices, 8 figures, 4 tables; appendices include verbatim system prompt and engine resolution pseudocode. All correlations reported with p-values, 95% bootstrap confidence intervals and Spearman's rho; includes a Steiger test and Bradley-Terry fit

Click to view abstract

Abstract:We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection; see Section 2.7). We benchmark 15 reasoning models across 54 matches and 5,258 actions. Findings: (1) the nuclear rush dominates (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature that is largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure; (2) military conquest is rare but faster (12.3 vs 18.9 turns); (3) diplomacy is prolific yet almost never consummated; (4) ~58% of illegal actions are fog/state errors, making the illegal-action rate a measure of belief-tracking; (5) -- the least established, and the only one we label exploratory -- a weak link associates reliability with winning. The corpus is small, unbalanced and not side-swapped, so the ranking is a preliminary descriptive view, not a contribution. Beyond ranking, the turn-by-turn traces of actions and messages make the corpus a lens on how LLMs reason under adversarial uncertainty -- their belief-tracking, spontaneous deception, and per-model cognitive "personas" -- which we frame as a future research direction. We release the replay format, an isometric viewer and all replays; engine source on request.

35. 【2606.24387】AutoSpecNER: A Fine-Grained Named Entity Recognition Dataset for Vehicle Specification Extraction

Link: https://arxiv.org/abs/2606.24387

Authors: Jordan Lee,Filippos Ventirozos,Abdirahman Abdullahm,Ioanna Nteka,Peter Appleby,Matthew Shardlow

Categories: Computation and Language (cs.CL)

Keywords: automotive NER resources, NER resources remain, rich specification information, resources remain limited, automotive NER

Comments: 13 pages, 2 figures, 7 tables, Pre-print

Click to view abstract

Abstract:Vehicle advertisements contain rich specification information, but automotive NER resources remain limited. We introduce AutoSpecNER, an expert-annotated dataset for fine-grained entity recognition in vehicle listings. The dataset includes 659 advertisements from a popular car-selling website, with over 10,000 entities annotated across 15 categories, including MODEL, ENGINE_SPEC, and BATTERY_CAPACITY. Annotation quality was validated through inter-annotator agreement, achieving an average score of 91.5%. We benchmark rule-based extraction, fine-tuned transformer encoders, and large language models. DeBERTa achieves the best performance with a 90% micro-F1 score, outperforming the rule-based baseline (43%) and the strongest large language model (77.8%).

36. 【2606.24381】On the Stability of Prompt Ranking in Large Language Model Evaluation

Link: https://arxiv.org/abs/2606.24381

Authors: Shaoshuai Du,Penghao Liang,Yixian Shen,Chuanqi Shi,Hang Zhang,Lun Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, Prompt-based interaction, multiple candidate prompts, language models, dominant paradigm

Comments:

Click to view abstract

Abstract:Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. This workflow implicitly assumes that prompt rankings are stable under minor variations in evaluation conditions. In this paper, we systematically study prompt ranking stability under common sources of variability, including random seeds and limited evaluation subsets. Across three open-weight LLMs and two benchmark tasks, we find that while overall rank correlations are often moderate to high, the identity of the top-performing prompt frequently changes, leading to unreliable selection decisions. To address this issue, we propose a simple stability-aware selection strategy based on a lower confidence bound, which accounts for both performance and variance. Our results show that this approach improves robustness in unstable settings while remaining competitive in more stable regimes. These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.
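
A stability-aware selection rule based on a lower confidence bound can be sketched as follows. The abstract does not give the exact formula, so the bound used here (mean minus `k` standard errors), the helper names, and the accuracy values are assumptions made for illustration only.

```python
import math
import statistics


def lcb(scores, k=1.0):
    """Lower confidence bound: mean minus k standard errors of the mean."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - k * sem


def select_prompt(results, k=1.0):
    """Pick the prompt with the highest lower confidence bound across runs."""
    return max(results, key=lambda name: lcb(results[name], k))


# Hypothetical accuracies over four seeds or evaluation subsets.
results = {
    "prompt_a": [0.80, 0.60, 0.85, 0.55],  # higher mean, high variance
    "prompt_b": [0.69, 0.70, 0.68, 0.69],  # slightly lower mean, stable
}

print(select_prompt(results, k=0.0))  # k=0 reduces to plain mean-based ranking
print(select_prompt(results, k=1.0))  # penalizing variance favors the stable prompt
```

Setting `k` to zero recovers ordinary top-mean selection, so the parameter controls how strongly instability is penalized relative to average performance.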

37. 【2606.24379】ComputeFHE: A Privacy-Preserving General-Purpose Computation Library

Link: https://arxiv.org/abs/2606.24379

Authors: Faris Serdar Tasel,Efe Ciftci

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Fully Homomorphic Encryption, Fully Homomorphic, Homomorphic Encryption, preserving data confidentiality, performed directly

Comments: 16 pages, 3 figures

Click to view abstract

Abstract:Fully Homomorphic Encryption (FHE) enables computations to be performed directly on encrypted data while preserving data confidentiality. However, its practical applications remain limited by high computational costs and development complexity. This paper presents ComputeFHE, an open-source C++ library that facilitates the development of privacy-preserving applications based on the TFHE cryptosystem. The library provides encrypted integer and fixed-point data types together with arithmetic, logical, comparison, conditional, and oblivious array-access operations which allow developers to implement algorithms using a familiar imperative programming paradigm. ComputeFHE supports both conventional TFHE arithmetic based on standard two-input logic gates and an optimized Arithmetic Logic Unit (ALU) architecture utilizing FHE-friendly logic primitives. Experimental results demonstrate significant reductions in the number of required bootstrapping operations, achieving performance improvements of up to 3.9x for selected operations. In addition, the library includes a simulation mode that enables testing, debugging, and complexity analysis without performing actual cryptographic computations while providing circuit complexity and bootstrapping costs. Built on top of OpenFHE, ComputeFHE offers a practical and accessible framework for developing and evaluating privacy-preserving algorithms and applications.

38. 【2606.24366】MorfFlex: Handling Rich Morphology

Link: https://arxiv.org/abs/2606.24366

Authors: Jaroslava Hlaváčová, Marie Mikulová, Barbora Štěpánková, Milan Straka, Jan Hajič

Categories: Computation and Language (cs.CL)

Keywords: dictionary architecture suitable, inflection and derivation, morphological dictionary architecture, architecture suitable, extensive regularity

Comments: Accepted to LREC 2026

Abstract:We present MorfFlex, a morphological dictionary architecture suitable for languages with extensive regularity in both inflection and derivation. As the primary example of MorfFlex in use we introduce MorfFlex CZ, a morphological dictionary of Czech. It is distributed as a simple, unstructured list of wordform, lemma, tag triplets, however, its manually maintained, unpublished source files and conversion scripts encode a sophisticated system of inflectional and derivational patterns. These patterns dramatically reduce the otherwise enormous size of the dictionary, which currently contains over 100 million wordforms and more than 1 million lemmas. The MorfFlex CZ dictionary serves as an essential resource for ensuring the consistency of manual morphological annotation in the Prague Dependency Treebanks and underpins state-of-the-art automatic tools such as MorphoDiTa. In this paper, we focus on: (i) presenting an effective method for managing the rich morphological system within the dictionary, and (ii) demonstrating the utility of such a language resource for maintaining annotation consistency in corpora and supporting the development of advanced NLP applications.

39. 【2606.24359】Automatic Part-of-Speech Tagging of Arabic-English Dictionary Senses through WordNet

Link: https://arxiv.org/abs/2606.24359

Authors: Diaa M. Fayed, Aly A. Fahmy, Mohsen A. Rashwan, Wafaa K. Fayed

Categories: Computation and Language (cs.CL)

Keywords: English POS tags, bilingual dictionary, paper proposed, bilingual dictionary senses, POS tags

Comments: 10 pages, 3 figures, 5 tables, Published in Proceedings of the 15th Conference on Language Engineering, Egyptian Society of Language Engineering (ESOLE'15), Dec., 2015

Abstract:This paper proposed an algorithm for part-of-speech (POS) tagging senses of a bilingual dictionary. The algorithm is applied on the Al-Mawrid Arabic-English dictionary. The tagging task is accomplished by transferring the POS tags of the English translation equivalences (TEs) to the dictionary senses after dis-ambiguities process. The English POS tags of senses are acquired from the Princeton WordNet. POS tagging of bilingual dictionary senses is prerequisite to link a bilingual dictionary to WordNet and/or standardizing that dictionary into WordNet-LMF format where the synset (set of synonyms), not word, is the basic brick. The registered accuracy is high though the cost is little. Building NLP/HLT tools needs linguistic experts, large investments, and long time. For statistical approach, we need large annotated corpora and for rule-based approach, we need large lexicon that contains rich linguistic and world knowledge. That motivates the appearance of what are called resource-light approaches to develop natural language processing (NLP) tools for poor-resource languages.

40. 【2606.24346】PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation

Link: https://arxiv.org/abs/2606.24346

Authors: Kirill Dubovikov(1), Omar El Mansouri(1), Hachem Madmoun(1), Yanda Li(1), Sandeep Kumar(1), Aya El Mir(1), Supriyo Ghosh(2), Writabrata Bhattacharya(2), Adrian Garcia-Garcia(2), Onkar Pandit(2), Sunil Kumar Sahu(2), Federico Castanedo(2), Larry Murray(2), Martin Takac(1), Salem Lahlou(1) ((1) Mohamed bin Zayed University of Artificial Intelligence, (2) Inception AI)

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Petroleum-engineering search exposes, strong general retrievers, relevant evidence exists, Petroleum Engineering Text, Petroleum-engineering search

Comments:

Abstract:Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale Petroleum Engineering Text for Retrieval Adaptation dataset and pipeline that converts noisy public web data into a curated domain corpus and synthetic supervision for dense retrieval and reranking. PETRA contains 1.36M curated chunks, approximately 2B token equivalents, $\approx$859k embedding training rows from $\approx$224k anchors, and roughly 400k teacher-scored reranker candidate rows. Its construction combines high-recall energy-domain curation, an energy-domain classifier with 98.4% test accuracy, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists. PETRA improves first-stage in-domain Normalized Discounted Cumulative Gain (nDCG) from 0.703 to 0.763 through score fusion. Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%. Failed training recipes show that high train-holdout accuracy on synthetic labels does not predict retrieval gains; retrieval-mined data helps only after being repackaged as teacher-scored candidate lists sampled from the inference-time candidate distribution.

41. 【2606.24337】Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies

Link: https://arxiv.org/abs/2606.24337

Authors: Marie Mikulová, Barbora Štěpánková, Daniel Zeman, Jan Štěpánek, Milan Straka, Jan Hajič

Categories: Computation and Language (cs.CL)

Keywords: Prague Dependency Treebank, Prague Dependency, Prague Dependency Treebank-Consolidated, Universal Dependencies, Dependency Treebank

Comments: Accepted to LREC 2026

Abstract:Czech has been part of Universal Dependencies since its first release in 2015. It has also been one of the best represented languages, with the Prague Dependency Treebank being order of magnitude larger than most other UD treebanks. More recently, three other datasets from the Prague family were added and the annotations thoroughly revisited, forming the "Prague Dependency Treebank-Consolidated" (PDT-C). In comparison to the original PDT, PDT-C is more than twice as large, but it is also much more diverse in terms of genres and domains. In this paper, we describe the conversion of the new resource to Universal Dependencies. While the two annotation schemes are relatively similar at the first sight, there are numerous small differences in topology of the dependency structures and in granularity of the POS and relation type inventories. We demonstrate a selection of such differences on examples, discuss the diverging motivations, as well as ways to overcome the differences during conversion. We argue that while PDT is less "universal" and more tightly bound to one language, its multi-layer annotation is rich and provides all information needed for basic UD trees, and much more.

42. 【2606.24331】Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment

Link: https://arxiv.org/abs/2606.24331

Authors: Guruprakash J, Krithika L.B

Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)

Keywords: natural language processing, Transformer-based language models, separate durable ideas, Transformer-based language, natural language

Comments:

Abstract:Transformer-based language models have become the default substrate for natural language processing and the pace of new releases has made it hard for practitioners to separate durable ideas from the noise of incremental announcements. This review works at two levels. At the level of mechanism, we organise the main transformer families into a working taxonomy, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. We then extend the discussion to post-2023 developments that changed the picture in practice: instruction tuning, reinforcement learning from human feedback, direct preference optimisation, mixture-of-experts scaling, retrieval augmentation and the current flagship model families from OpenAI, Anthropic, Google, Meta, Mistral and DeepSeek. At the level of use, we survey deployments across healthcare, finance, legal, education, customer service, creative writing and scientific work. Based on this we link each to the specific capabilities that make a transformer the appropriate tool. The contribution of this paper is a critical assessment that is based on the survey. We compare architectures on four axes that matter to deployment decisions, we quantify the trade-off between parameter count and energy cost. We also discuss how alignment methods, data provenance and benchmark saturation change what it means to call a model "state of the art". The final section lists the research questions that we think deserve more attention.

43. 【2606.24324】Prague Dependency Treebank -- Consolidated 2.0: Enriching a Complex Annotation Scheme

Link: https://arxiv.org/abs/2606.24324

Authors: Marie Mikulová, Jiří Mírovský, Milan Straka, Pavlína Synková, Jan Štěpánek, Barbora Štěpánková, Jan Hajič

Categories: Computation and Language (cs.CL)

Keywords: Prague Dependency Treebank, Dependency Treebank framework, Prague Dependency, Dependency Treebank, Treebank framework

Comments: Accepted to LREC 2026

Abstract:The Prague Dependency Treebank framework is unique in its attempt to systematically include and link different layers of language, including a meaning representation with several types of inter-sentential phenomena, especially coreference and discourse relations. We present its second consolidated version (PDT-C 2.0), which concludes almost 30-years long project of sustained development of the resource to a uniformly and coherently annotated, genre-diversified, almost 4 million token language resource of Czech language, with accompanying fully compatible lexicons. In addition to continuous linguistic research, the richly linguistically annotated corpus is also widely used in international comparisons of the development of traditional and novel NLP tools as well as in conversions into other formalisms. The corpus and the trained parsers are available under the CC BY-NC-SA licence.

44. 【2606.24286】AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

Link: https://arxiv.org/abs/2606.24286

Authors: Yijing Chen, Wenhui Tan, Xiaoyi Yu, Yuyue Wang, Xin Cheng, Kaisi Guan, Hao Jiang, Xiangyang Li, Guojie Zhu, Ruihua Song

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Omni-modal Large Language, Multimodal Large Language, achieved remarkable progress, comprehension remains challenged

Comments:

Abstract:Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To address these bottlenecks, we propose AVOC, a framework for long-form audio-video understanding in Omni-modal Large Language Models. AVOC introduces a learnable token compression module between the modality encoders and the LLM backbone. We reframe multimodal token compression as a top-$K$ retrieval problem: given a fixed context budget, the module must retrieve a compact subset of tokens that best supports answering the user query. We draw inspiration from three classical Information Retrieval criteria for selecting informative units from a large candidate pool: relevance, importance, and diversity. AVOC instantiates each criterion as a tailored mechanism for audio-video understanding, and integrates them into a unified retrieval-style compression pipeline. Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively. Moreover, AVOC maintains robust performance on Audio-Video Needle-in-a-Haystack task at durations up to one hour.

45. 【2606.24281】CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

Link: https://arxiv.org/abs/2606.24281

Authors: Conor Finlay, Joshua Kurien, Saurabh Dash, Marzieh Fadaee, Beyza Ermis

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: answer difficult questions, Reasoning language models, difficult questions, supervise confidence estimates, increasingly asked

Comments:

Abstract:Reasoning language models are increasingly asked not only to answer difficult questions, but also to estimate their likelihood of success. Existing methods typically elicit confidence only once: either before thinking or after answering. We argue that confidence in reasoning models is state-dependent: before thinking, confidence should estimate the chance of the model correctly solving the prompt, while after thinking it should predict whether the realized answer is likely to be correct. This distinction determines the appropriate supervision target: prompt-level success should supervise confidence estimates made after seeing the prompt, while individual answer-level correctness should supervise confidence estimates made after answering. We introduce CALIBER (Calibration Before and After Reasoning), which elicits both estimates and supervises each with the target matched to its information state. Under this unified protocol, CALIBER reduces Expected Calibration Error (ECE) by 52.5% over the strongest single-confidence baseline on BigMathDigits for the 7B model, while achieving the best Brier score and AUROC, and remains within 2.1 points of the best accuracy. Further, on a larger 30B model, CALIBER achieves the best ECE on BigMathDigits while remaining competitive in Brier score and AUROC. Out of distribution, it achieves the best ECE and Brier score on GPQA and TriviaQA, and remains competitive on SimpleQA. Ablations further show that this position-target alignment is most beneficial under distribution shift where it consistently reduces calibration error across all out-of-distribution benchmarks.
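
The abstract above reports results in terms of Expected Calibration Error (ECE). For readers unfamiliar with the metric, here is a minimal sketch of the common equal-width-bin formulation. The bin count, the function name, and the sample data are assumptions for illustration; the paper may use a different binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted probabilities of being correct.
    correct: matching booleans (or 0/1) marking whether each answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical model that claims 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(round(overconfident, 3))
```

A lower value means stated confidence tracks observed accuracy more closely; a perfectly calibrated set of predictions scores 0.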

46. 【2606.24267】Pigeonholing: Bad prompts hurt models to collapse and make mistakes

Link: https://arxiv.org/abs/2606.24267

Authors: Hyunji Nam, Keertana Chidambaram, Dorottya Demszky, Natasha Jaques

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, effective in Large, Language Models, phenomenon we call

Comments:

Abstract:While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." Unintentionally bad contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution, and (2) when the conversation context includes the assistant's previous (incorrect) responses. Our experiments across 10 verifiable and open-ended tasks with 10 different models show that pigeonholing manifests in several ways: (1) repeating the incorrect answers from context (leading to 38-40% performance drop), (2) converging on a narrow set of answers in coding and text generation without exploring alternatives, and (3) flipping stance on controversial topics to align with the user or the assistant's previous claims. We find that pigeonholing worsens almost monotonically with the number of conversation turns (performance drops by additional 14+% as repeated mistakes increase from 1 to 5), and pigeonholing-induced mode collapse can happen even when the provided example is correct. As a step toward mitigation, we propose RLVR with synthetic errors which improves models by 43-60% under bad contexts compared to vanilla RLVR baselines.

47. 【2606.24259】SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization

Link: https://arxiv.org/abs/2606.24259

Authors: Noor Islam S. Mohammad, Ulug Bayazit

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Fine-tuned encoders deployed, mismatched inductive biases, NLP tasks face, heterogeneous NLP tasks, external lexical knowledge

Comments: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), ACL 2026, San Diego, California, USA. Available at [this https URL](https://openreview.net/forum?id=WJCalficPT)

Abstract:Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical knowledge. We introduce SURGELLM, a unified transformer framework that addresses each with a dedicated lightweight module: a surgical feature gate (learned per-dimension sigmoid over curated lexical indicators and [CLS]; provably degenerates to identity when features are uninformative), task-conditioned prefix tokens (quantized feature values and task identity prepended to every input), and Instance-Weighted Normalization (IWN; removes class-prior bias from gate statistics). We prove an excess-risk bound linking gate benefit to surgical feature alignment. Across four tasks, SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection, covering 17,830 examples and eleven model variants over three seeds, the IWN variant achieves macro-F1 0.940 (+0.036 over the strongest non-IWN baseline; +0.130 on authorship detection). A random-vocabulary control (-0.028 avg. F1) confirms gains are lexical, not parametric. Code, vocabularies, and a 99.5%-recovery auto-extraction recipe are released.

48. 【2606.24219】Decoherence as Defence and the Magnitude of Noise Regularisation: A Rigorous N-Qubit Theory of Stochastic Quantum Neural Networks for Adversarially Robust Network Intrusion Detection

Link: https://arxiv.org/abs/2606.24219

Authors: Gautier-Edouard Edouard Filardo(CREOGN)

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: encode neuronal activations, quantum neural networks, Lindblad master equation, Stochastic quantum neural, neural networks

Comments:

Abstract:Stochastic quantum neural networks (SQNNs) encode neuronal activations as qubits, synaptic topology as entanglement, and neural noise through a Lindblad master equation. A recent conference study applied a ring-entangled SQNN to collaborative intrusion detection and reached three conclusions: ring entanglement is \emph{essential} for non-local anomaly detection; an adversarial-resilience bound holds but is \emph{conservative}; and the depolarising channel \emph{fails} to act as a dropout-style regulariser, behaving instead as output noise. It left open whether a per-gate stochastic deactivation (``true quantum dropout'') could regularise where the depolarising channel could not, and whether the loose robustness bound could be replaced by a predictive theory. This paper resolves both and extends the framework to real data and to neutral-atom hardware. We give an $N$-qubit formulation through the stochastic master equation and its vectorised Liouvillian, and prove a \emph{decoherence-contraction theorem}: a depolarising channel of strength $\gamma$ over $L$ entangling layers contracts every weight-$w$ Pauli read-out by a factor $(1-4\gamma/3)^{wL}$ (for the weight-$1$ read-out used here, $(1-4\gamma/3)^{L}$); building on the general noise-as-defence result of Du et al., we make this quantitative and operational for intrusion detection. On the real NSL-KDD dataset under white-box FGSM and PGD attacks, a depolarising SQNN trained with the channel is, over seven seeds under strong $\ell_\infty$/$\ell_2$ attacks, significantly more robust than the noiseless circuit ($\ell_\infty$ PGD-$20$, $p=0.04$, large effect) and, critically, never suffers the catastrophic robustness collapse that the noiseless model and gradient-trained classical detectors (which fall from $95\%$ to $47\%$) do, cutting robustness variance roughly twofold; we show this robustness arises from a noise-reshaped training boundary rather than from attack-time gradient contraction. 
For generalisation, we derive an adaptive-penalty formula showing that per-gate dropout implements a curvature-weighted $L_2$ penalty $\tfrac{p(1-p)}{2}\sum\theta^2\partial^2_\theta L$ in weight space, maximised at $p=1/2$, whereas depolarising noise implements an output-space penalty. A $30$-seed study confirms the formula's quantitative prediction: both mechanisms reduce the train-test gap by a small but statistically significant margin ($\approx\!0.01$; $p<10^{-4}$ and $p=0.004$), are statistically indistinguishable from each other, and the effect is concentrated where overfitting is largest; increasing the dropout rate past $1/2$ does not help, as the formula predicts. The single-seed dichotomy of prior work does not survive replication. We close with a neutral-atom realisation and a feasibility-by-$N$ analysis.
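
The two closed-form quantities quoted in this abstract, the read-out contraction factor $(1-4\gamma/3)^{wL}$ and the dropout prefactor $p(1-p)/2$, can be evaluated numerically. The sketch below does only that arithmetic. The function names and the sample values of $\gamma$, $L$, and $p$ are illustrative assumptions, not settings taken from the paper.

```python
def contraction_factor(gamma, layers, weight=1):
    """Factor (1 - 4*gamma/3) ** (weight * layers) quoted in the abstract.

    gamma: depolarising strength per layer.
    layers: number of entangling layers L.
    weight: Pauli weight w of the read-out observable.
    """
    return (1.0 - 4.0 * gamma / 3.0) ** (weight * layers)


def dropout_prefactor(p):
    """Prefactor p * (1 - p) / 2 of the curvature-weighted penalty."""
    return p * (1.0 - p) / 2.0


# Hypothetical setting: 3% depolarising noise over 4 layers, weight-1 read-out.
shrink = contraction_factor(gamma=0.03, layers=4)
print(round(shrink, 6))

# The prefactor peaks at p = 1/2, consistent with the abstract's statement
# that raising the dropout rate past 1/2 does not help.
for p in (0.3, 0.5, 0.7):
    print(p, dropout_prefactor(p))
```

With $\gamma = 0$ the factor is 1 (no contraction), and it decays geometrically in both the layer count and the Pauli weight.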

49. 【2606.24200】MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

Link: https://arxiv.org/abs/2606.24200

Authors: Junhyeok Lee, Han Jang, Hyeonjin Goh, Kyu Sung Choi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation, clinical settings increasingly, settings increasingly requires, English evidence corpora, increasingly requires multilingual

Comments: Under review. 15 pages, 3 figures

Abstract:Retrieval-augmented generation (RAG) in clinical settings increasingly requires multilingual retrieval against predominantly English evidence corpora. Multilingual medical retrieval demands three capabilities: cross-lingual alignment, concept discrimination, and evidence retrieval. However, existing benchmarks evaluate these only in isolation, leaving the interaction between biomedical expertise and multilingual coverage unmeasured. We introduce MMed-Bench-IR, a benchmark designed to disentangle these axes across 6 languages and three structurally heterogeneous tasks: (1) cross-lingual medical QA retrieval with 6,127 queries grounded in the Unified Medical Language System (UMLS), (2) concept discrimination over 4,975 confusion sets at three difficulty tiers, and (3) multilingual evidence retrieval for RAG with 2,040 quality-assured queries. The three tasks share zero concept and query overlap by design, ensuring that aggregate scores reflect genuine capability breadth. Evaluation of ten systems across six paradigm families reveals severe cross-lingual failure: biomedical encoders that score 0.818 nDCG@10 in English drop to 0.056 in Japanese, a gap that English-only benchmarks cannot detect.
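
The headline numbers in this abstract are nDCG@10 scores. As a reminder of what that metric measures, here is a minimal sketch using linear gain and a log2 rank discount. The function names and sample relevance lists are assumptions, and the ideal ranking is computed only from the labels in the supplied list, which presumes the list contains every relevant document for the query.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical rankings: relevant document at rank 1 versus rank 2.
perfect = ndcg_at_k([1, 0, 0])
second_place = ndcg_at_k([0, 1, 0])
print(perfect, round(second_place, 4))
```

Some evaluation toolkits use an exponential gain (`2**rel - 1`) instead; for binary labels the two variants give the same score.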

50. 【2606.24194】Dialogue to Discovery: Attribute-Aware Preference Elicitation for Conversational Product Search Assistants

Link: https://arxiv.org/abs/2606.24194

Authors: Sarthak Harne, Natwar Modani, Debabrata Mahapatra, Shubham Agarwal

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Conversational product search, search assistants offer, product search assistants, keyword-based product search, traditional keyword-based product

Comments:

Abstract:Conversational product search assistants offer a more expressive, natural, and interactive alternative to traditional keyword-based product search. With limited screen space, showing only a few items increases the need for precise preference elicitation, which can prolong conversations, leading to user frustration and session abandonment. Conversely, rushing to recommend items without a clear understanding of preferences risks poor matches and a degraded user experience. We present Dialogue to Discovery (D2D), an attribute-oriented preference elicitation framework that dynamically exploits the structure of product attributes to efficiently steer conversations toward the user's desired item. D2D adaptively prioritizes the most informative queries and strategically times product recommendations, reducing premature or off-target suggestions that harm engagement. To evaluate D2D, we curate three datasets from the Amazon Reviews corpus. In simulated conversations modelled using a multi-factor utilitarian patience framework, D2D achieves a 22.2-29.9% improvement in target-finding accuracy, 6.6-16.1% reduction in abandonment, and 27.5% shorter average conversations over the state-of-the-art baselines. A complementary user study further confirms significant gains in both user satisfaction and perceived efficiency.

51. 【2606.24192】Co-occurring associated retained concepts in Diffusion Unlearning

Link: https://arxiv.org/abs/2606.24192

Authors: Miso Kim, Georu Lee, Yunji Kim, Hoki Kim, Jinseong Park, Woojin Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: mitigate harmful content, harmful content generation, key technique, technique to mitigate, mitigate harmful

Comments: Accepted as a poster at ICLR 2026. Code available at [this https URL](https://github.com/damilab/CARE)

Abstract:Unlearning has emerged as a key technique to mitigate harmful content generation in diffusion models. However, existing methods often remove not only the target concept, but also benign co-occurring concepts. As illustrated in Fig.1, unlearning nudity can unintentionally suppress the concept of person, preventing a model from generating images with person. We define these undesirably suppressed co-occurring concepts that must be preserved CARE (Co-occurring Associated REtained concepts). Then, we introduce the CARE score, a general metric that directly quantifies their preservation across unlearning tasks. With this foundation, we propose ReCARE (Robust erasure for CARE), a framework that explicitly safeguards CARE while erasing only the target concept. ReCARE automatically constructs the CARE-set, a curated vocabulary of benign co-occurring tokens extracted from target images, and leverages this vocabulary during training for stable unlearning. Extensive experiments across various target concepts (Nudity, Van Gogh style, and Tench object) demonstrate that ReCARE achieves overall state-of-the-art performance in balancing robust concept erasure, overall utility, and CARE preservation.

52. 【2606.24188】Aspect-Based Sentiment Evolution and its Correlation with Review Rounds in Multi-Round Peer Reviews: A Deep Learning Approach

Link: https://arxiv.org/abs/2606.24188

Authors: Ruxue Hana, Haomin Zhoua, Jiangtao Zhong, Chengzhi Zhang

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Mining sentiment information, scientific evaluation process, offers valuable insights, comments offers valuable, Mining sentiment

Comments:

Abstract:Mining sentiment information from the textual content of peer review comments offers valuable insights into the scientific evaluation process. However, previous studies are often constrained by coarse-grained analysis and the lack of differentiation across review rounds. Notably, the dynamic shifts in reviewers' focus and sentiment tendencies throughout multiple review stages remain underexplored. To address this gap, the present study investigates the distribution and evolution of aspect-level sentiments and examines their correlation with the number of review rounds. We begin by segmenting the multi-round review comments of 11,063 accepted papers from Nature Communications and identifying fine-grained review aspect clusters. A manually annotated corpus of approximately 5,000 review sentences is then constructed. Using this dataset, we train a series of deep learning-based aspect sentiment classification models. Among them, the LCF-BERT-CDM model achieves the best performance, with a Macro-F1 score of 82.65%. Subsequent statistical analysis reveals a consistent trend: as the number of review rounds increases, the proportion of positive sentiments rises, while negative sentiments decline. Correlation analysis further indicates that aspect sentiment scores are negatively associated with the total number of review rounds. Key aspects exhibiting stronger correlations include "experiments", "research significance" and "result analysis".

53. 【2606.24177】Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy

Link: https://arxiv.org/abs/2606.24177

Authors: Youran Sun, Xingyu Ren, Chugang Yi, Jiaxuan Guo, Kejia Zhang, Jianda Du, Haizhao Yang

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Large language models, Large language, research production scalable, Agon, making research production

Comments:

Abstract:Large language models are making research production scalable, shifting the bottleneck from producing artifacts to judging claims. We present \textsc{Agon}, a research orchestrator that validates what can be checked inside the workflow and leaves the remaining judgments to human scientists. \textsc{Agon} is built on six design principles: Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. We ran \textsc{Agon} across domains for 444 iterations of Prompt Economy loops, using only small starting topics and no human-written experimental code. These deployments demonstrate scalability while exposing new classes of failure. We organize these failures into a taxonomy along severity, fixability, visibility, and capability locus. The taxonomy separates failures the loops can see and fix from those that require human judgment. Together, these results show that \textsc{Agon} is pushing research toward a new paradigm: machine scales, human steers.

54. 【2606.24176】A Synthetic Reliability-Aware PINN Benchmark for Offshore Wind Turbine Support-Structure Monitoring with Bayesian Inverse Identification

Link: https://arxiv.org/abs/2606.24176

Authors: Puneet Kant, Monika Tanwar

Categories: Computation and Language (cs.CL); Computation (stat.CO)

Keywords: Reliable structural health, fast state estimation, Reliable structural, requires fast state, offshore wind turbine

Comments: 18 Pages, 8 Figures

Abstract:Reliable structural health monitoring (SHM) of offshore wind turbine (OWT) support structures requires fast state estimation from sparse measurements. Repeated high fidelity finite element or aeroelastic analyses are difficult to use directly in online monitoring loops, while purely data-driven surrogates can require large training sets. This paper presents Digi Turbine, a synthetic reliability-aware Physics Informed Neural Network (PINN) benchmark for OWT monopile support structure monitoring. The workflow embeds a simplified Euler Bernoulli beam equation with Winkler soil foundation in the training objective, couples it with Bayesian-prior-informed inverse identification, and adds First Order Reliability Method (FORM) screening. All validation uses synthetic configurations with analytical or finite-difference ground truth motivated by the NREL 5MW reference turbine context.

55. 【2606.24172】A Pāninian Foundation for Indic Language Processing

Link: https://arxiv.org/abs/2606.24172

Authors: Ritwik Banerjee,Lav R. Varshney

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: billion people communicate, processing infrastructure serving, natural language processing, language processing infrastructure, fragmented and underdeveloped

Notes: 16 pages, 0 figures

Abstract:More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks around individual languages or small subsets of genealogical language families, building separate analyzers, parsers, and datasets for each language and starting over for the next. This overlooks a deep regularity. Through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pānini's grammar, the Astādhyāyī. This cuts across genealogical lines, uniting languages through a common framework. We argue that this Pāninian framework supplies a unifying computational architecture the field has lacked, and that benchmarks grounded explicitly in it would make Indic language systems more accurate, more data-efficient, and more transferable, effectively merging many apparently disparate and sparse Indic language resources into a single high-resource metalanguage bedrock. We propose a four-part benchmark suite to render this shared architecture explicit, measurable, and ready to be leveraged for practical applications. Moreover, we underscore the question it raises for interpretability research: whether neural models trained on these languages come to represent Pānini's categories on their own.

56. 【2606.24163】CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

Link: https://arxiv.org/abs/2606.24163

Authors: Joeun Kim,HoEun Kim,Young-Sik Kim

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: strict false-positive control, Reliable provenance, LLM outputs requires, outputs requires multi-bit, maintaining strict false-positive

Notes:

Abstract:Reliable provenance for LLM outputs requires multi-bit watermarks that remain robust under editing while maintaining strict false-positive control. Existing ECC-based LLM watermarks rely largely on hard-decision decoding, discarding token-level reliability information. We propose CORE-BREW, a Constant-hit-Rate Embedding extension of block-wise BREW for robust multi-bit watermarking. CORE-BREW calibrates the watermark channel by targeting a fixed hit rate p-star, yielding closed-form per-token log-likelihood ratios (LLRs) for principled soft-decision decoding. It supports two detection modes: Strict-Safe, which preserves the bounded-distance designated-codeword acceptance region, and FPR-Calibrated, which uses likelihood-based scoring and lightweight list decoding to characterize the FPR-TPR trade-off. Experiments on open-source LLMs under token-level edits and paraphrasing demonstrate improved low-FPR discrimination and robustness over prior multi-bit watermarking baselines while maintaining comparable semantic quality.
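
To make the idea of per-token LLRs for soft-decision decoding concrete, here is a minimal sketch. It assumes a binary symmetric channel in which each token "votes" for the embedded bit and agrees with it with probability `p_star`; the abstract does not give the paper's actual closed form, so the channel model and the function names (`token_llr`, `bit_llr`, `soft_decode_bits`) are illustrative assumptions.

```python
import math


def token_llr(vote: int, p_star: float) -> float:
    """LLR log P(vote | bit=1) / P(vote | bit=0) under the ASSUMED symmetric channel."""
    if not 0.5 < p_star < 1.0:
        raise ValueError("p_star must lie in (0.5, 1)")
    magnitude = math.log(p_star / (1.0 - p_star))
    return magnitude if vote == 1 else -magnitude


def bit_llr(votes, p_star):
    """Sum the per-token LLRs of all tokens that carry the same message bit."""
    return sum(token_llr(v, p_star) for v in votes)


def soft_decode_bits(votes_per_bit, p_star):
    """Return hard decisions plus the LLRs a soft-decision ECC decoder would consume."""
    llrs = [bit_llr(votes, p_star) for votes in votes_per_bit]
    bits = [1 if llr > 0 else 0 for llr in llrs]
    return bits, llrs


if __name__ == "__main__":
    # Bit 0 has a split vote (low reliability); bit 1 is unanimous (high reliability).
    bits, llrs = soft_decode_bits([[1, 1, 0], [0, 0, 0, 0]], p_star=0.75)
    print(bits, [round(x, 3) for x in llrs])
```

A hard-decision decoder would keep only `bits`; the magnitudes in `llrs` are the reliability information that, per the abstract, hard-decision decoding discards.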

57. 【2606.24162】BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

Link: https://arxiv.org/abs/2606.24162

Authors: Jin Huang,Yutong Xie,Wanli Song,Xingjian Zhang,Walter Yuan,Matthew O. Jackson,Qiaozhu Mei

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: behavioral science domains, behavioral, Foundation models, behavioral science, behavioral foundation models

Notes:

Abstract:Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations. We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per-subject accuracy but also population-level alignment, an essential requirement for behavioral validity. Leveraging the tasks in BehaviorBench, we further develop this http URL-1.5, extending the this http URL family of behavioral foundation models fine-tuned on behavioral data. Our results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models, fine-tuned on behavioral data, achieve substantially stronger distributional alignment. Notably, this http URL-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate this http URL-1.5's potential for a broad range of behavioral science studies. Our BehaviorBench and this http URL-1.5 models can be accessed via this https URL.

58. 【2606.24155】MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

Link: https://arxiv.org/abs/2606.24155

Authors: Ding Jinru,Jiang Chuchu,Lu Lu,Pang Wenrao,Bian Mouxiao,Gao Zhuangzhi,Chen Jiangyuan,Peng Xinwei,Chen Ruiyao,Ren Sijie,Lu Renjie,Han Bin,Liu Meiling,Xu Jie

Categories: Computation and Language (cs.CL)

Keywords: Existing medical, atomic skill evaluation, lack process visibility, Medical Atomic Skills, Clinical Cognitive Responsiveness

Notes:

Abstract:Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language, and agent systems) that moves from static QA to dynamic, process-oriented evaluation. MedBench v5 features: (1) a dual-dimensional framework combining Clinical Cognitive Responsiveness (14 sub-dimensions) and Medical Atomic Skills (4 agent environments), covering 63 tasks; (2) three switchable information-flow stressors (omission, contradiction, evidence delay) for factorized degradation analysis; (3) a dynamic process audit protocol with five reasoning nodes that produces model-specific failure fingerprints; (4) hallucination propagation monitoring across initiation, propagation, anchoring, and contradiction interaction, capturing silent hallucination. Experiments on frontier models show that strong overall task performance does not guarantee process stability: stressors mainly disrupt contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can remain superficially stable. MedBench v5 provides a unified infrastructure for capability profiling, controllable stress testing, process auditing, and hallucination trajectory analysis in clinical AI evaluation.

59. 【2606.24151】Metis: Bridging Text and Code Memory for Self-Evolving Agents

Link: https://arxiv.org/abs/2606.24151

Authors: Zijie Dai,Siuhin He,Hui Li,Qihui Zhou,Jiajun Li,Mingcong Song,Guoping Long,Hongjie Si,Xin Yao,Lin Zhang,James Cheng,Xiao Yan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Metis, memory, execution, distilling experience, experience

Notes: Work in progress

Abstract:Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as code exposed as callable tools. However, the choice between these representations is typically made at design time rather than derived from the characteristics of the experience itself, leaving the trade-offs between them poorly understood. We present the first controlled study that isolates text memory and code memory over an identical set of experiences. Our results show that the two forms exhibit complementary trade-offs in construction cost, execution efficiency, and transferability, such that neither representation alone is sufficient. Guided by these findings, we propose Metis, a self-evolving agent system built on a hierarchical dual-representation memory. Metis organizes textual experience into execution plans, environment facts, and common pitfalls, and selectively crystallizes recurring plans into validated callable tools. This design combines the broad applicability of text memory with the execution efficiency of code memory while incurring tool-generation cost only when justified by repeated reuse. We evaluate Metis on AppWorld, a challenging benchmark for interactive agents. The results show that Metis improves task accuracy by up to 20.6% over ReAct while reducing execution cost by up to 22.8%. Compared with representative self-evolving agent systems, Metis consistently achieves a better balance between accuracy, execution efficiency, and memory-construction cost.

60. 【2606.24133】Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

Link: https://arxiv.org/abs/2606.24133

Authors: Chenhao Dang,Jing Ma,Mingjie Liao

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Online Data Mixing, cornerstone of Large, Language Model

Notes: Our code is at https://github.com/DANG-ai/LLM-Training-Holistic-Data-Schedule

Abstract:The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a promising direction to improve efficiency. However, existing methods are constrained by their reliance on a singular optimization perspective, which fundamentally overlooks the need for complex LLM pre-training to consider the dynamic data composition from multiple dimensions. To overcome this limitation, we introduce the Holistic Data Scheduler (HDS), a novel online data mixing framework. HDS formulates the data scheduling challenge as a reinforcement learning problem in a continuous control space and leverages the Soft Actor-Critic (SAC) algorithm for its stability and sample efficiency in exploring the high-dimensional policy space. At the core of HDS lies a novel multi-objective, holistic reward function that integrates three critical perspectives: a data-driven reward for quality, a loss-driven reward capturing inter-domain influence, and a model-driven reward based on weight norms. To validate our design and determine its optimal configuration, we conducted systematic experiments on LLMs of various sizes. On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations. Furthermore, it achieves a 7.2% improvement on the MMLU 0-shot task along with consistent gains on other benchmarks, showcasing its ability to enhance both training efficiency and final model capability.

61. 【2606.24119】When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs

Link: https://arxiv.org/abs/2606.24119

Authors: Lucky Verma,Pratik Yadav

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Discrete diffusion language, diffusion language model, fine-tuning inherits inexpensive, inherits inexpensive diagnostics, Discrete diffusion

Notes: 14 pages, 3 figures. Code and result artifacts: https://github.com/lucky-verma/top1-fails-dlm-lora-monitors

Abstract:Discrete diffusion language model (DLM) fine-tuning inherits inexpensive diagnostics from denoising-time confidence monitors, but their PEFT-training meaning is untested. We test top-1 argmax concentration as a collapse warning. Across 816 LoRA/PEFT configurations from three DLM families, the warning fires for every configuration while logs record 0/816 actual collapses at the 200 step horizon, giving zero precision. The cause is pre-equilibrium saturation: top-1 concentration is already high before optimization and quickly becomes insensitive to final training stability. We then evaluate max LoRA gradient norm, a parameter-side signal that samples gradient routing rather than token concentration. On a pooled held-out LLaDA-family split, a train-optimized threshold identifies top-decile final-loss configurations with precision 0.68 and F1=0.79, above the all-positive top-1 baseline even at the lower split-bootstrap confidence bound. Autoregressive controls and cross-family threshold failures bound the result to short-horizon DLM-LoRA inspection rather than a universal collapse detector. Workflow: drop top-1 as a PEFT alarm, log max-gradient early in training, and calibrate thresholds per DLM family before routing runs for inspection.
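
The "zero precision" claim and the threshold calibration step can be illustrated with a short sketch. The counts 816 and 0 come from the abstract; everything else (the helper names `precision_recall_f1` and `fit_threshold`, the F1-maximizing search, and the toy gradient norms) is an assumption made here for illustration and not the authors' code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def fit_threshold(scores, labels):
    """Pick the score threshold that maximizes F1 on a training split (assumed criterion)."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        f1 = precision_recall_f1(tp, fp, fn)[2]
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    # Top-1 alarm from the abstract: fires on all 816 runs, none of which collapsed.
    print(precision_recall_f1(tp=0, fp=816, fn=0))
    # Toy max-gradient-norm values with hypothetical "needs inspection" labels.
    print(fit_threshold([0.1, 0.2, 0.9, 1.1], [0, 0, 1, 1]))
```

An alarm that fires on every run can never have non-zero precision when no positives exist, which is why the abstract recommends dropping it and calibrating a parameter-side threshold per model family instead.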

62. 【2606.24102】PORTER: Language-Grounded Event Representations for Portable Structured EHR Foundation Models

Link: https://arxiv.org/abs/2606.24102

Authors: Lin Lawrence Guo,Adam Paul Yan,Emily Vettese,Lillian Sung

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: electronic health record, EHR foundation, EHR foundation model, EHR foundation models, foundation models encode

Notes:

Abstract:Most electronic health record (EHR) foundation models encode clinical events as discrete event tokens from a fixed vocabulary and therefore cannot directly represent events containing unseen concepts or new combinations of concepts and attributes such as numeric values. This limits transfer across institutions and even across deployment pipelines within the same institution. We introduce PORTER, a language-grounded structured EHR foundation model that decouples event representation from this fixed vocabulary. PORTER represents events through their descriptions using a frozen text encoder, integrates numeric values through a dedicated pathway, and learns clinical dynamics over patient timelines with an autoregressively pretrained temporal backbone. Across 74 clinical prediction tasks at a pediatric hospital, PORTER matched the mean AUROC of a fixed-vocabulary model with the same temporal backbone and pretraining objective. When the same patient timelines were rendered using event descriptions not seen during pretraining, PORTER transferred without retraining or vocabulary mapping, recovering 97.1% of the mean AUROC of a model trained directly on the target vocabulary. When transferred to MIMIC, PORTER outperformed the fixed-vocabulary model, which dropped 69% of events because their tokens were unseen. Mechanistic analyses showed cross-vocabulary transfer tracked preservation of patient-level representation geometry rather than the scale of the text encoder, and the numeric pathway improved sensitivity to magnitude without disrupting clinical concept identity. PORTER also achieved higher AUROC than a task-specific text serialization comparator, at 329-fold lower amortized compute. PORTER is a step toward vocabulary-independent EHR foundation models that reduce the need for vocabulary harmonization while preserving in-domain performance and enabling efficient cross-task reuse.

63. 【2606.24099】Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

Link: https://arxiv.org/abs/2606.24099

Authors: Yuzhuo Wang,Chengzhi Zhang,Min Song,Seong Deok Kim,Youngsoo Ko,Juhee Lee

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: artificial intelligence, central to scientific, era of artificial, algorithm, influence

Notes:

Abstract:Algorithms have become central to scientific research in the era of artificial intelligence (AI). Although algorithm mentions in papers are often used to indicate popularity and influence, existing studies usually evaluate individual algorithms in isolation and pay limited attention to the collective influence formed through their interconnections. This study constructs large-scale algorithm co-occurrence networks in natural language processing (NLP) based on the full text of academic papers and investigates algorithm influence from a network perspective. Using deep learning models, we extract algorithm entities and build overall, cumulative, and annual co-occurrence networks. We analyze their structural characteristics and apply multiple centrality measures to assess the group influence of algorithms across the whole field and over time. The results show that algorithm networks display typical features of complex networks, with increasingly dense connections developing over approximately two decades. Classic, high-performing algorithms and those located at the intersections of different research periods tend to have high popularity, control, centrality, and balanced influence. When the influence of an algorithm declines, it usually loses its core network position first, followed by weaker associations with other algorithms. This study is the first large-scale analysis of algorithm co-occurrence networks. Covering more than four decades of academic publications, it provides a temporal and structural view of algorithm influence and offers a foundation for future research on networks linking algorithms, scholars, and tasks.

64. 【2606.24093】Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

Link: https://arxiv.org/abs/2606.24093

Authors: Chi-Sheng Chen,Hung-Yun Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Tang-dynasty poets leaves, China Biographical Database, detectable linguistic trace, Quan Tang Shi, Complete Tang Poems

Notes:

Abstract:We ask whether the geographic origin of Tang-dynasty poets leaves a detectable linguistic trace in their work. Aggregating every poem attributed to each author in the Complete Tang Poems (Quan Tang Shi) and linking poets to their administrative circuit of origin via the China Biographical Database (CBDB), we build a poet-level corpus of 357 poets across the ten Tang circuits and frame origin prediction as multi-class classification. Using character $n$-gram TF-IDF together with interpretable domain features (imagery, season, and allusion), classical and neural models predict a poet's broad region (South vs. North) at $0.69$ accuracy, well above the $0.53$ majority baseline, and finer circuit-level origin above chance. Beyond classification, three findings emerge. (i) Linguistic distance between circuits grows with geographic distance (Mantel $r=0.40$, $p\approx0.09$ over nine circuits), evidence of a distance-decay effect in poetic language. (ii) The signal interacts with time: South/North separability is at chance in the High Tang and strongest in the Late Tang, consistent with court-driven homogenization at the empire's height followed by regional divergence. (iii) The model's confident errors are historically meaningful -- in the Early Tang, every misclassification is a southern poet read as northern, reflecting the prestige of the northern court idiom. We further show that, when given the whole corpus through a hierarchical frozen-encoder representation, a classical-Chinese transformer (GuwenBERT) only matches -- not beats -- simple TF-IDF, and that combining them adds nothing, indicating that character $n$-grams already capture the regional signal. Our results position interpretable machine learning as a hypothesis generator for literary history.
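
For readers unfamiliar with the Mantel statistic quoted in finding (i), the following is a minimal sketch of a one-sided permutation Mantel test between two distance matrices. It uses only the standard library; the permutation count, the one-sided alternative, and the toy distances are assumptions made here, and the paper's exact procedure may differ.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabeling rows and columns by `order`."""
    return [matrix[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]


def mantel(a, b, permutations=999, seed=0):
    """Correlate two distance matrices; p-value from permuting the labels of `a`."""
    rng = random.Random(seed)
    identity = list(range(len(a)))
    b_flat = upper_triangle(b, identity)
    observed = pearson(upper_triangle(a, identity), b_flat)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(upper_triangle(a, perm), b_flat) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Toy example: five "circuits" on a line; linguistic distance equals geographic distance.
    points = [0, 1, 3, 6, 10]
    dist = [[abs(p - q) for q in points] for p in points]
    print(mantel(dist, dist))
```

With only nine circuits there are 36 pairwise distances, which helps explain why a moderate correlation such as $r=0.40$ can still carry a p-value near $0.09$.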

65. 【2606.24084】Blockwise Policy-Drift Gating for On-Policy Distillation

Link: https://arxiv.org/abs/2606.24084

Authors: Liwen Zheng,Haiyun Jiang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: On-policy distillation, computed on trajectories, OPD, sampled-token OPD, teacher signals computed

Notes: 8 pages

Abstract:On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that local teacher-support matching is a simple and effective repair. This paper introduces blockwise policy-drift gating, a lightweight student-only old-current drift controller for OPD under rollout reuse. The method computes log-probability shifts between the behavior student and the current student on the sampled token path, aggregates these shifts over fixed blocks or spans, and uses the resulting detached, mean-normalized gates to reweight OPD position losses. It does not change teacher targets, teacher top-K supports, or the rollout policy. In a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget for all trained variants, we use pass@8 as the primary problem-level solve-rate metric. Fixed 64-token block gating improves sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23. On Teacher-TopK/LSM, Block64 gives the best four-benchmark mean pass@8 among trained students. The results identify local old-current policy drift as a practical control signal for reused OPD rollouts and motivate block-level gating as a simple default for improving solve-rate robustness.
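
The gating pipeline described above (per-token log-probability shift, block aggregation, mean-normalized gates, loss reweighting) can be sketched in plain Python. The abstract does not state the gate function or its direction, so the choice `exp(-mean |shift|)`, which down-weights high-drift blocks, is an assumption, as are the names `block_gates` and `gated_loss`. In an autograd framework the gates would also be detached from the graph; here they are plain floats.

```python
import math


def block_gates(logp_behavior, logp_current, block_size=64):
    """Return one gate per token, constant within each fixed-size block."""
    if len(logp_behavior) != len(logp_current):
        raise ValueError("log-prob sequences must have the same length")
    shifts = [cur - old for old, cur in zip(logp_behavior, logp_current)]
    if not shifts:
        return []
    raw = []
    for start in range(0, len(shifts), block_size):
        block = shifts[start:start + block_size]
        drift = sum(abs(s) for s in block) / len(block)  # mean absolute shift in the block
        raw.append(math.exp(-drift))                     # ASSUMED gate shape
    mean_gate = sum(raw) / len(raw)
    normalized = [g / mean_gate for g in raw]            # block gates average to 1
    return [normalized[i // block_size] for i in range(len(shifts))]


def gated_loss(position_losses, token_gates):
    """Reweight per-position OPD losses by the gates and average."""
    return sum(l * g for l, g in zip(position_losses, token_gates)) / len(position_losses)


if __name__ == "__main__":
    old = [0.0, 0.0, 0.0, 0.0]
    new = [0.0, 0.0, -1.0, -1.0]   # the second block has drifted
    gates = block_gates(old, new, block_size=2)
    print(gates, gated_loss([1.0, 1.0, 1.0, 1.0], gates))
```

Because the gates only rescale position losses, the teacher targets and the rollout policy stay untouched, matching the abstract's description of a student-only controller.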

66. 【2606.24083】CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

Link: https://arxiv.org/abs/2606.24083

Authors: Morayo Danielle Adeyemi,Ryan A. Rossi,Franck Dernoncourt

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Talk short, Talk, models, model, cost

Notes:

Abstract:"Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel (the user's prompt or the model's response) is being compressed. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and reference-text agreement against the model's unconstrained reference. We evaluate eight models on five datasets at five reduction levels, with both channels measured on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect, a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct yet their surface text no longer entails the model's own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at this https URL.

67. 【2606.24077】Sentence-Level Contextual Entrainment in Large Language Models

Link: https://arxiv.org/abs/2606.24077

Authors: Yang Liu,Chenhui Chu

Categories: Computation and Language (cs.CL)

Keywords: newly discovered phenomenon, assign higher probabilities, Contextual entrainment, large language models, sentence-level contextual entrainment

Notes: 16 pages, 3 figures

Abstract:Contextual entrainment, which is a newly discovered phenomenon in large language models (LLMs), refers to the tendency of a model to assign higher probabilities to tokens that appear in its context. In this work, we extend this phenomenon from the token level to the sentence level by examining the per-token mean log-probability of a sentence instead of the probabilities of individual tokens. We investigate sentence-level contextual entrainment across 26 LLMs from seven families and two datasets, which cover both subjective and objective tasks. We find that sentence-level contextual entrainment exists. This means that the sentences in the prompt (even if they are counterfactual statements) can significantly increase their probability during model inference time. As the model size increases, contextual entrainment gradually decreases. We also find that contextual entrainment is controlled by 2% to 4% of the attention heads. Turning off these attention heads can effectively mitigate contextual entrainment without hurting the model's performance.
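
The sentence-level score used here, the per-token mean log-probability, is simple to compute once token log-probabilities are available. The sketch below assumes they have already been obtained from a model; the helper names and the toy numbers are hypothetical, and the "gap" is just one plausible way to quantify entrainment.

```python
def mean_logprob(token_logprobs):
    """Per-token mean log-probability of a sentence (length-normalized score)."""
    if not token_logprobs:
        raise ValueError("sentence must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def entrainment_gap(with_context, without_context):
    """Score change when the sentence also appears in the prompt (assumed measure)."""
    return mean_logprob(with_context) - mean_logprob(without_context)


if __name__ == "__main__":
    # Hypothetical log-probs for the same three-token sentence in two prompts.
    baseline = [-2.0, -3.0, -4.0]
    primed = [-1.0, -1.5, -2.0]
    print(entrainment_gap(primed, baseline))
```

Normalizing by length keeps sentences of different token counts comparable, which summing raw log-probabilities would not.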

68. 【2606.24066】VieSpeaker: A Large-Scale Vietnamese Speaker Recognition Dataset Beyond Visual Dependency

Link: https://arxiv.org/abs/2606.24066

Authors: Viet Hoang Pham,Tran Trung Nguyen,Bao Thu Ho,Phuong Tuan Dat,Thi Thu Trang Nguyen

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Vietnamese remains under-resourced, existing corpora limited, remains under-resourced, acoustic diversity, advanced rapidly

Notes: 5 pages, 1 figure, 6 tables, Accepted at Interspeech 2026

Abstract:Speaker recognition has advanced rapidly with large-scale training datasets, yet Vietnamese remains under-resourced, with existing corpora limited in scale and acoustic diversity. Most large-scale datasets rely on facial cues to link speech with speaker identities, restricting data collection to recordings where speakers appear on camera. We propose a face-independent dataset construction pipeline and introduce VieSpeaker, a large-scale Vietnamese speaker recognition dataset. Our approach leverages textual metadata and large language model reasoning to infer speaker identities from transcripts and contextual information. VieSpeaker contains approximately 902 hours of speech from 4,715 speakers. Experiments show that models trained on VieSpeaker achieve improved robustness and generalization compared to existing Vietnamese datasets. This work demonstrates the feasibility of face-independent dataset construction and provides a new direction for building large-scale speech resources.

69. 【2606.24063】Selective Capability Unlearning in End-to-End Spoken Language Understanding

Link: https://arxiv.org/abs/2606.24063

Authors: Akanksha Singh,Vinod Kumar Kurmi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Modern spoken language, spoken language understanding, Modern spoken, language understanding, systems are increasingly

Notes: 5 pages, 3 figures, preprint

Abstract:Modern spoken language understanding (SLU) systems are increasingly deployed in real-world settings, where specific functionalities may need to be removed due to policy or safety constraints. In SLU, a functionality corresponds to an intent and its associated slot-generation behavior. However, in autoregressive models, suppressing a target intent does not eliminate the conditional mapping that generates slots conditioned on that intent. When the intent prefix is externally supplied, the model can reconstruct the original intent-slot structure. We identify this structural failure as capability persistence. We propose Binding Subspace (BSU), a representation-level framework that isolates and attenuates intent-conditioned directions underlying this mapping. Across SLU benchmarks, BSU substantially reduces forced-prefix recoverability while preserving retained performance.

70. 【2606.24055】Best Preprocessing Techniques for Sentiment Analysis

Link: https://arxiv.org/abs/2606.24055

Authors: Saranzaya Magsarjav,Melissa Humphries,Jonathan Tuke,Lewis Mitchell

Categories: Computation and Language (cs.CL)

Keywords: enables monitoring public, monitoring public opinion, Sentiment analysis, Twitter datasets, analysis in Twitter

Notes: 9 pages, 3 figures

Abstract:Sentiment analysis in Twitter datasets is important because it enables monitoring public opinion on products and analysis of political and social movements. One critical step is preprocessing: the automated processing of text for machine learning algorithms. Preprocessing plays a critical role in reducing noise and improving efficiency. However, little research has systematically examined the order in which preprocessing techniques are implemented. We find that, when accounting for order, spelling correction is the least impactful preprocessing technique, whereas tokenisation is the most impactful. Stemming and stop-word removal are interchangeable, and it is better to remove stop words without removing negation. The best order for applying the preprocessing techniques was tokenisation, text cleaning, stemming, and then stopword removal. Our results provide a systematic approach for practitioners to deploy preprocessing to improve model output without the costly preprocessing exploratory phase.
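
The recommended order (tokenisation, text cleaning, stemming, then stop-word removal that keeps negation) can be sketched as a small pipeline. Everything concrete below is an assumption for illustration: the whitespace tokeniser, the cleaning rules, the tiny stop-word list, and especially the toy suffix stemmer, which stands in for a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "this", "that", "to", "of", "and", "not", "no"}
NEGATIONS = {"not", "no", "never"}  # kept even though some appear in STOP_WORDS


def tokenise(text):
    """Step 1: split on whitespace."""
    return text.split()


def clean(tokens):
    """Step 2: drop URLs and @mentions, lowercase, strip non-letter characters."""
    cleaned = []
    for token in tokens:
        if token.startswith(("http://", "https://", "@")):
            continue
        token = re.sub(r"[^a-z']", "", token.lower())
        if token:
            cleaned.append(token)
    return cleaned


def stem(tokens):
    """Step 3: toy suffix stripping that leaves at least four characters."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 4:
                token = token[:-len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def remove_stop_words(tokens):
    """Step 4: remove stop words but keep negation words."""
    return [t for t in tokens if t in NEGATIONS or t not in STOP_WORDS]


def preprocess(text):
    return remove_stop_words(stem(clean(tokenise(text))))


if __name__ == "__main__":
    print(preprocess("@bob This is NOT working https://t.co/x"))
```

Keeping "not" preserves the polarity flip in the example tweet, which is the reason the abstract gives for removing stop words without removing negation.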

71. 【2606.24040】Towards Version-aware Operations and Transaction Memories for Multi-layer MeMo

Link: https://arxiv.org/abs/2606.24040

Authors: Peiran Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)

Keywords: multi-layer correlation matrix, correlation matrix memories, explicit multi-layer correlation, proposes language models, multi-layer correlation

Notes: Accepted by MeMo Workshop on Mechanistic Interpretability Neuro-symbolic Approaches by-design, Rome (Italy), 24/6/2026

Abstract:MeMo proposes language models with explicit multi-layer correlation matrix memories (CMMs), where memorization, retrieval, and forgetting are architectural operations. This paper asks how such memories can reduce the need for retraining when knowledge changes. For changes expressible as MeMo memory associations, the model's accessible knowledge can be updated by editing explicit memories rather than retraining the whole model. We propose a version-aware operation layer in which high-level operations such as replace, obsolete, keep-history, rollback, and trace are compiled into MeMo-native primitive calls over sequences and tokens. The key observation is that a version-aware operation is rarely a single MeMo association. It is an ordered transaction of primitive edits, for example forgetting one sequence-token chain, memorizing another, preserving a historical chain, and recording an inverse program. The framework introduces two auxiliary CMMs: a Version CMM (V-CMM) for mapping version transitions to transaction handles, and a Transaction CMM (T-CMM) for storing reusable change contents and inverse programs. It supports both direct sequence-level edits and structured diff-level inputs, and outlines an evaluation route for update success, rollback, traceability, locality, and transaction reuse.
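
The observation that a version-aware operation is an ordered transaction of primitive edits with a recorded inverse program can be illustrated with a toy model. A plain dictionary stands in for the memory here; it is not a correlation matrix memory, and `ToyMemory`, `run`, and `replace` are hypothetical names that do not come from MeMo.

```python
class ToyMemory:
    """Dictionary stand-in for an associative memory with two primitives."""

    def __init__(self):
        self.assoc = {}

    def memorize(self, key, value):
        self.assoc[key] = value

    def forget(self, key):
        self.assoc.pop(key, None)


def run(memory, program):
    """Execute an ordered list of (op, key, value) primitive edits."""
    for op, key, value in program:
        if op == "forget":
            memory.forget(key)
        elif op == "memorize":
            memory.memorize(key, value)
        else:
            raise ValueError(f"unknown primitive: {op}")


def replace(memory, key, new_value):
    """Compile a high-level replace into primitives and return its inverse program."""
    old_value = memory.assoc.get(key)
    program = [("forget", key, None), ("memorize", key, new_value)]
    inverse = [("forget", key, None)]
    if old_value is not None:
        inverse.append(("memorize", key, old_value))
    run(memory, program)
    return inverse


if __name__ == "__main__":
    memory = ToyMemory()
    memory.memorize("latest_release", "v1")
    undo = replace(memory, "latest_release", "v2")
    print(memory.assoc)   # state after the replace transaction
    run(memory, undo)     # rollback by replaying the inverse program
    print(memory.assoc)
```

Storing the inverse alongside the forward program is what makes rollback a replay rather than a retraining step, which is the role the abstract assigns to the Transaction CMM.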

72. 【2606.24033】RoPE-Aware Bit Allocation for KV-Cache Quantization

Link: https://arxiv.org/abs/2606.24033

Authors: Fengfeng Liang,Yuechen Zhang,Jiaya Jia

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Existing low-bit KV-cache, Existing low-bit, low-bit KV-cache quantizers, flat vector, quantizers often treat

Notes: Preprint. Code available at https://github.com/JIA-Lab-research/blockgtq

Abstract:Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE (TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4, and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K where fp16 OOMs. Code is available at this https URL.
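
Greedy integer bit allocation by marginal gain can be sketched with a priority queue. The distortion model below, in which a block with energy `e` quantized at `b` bits has error proportional to `e * 4**(-b)`, is a textbook high-resolution approximation assumed here for illustration; the paper's energy score and gain function are not given in the abstract, and `allocate_bits` is a hypothetical name.

```python
import heapq


def allocate_bits(energies, total_bits, max_bits=8):
    """Give out bits one at a time to the block with the largest error reduction."""
    bits = [0] * len(energies)
    # Gain of the first bit under the ASSUMED model: e * (4**0 - 4**-1) = 0.75 * e.
    heap = [(-0.75 * energy, index) for index, energy in enumerate(energies)]
    heapq.heapify(heap)
    remaining = total_bits
    while remaining > 0 and heap:
        _, index = heapq.heappop(heap)
        bits[index] += 1
        remaining -= 1
        if bits[index] < max_bits:
            # Gain of the next bit shrinks by a factor of 4 per bit already assigned.
            gain = 0.75 * energies[index] * 4.0 ** (-bits[index])
            heapq.heappush(heap, (-gain, index))
    return bits


if __name__ == "__main__":
    # Four blocks, an average budget of 2 bits per block, one high-energy block.
    print(allocate_bits([16.0, 1.0, 1.0, 1.0], total_bits=8))
```

Under a fixed budget the high-energy block ends up with more bits than the low-energy ones, which is the behavior the abstract motivates for high-energy RoPE blocks.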

73. 【2606.24014】Reinforcement Learning Towards Broadly and Persistently Beneficial Models

Link: https://arxiv.org/abs/2606.24014

Authors: Akshay V. Jagadeesh, Rahul K. Arora, Khaled Saab, Ali Malik, Mikhail Trofimov, Foivos Tsimpourlas, Johannes Heidecke, Karan Singhal

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: high-stakes settings, systems are deployed, deployed across increasingly, increasingly diverse, diverse and high-stakes

Notes: Blog: [this https URL](https://alignment.openai.com/beneficial-rl/)

Abstract:As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent alignment generalization beyond the training distribution. We construct a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, spanning varied domains, including health, science, and education. We then train models with RL on this dataset and evaluate them on more than 50 independent benchmarks of alignment and beneficial behavior. Compared to a compute-matched baseline, beneficial trait RL improves performance on over 80% of these out-of-distribution benchmarks. We observe substantial out-of-distribution alignment transfer: a beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. Finally, we study alignment persistence: whether behavior remains robustly aligned under attempts to steer models towards misalignment. Models trained with beneficial trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning; further work is required to isolate the sources of these effects. These results suggest that RL to reinforce beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.

74. 【2606.24004】Towards Spec Learning: Inference-Time Alignment from Preference Pairs

Link: https://arxiv.org/abs/2606.24004

Authors: Dhriti Krishnan, Tejas Goyal, Jaromir Savelka

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: desired behavior typically, Steering a large, behavior typically relies, large language model, large language

Notes:

Abstract:Steering a large language model (LLM) toward a desired behavior typically relies on an iterative process of hand-crafting a prompt based on a careful inspection of the model's responses. This is an involved, brittle, and error-prone process. Preference-based fine-tuning is a more rigorous but often prohibitively expensive solution. We propose spec learning, a framework that relies on a brief user instruction and a small set of preference judgments. These are compiled into specifications in the form of natural-language prompts for an LLM. Specifications condition LLMs at inference time, and no parameter updates to the underlying models are required. We show that the responses generated based on the compiled specifications often outperform direct preference optimization (DPO) on datasets from specialized domains whose preference signal is dense. Unlike opaque weight updates, the resulting specifications are human-readable and double as interpretable and transparent written embodiments of the preference signal that produced them.

75. 【2606.23992】RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

Link: https://arxiv.org/abs/2606.23992

Authors: Sumit Mukherjee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: clinical decision support, standardized terminology codes, quality measurement, decision support, clinical code systems

Notes:

Abstract:Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that direct zero-shot large language model (LLM) generation is poorly suited to this task: clinical code systems are large, version-controlled, and not reliably memorized by language models. We study a stage-wise alternative in which candidate-pool construction is optimized for recall and a constrained LLM adjudicator is optimized for candidate selection. On the full 3,744-value-set RASC test split, Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval increases candidate-pool recall from the original RASC retrieval baseline of 0.553 to 0.730; on the held-out-publisher stratum, pool recall is 0.655. The higher-recall pool alone is not sufficient: applying the original SAPBert cross-encoder to this expanded pool gives full-test macro F1 of 0.287 and held-out-publisher macro F1 of 0.233. Replacing the stage-2 selector with blinded GPT-5 adjudication over the same pool increases full-test macro F1 to 0.549 and held-out-publisher macro F1 to 0.533. These results show that retrieval-constrained LLM adjudication can substantially improve value set completion while preserving the safety constraint that all returned codes must come from an auditable candidate pool.
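The two quantities reported in this abstract, candidate-pool recall and macro F1, together with the constraint that returned codes must come from the candidate pool, can be sketched as follows. The function names and the placeholder codes in the usage example are hypothetical; the paper's actual evaluation code may differ.

```python
def pool_recall(gold_codes, candidate_pool):
    """Fraction of gold codes that appear in the retrieved candidate pool."""
    gold = set(gold_codes)
    return len(gold & set(candidate_pool)) / len(gold) if gold else 0.0


def constrain_to_pool(selected_codes, candidate_pool):
    """Drop any selected code that is not in the auditable candidate pool."""
    pool = set(candidate_pool)
    return [code for code in selected_codes if code in pool]


def f1(gold_codes, predicted_codes):
    """Set-level F1 between the gold value set and the predicted one."""
    gold, pred = set(gold_codes), set(predicted_codes)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(examples):
    """Unweighted mean of per-value-set F1 over (gold, predicted) pairs."""
    return sum(f1(gold, pred) for gold, pred in examples) / len(examples)


if __name__ == "__main__":
    gold = ["C1", "C2", "C3", "C4"]   # placeholder codes, not real terminology
    pool = ["C1", "C2", "C3", "X9"]
    raw_selection = ["C1", "C2", "Z0"]  # "Z0" is outside the pool
    kept = constrain_to_pool(raw_selection, pool)
    print(pool_recall(gold, pool), kept, f1(gold, kept))
```

The sketch also makes the abstract's point about stage separation concrete: pool recall bounds the recall the selector can reach, since anything outside the pool is filtered out.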

76. 【2606.23989】Faithful by Construction: Claim-Anchored Attribution for Multi-Document Summarization

Link: https://arxiv.org/abs/2606.23989

Authors: Shuo Guan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generated post hoc, produce fluent multi-document, fluent multi-document summaries, summary statement hard, large language models

Notes:

Abstract:End-to-end large language models (LLMs) produce fluent multi-document summaries but remain prone to hallucination, and the attributions they offer are typically coarse (whole documents or passages) and generated post hoc, leaving each summary statement hard to verify. We revisit the modular Extract--Select--Rewrite paradigm and recast its intermediate representation as the unit of attribution. We present CAMS, a Claim-Anchored Multi-document Summarization framework that (i) extracts atomic claims with token-level provenance from every source document, (ii) clusters equivalent claims across documents while flagging inter-source conflicts, (iii) selects a support-aware and salient subset, and (iv) rewrites the selection into a summary in which every sentence is anchored to a support-checked claim that links back to one or more source spans. Because content is localized before it is realized, the pipeline is attribution-oriented by construction and faithfulness-oriented by construction: it structurally preserves fine-grained, multi-source traceability while using support-aware selection, constrained rewriting, and verification to encourage, rather than guarantee, factual faithfulness. We evaluate quality, faithfulness, and localization on MultiNews, analyze conflict handling on DiverseSumm, and test zero-shot transfer on WCEP, using a two-regime protocol that separates reference-free citation quality from gold-aligned localization accuracy, and we add an evaluator-decoupled audit that tests citation precision with a support model never used for selection or verification. CAMS matches strong end-to-end and span-attribution baselines on summary quality while substantially improving faithfulness and citation precision, lifting multi-source attribution accuracy by roughly two-thirds, and exposing a controllable faithfulness--coverage trade-off that end-to-end models leave implicit.

77. 【2606.23959】Does My Embedding Reflect That $A = B$? Evaluating Mathematical Equivalence in Embedding Models

Link: https://arxiv.org/abs/2606.23959

Authors: Jiaying Ye, Samarth Rao, Leo Carlin, Kedar Chintalapati, Saharsh Bhargava, Rachit Jaiswal, Michael Zhou, Jared Darlington, Jarod Alper, Vasily Ilin, Henry Kvinge

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: highly abstract, mathematics is highly, forms depending, Abstract, Mathematically Equivalent

Notes: 18 pages, comments welcome

Abstract:Because mathematics is highly abstract, a single statement can take very different forms depending on what subfield it is framed in. There are many examples where breakthroughs occurred after researchers discovered that a question had already been answered in a different field. At the same time, the growth of new resources related to formalization has increased the need for tools that enable efficient and reliable navigation between mathematical 'languages' (e.g., from Lean to natural language). In this paper, we investigate whether current embedding models capture mathematical equivalence. To do this, we introduce the Mathematically Equivalent but Lexically Different Pairs (MELD) Dataset, a collection of mathematically equivalent statements that are expressed in very different language. We show that current state-of-the-art embedding models tend to group statements by the terminology used to make them instead of the underlying math. Motivated by this, we propose a contrastive approach to learning embeddings of mathematical text that focuses on aligning informal statements with different formalizations. Our experiments demonstrate that this leads to improvements not only on informal-formal retrieval tasks but also on MELD, which only contains natural language statements.

78. 【2606.23948】Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

Link: https://arxiv.org/abs/2606.23948

Authors: Hamid Mojarad, Kevin Tang

Categories: Computation and Language (cs.CL)

Keywords: internal representations encode, African American English, Self-supervised and supervised, representations encode, investigate which linguistic

Notes: This paper has been accepted for presentation at Interspeech 2026

Abstract:Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity. To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.

79. 【2606.23943】QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

Link: https://arxiv.org/abs/2606.23943

Authors: Maria Contreras

Categories: Computation and Language (cs.CL)

Keywords: NLP pipelines, step in NLP, capture morphological correctness, Southern Quechua, South America

Notes: 4 pages, 3 tables, 1 figure. Code available at [this http URL](http://kaggle.com/code/macmaky/quechuatok)

Abstract:Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark comparing four tokenization strategies - BPE, Unigram LM, WordPiece, and a morphology-aware PRPE tokenizer - for Southern Quechua (quz), a low-resource agglutinative language spoken by 8-10 million people in South America. Using a 200k-sentence corpus and the SQUOIA finite-state morphological analyzer (Rios, 2016) as silver standard, we evaluate three metrics: fertility rate, OOV rate, and morphological boundary accuracy (MorphAcc). Our results show that BPE achieves the lowest fertility rate (1.636 at 16k vocab) by memorizing surface word forms, while achieving only 6.67% MorphAcc. PRPE achieves 83.33% MorphAcc - the highest of all systems - demonstrating that fertility rate alone is insufficient to evaluate tokenizers for agglutinative languages. All code and models are publicly available at this http URL
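To make the contrast between fertility rate and morphological boundary accuracy concrete, here is a minimal sketch. The abstract does not define MorphAcc precisely, so the definition used below (the share of words whose token boundaries exactly match the gold morpheme boundaries) is an assumption, as are the function names.

```python
def boundaries(segments):
    """Internal character offsets where one segment ends and the next begins."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def fertility(tokenized_words):
    """Average number of tokens produced per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def morph_acc(tokenized_words, gold_segmentations):
    """Assumed definition: share of words whose token boundaries equal the gold ones."""
    hits = sum(
        boundaries(tokens) == boundaries(gold)
        for tokens, gold in zip(tokenized_words, gold_segmentations)
    )
    return hits / len(gold_segmentations)


if __name__ == "__main__":
    # Illustrative word: wasi-kuna-pi ("in the houses"): root + plural + locative.
    gold = [["wasi", "kuna", "pi"], ["wasi", "kuna", "pi"]]
    morphological = [["wasi", "kuna", "pi"], ["wasi", "kuna", "pi"]]
    memorized = [["wasikunapi"], ["wasikunapi"]]
    print(fertility(morphological), morph_acc(morphological, gold))
    print(fertility(memorized), morph_acc(memorized, gold))
```

A tokenizer that stores whole surface forms gets the lowest possible fertility here while matching no morpheme boundaries, which mirrors the abstract's argument that fertility alone is not a sufficient metric.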

80. 【2606.23938】Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs

Link: https://arxiv.org/abs/2606.23938

Authors: Xiangbo Gao, Xiukun Huang, Boyu Lu, Junge Zhang, Mengjie Mao, Jiachen Li, Wei Xiong, Zhengzhong Tu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: leverage pretrained VLM, pretrained VLM representations, VLA models incorporating, rationale causally connected, Driving VLA models

Notes:

Abstract:Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the step-by-step decision semantics needed to keep the rationale causally connected to the planned motion. We introduce Neuro-Symbolic Drive, a neuro-symbolic driving framework that supervises a driving VLA with rule-grounded reasoning traces extracted directly from classical rule-based planners. Our key observation is that rule-based planners are symbolic AI systems that already function as executable reasoning engines: they reason about active safety constraints, search over candidate maneuvers, and select a final trajectory. We instrument these planners in simulation to capture both the executed trajectory and the internal decision trace at each rule-evaluation step. Each trace is serialized into structured rule-grounded reasoning and paired with the trajectory to fine-tune Qwen3.5-4B as a driving VLA. Because these traces are derived directly from the planner states that determine the action, they ensure reasoning is structurally coupled to motion generation by construction, rather than by post-hoc alignment. On our simulator-generated benchmark, detailed rule-grounded reasoning reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and from 0.54 to 0.26 and 10.13% to 5.99% under eight-camera perception. Neuro-Symbolic Drive thus converts neuro-symbolic planning logic into structured supervision. Code base: this https URL.

81. 【2606.23937】When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

Link: https://arxiv.org/abs/2606.23937

Authors: Tianyu Ding, Juan Pablo De la Cruz Weinstein

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: downstream decision model, decision model, Exact-match retrieval recall, Exact-match retrieval, retrieved

Notes:

Abstract:Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13-0.17 after tuning. We then replace the benchmark-designated policy clause with the top-ranked clause retrieved from decision-time context. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02, task-cluster 95% CI [-0.23,+0.21]); mismatched-policy and no-policy controls score 0.32 and 0.21. We do not detect a macro-F1 difference between retrieved and gold clauses in this configuration, although the interval remains too wide to establish non-inferiority. The same qualitative pattern appears with a second retriever and at 7B, while varying across fine-tuning configurations. These results indicate that exact-match clause recall can underestimate downstream policy utility in this benchmark setting, motivating evaluation with retrieved policies in the classification loop rather than recall alone.

82. 【2606.23915】Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

Link: https://arxiv.org/abs/2606.23915

Authors: Tianyu Ding, Aditya Nannapaneni, Juan Pablo De la Cruz Weinstein

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: LLM retrieval-augmented generation, Practice often treats, treats automatic metrics, generation as interchangeable, retrieval-augmented generation

Notes:

Abstract:Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models (clean and FEVER NLI, the checker MiniCheck) -- across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment), asking whether any scorer transfers: stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct. In the construct with the most multi-dataset human-labeled coverage -- generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150) -- none does: the per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact. This instability has a concrete decision cost: a naive "best-on-average" rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others. A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic -- relocating, not removing, the validation burden.
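The leave-one-dataset-out check on a "best-on-average" evaluator choice can be sketched as below. The function names are hypothetical and the AUROC values in the usage example are invented for illustration; they are not the paper's numbers.

```python
def best_on_average(scores, datasets):
    """Pick the scorer with the highest mean AUROC over the given datasets."""
    return max(
        scores,
        key=lambda scorer: sum(scores[scorer][d] for d in datasets) / len(datasets),
    )


def leave_one_out_regret(scores):
    """Mean held-out regret of the best-on-average rule.

    scores maps scorer name -> {dataset name -> AUROC}. For each held-out
    dataset, the scorer is chosen on the remaining datasets, and regret is the
    gap between the best scorer on the held-out dataset and the chosen one.
    """
    datasets = sorted(next(iter(scores.values())))
    regrets = []
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        chosen = best_on_average(scores, train)
        best = max(scores[scorer][held_out] for scorer in scores)
        regrets.append(best - scores[chosen][held_out])
    return sum(regrets) / len(regrets)


if __name__ == "__main__":
    # Invented AUROCs for two scorers on three datasets (illustration only).
    scores = {
        "nli": {"d1": 0.9, "d2": 0.9, "d3": 0.5},
        "bertscore": {"d1": 0.7, "d2": 0.7, "d3": 0.9},
    }
    print(round(leave_one_out_regret(scores), 3))
```

When per-dataset rankings invert, as in this toy example, the rule picks the wrong scorer for every held-out dataset, so the regret stays well above zero.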

83. 【2606.23885】Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Link: https://arxiv.org/abs/2606.23885

Authors: Davide Caffagni, Alberto Compagnoni, Federico Melis, Sara Sarto, Pier Luigi Dovesi, Mark Granroth-Wilding, Marcella Cornia, Lorenzo Baraldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Large Language Models, Multimodal Large Language, external vision encoder, Large Language, improve Multimodal Large

Notes:

Abstract:Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
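As a rough illustration of the Mutual K-Nearest Neighbor idea referenced above, the sketch below scores two sets of representations of the same samples by how much their k-nearest-neighbor sets overlap. It uses plain Euclidean distance on small Python lists; the distance choice and the function names are assumptions, and this is the non-differentiable metric only, not HeRA's contrastive objective.

```python
import math


def knn_indices(vectors, k):
    """For each vector, the set of indices of its k nearest neighbors (self excluded)."""
    neighbors = []
    for i, v in enumerate(vectors):
        distances = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i
        )
        neighbors.append({j for _, j in distances[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean overlap of k-NN sets computed separately in two representation spaces.

    feats_a[i] and feats_b[i] must describe the same sample (for example, one
    image seen by a vision encoder and by an attention head of the LLM).
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    space_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
    scaled = [(0.0,), (2.0,), (10.0,), (12.0,)]    # same neighborhoods as space_a
    shuffled = [(0.0,), (5.0,), (1.0,), (6.0,)]    # different neighborhoods
    print(mutual_knn_alignment(space_a, scaled, k=1))
    print(mutual_knn_alignment(space_a, shuffled, k=1))
```

The score depends only on local neighborhood structure, so rescaling a space leaves it unchanged while rearranging which samples are close to each other lowers it.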

84. 【2606.23884】One Year Later...The Harms Persist, But So Do We!

Link: https://arxiv.org/abs/2606.23884

Authors: Annika Marie Schoene, Cansu Canca, Gautham Vijay Kumar, Anson Antony

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: General-purpose large language, mental health-related conversations, General-purpose large, large language models, safety safeguards remain

Notes: 20 pages, 8 tables

Abstract:General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into educational settings particularly concerning.

85. 【2606.23881】Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Link: https://arxiv.org/abs/2606.23881

Authors: Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, Question Answering, requires grounding visual, Visual Question

Notes: Accepted by ACL 2026 Findings. Project page [this https URL](https://github.com/VAN-QIAN/ACL26-IBA/)

Abstract:Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multimodal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks requiring groundings from both fine-grained entity and evidence levels. Most existing multi-modal retrieval-augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue that both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer (IBA) framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once the correct entity is fixed. Our implementation is made public to ease reproducibility.

86. 【2606.23870】ESBMC-PLC+: A Unified IEC 61131-3 Formal Verification Framework as a PLCverif Successor

Link: https://arxiv.org/abs/2606.23870

Authors: Pierre Dantas, Lucas Cordeiro, Waldir Junior

Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: developed at CERN, PLC formal verification, mature open-source platform, PLC formal, Ladder Diagram

Notes: 21 pages

Abstract:PLCverif is the most mature open-source platform for PLC formal verification, developed at CERN and in production use since 2019. Yet it has two fundamental limitations: no support for Ladder Diagram (LD) programs, the dominant PLC notation, and reliance on CBMC as its primary backend, which restricts verification to bounded proofs. The PLCverif authors themselves identified ESBMC as the appropriate backend improvement. Prior work established ESBMC-PLC (a textual LD frontend with k-induction) and ESBMC-GraphPLC (graphical PLCopen XML support); together, they cover LD with unbounded proofs but not Structured Text (ST), and graphical LD with timer/counter function blocks remains unverifiable. This paper presents ESBMC-PLC+, a unified framework that closes both gaps: (1) an ST/SCL frontend via the MATIEC IEC 61131-3 compiler, routing C-compiled ST to ESBMC with nondeterministic input modeling and YAML property injection; (2) function block state semantics for graphical LD, extending the DFS resolver to model TON/TOF/TP timers, CTU/CTD counters, and R_TRIG/F_TRIG edge triggers as persistent scan-cycle state variables in the GOTO IR. ESBMC-PLC+ is the first open-source PLC verification framework to support all three major IEC 61131-3 input formats via a single ESBMC backend, enabling k-induction-unbounded safety proofs. A feature comparison with PLCverif and experimental evaluation on 8 benchmark programs, including programs with up to 8 integer timers, shows that ESBMC-PLC+ matches PLCverif's input coverage while providing stronger guarantees. Against nuXmv's BDD backend, ESBMC-PLC+ is 400-2,000x faster on timer programs and completes proofs where nuXmv BDD times out at 120s.

87. 【2606.23797】From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes

Link: https://arxiv.org/abs/2606.23797

Authors: Mariano Garralda-Barrio

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: large language model, multi-agent orchestration frameworks, orchestration frameworks make, frameworks make production, make production large

Notes: 21 pages, 7 figures, 10 tables

Abstract:Graph and multi-agent orchestration frameworks make production large language model (LLM) workflows practical, but they do not by themselves solve conversational continuity when users maintain several interdependent objectives. This conceptual systems paper focuses on the high-complexity end of that design space, where goals can be suspended, resumed, revised, and invalidated by actions in other goals. We introduce the Goal-Oriented Dialogue Runtime (GODR), a framework-neutral design pattern that treats goals, task frames, lifecycle state, invalidation rules, and resumption contracts as first-class runtime objects while delegating bounded execution to graph runtimes, agents, tools, or application programming interfaces (APIs). GODR is not proposed as a replacement for workflow graphs in simple guided processes; it is intended for complex, multi-domain, interruptible conversations where objective continuity cannot be recovered reliably from agent identity, chat history, or execution-graph position alone. The paper formalizes the problem, proposes runtime objects and architecture-selection criteria, and frames evaluation as an agenda for future empirical validation rather than as a measured performance claim.

88. 【2606.23724】EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering

Link: https://arxiv.org/abs/2606.23724

Authors: Fengchen Gu, Xiaotian Ren, Zhengyong Jiang, Zhilu Zhang, Ángel F. García-Fernández, Angelos Stefanidis, Mian Zhou, Huakang Li, Jionglong Su

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Large language models, outputs remain difficult, high-stakes financial workflows, Large language, earnings decks

Notes:

Abstract:Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, yet their outputs remain difficult to verify in high-stakes financial workflows. A fluent answer can blend directly grounded statements, weak synthesis, and unsupported claims across narrative text, tables, and charts. We present EvidenceLens, a visual analytics prototype that treats financial question answering as a claim-evidence alignment problem. The system decomposes an answer into atomic claims, summarizes support composition, confidence, and support gaps, and coordinates claim-level inspection with source passages, table cells, and chart regions. Its core visual representation is a multimodal claim-evidence matrix that makes coverage, contradiction, and modality imbalance immediately visible. To support reproducibility, we also specify a JSON-based artifact schema, a lightweight multimodal alignment pipeline, and a deterministic review-priority ranking that maps backend signals into an auditable visual structure. Through representative report-auditing scenarios, we show how EvidenceLens helps analysts distinguish grounded claims from overconfident synthesis that conventional chat interfaces flatten.

89. 【2606.23701】Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability

Link: https://arxiv.org/abs/2606.23701

Authors: Sherri Weitl-Harms, John Hastings

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Qualitative product feedback, nuanced user experiences, reveal nuanced user, Product Desirability Toolkit, difficult to measure

Notes: 20 pages, 6 figures, 11 tables. arXiv admin note: text overlap with [arXiv:2408.01527](https://arxiv.org/abs/2408.01527)

Abstract:Qualitative product feedback can reveal nuanced user experiences, but its implicit sentiment is difficult to measure. This paper presents a scalable and interpretable framework that uses large language models (LLMs) to quantify product desirability from such data. Using two Product Desirability Toolkit (PDT) datasets from ZORQ and CARMA comprising 106 respondent term groupings with gold-standard human annotation, zero-shot continuous numerical sentiment scoring and categorical sentiment classification are evaluated without relying on explicit review scores. Across the datasets, LLMs generated numerical sentiment scores directly from qualitative responses and closely matched expert labels, achieving Pearson correlations up to 0.97 and classification accuracy up to 94%. LLMs maintained robustness even when handling data presented in multiple forms and consistently expressed high confidence. In contrast, lexicon-based and transformer baselines did not produce statistically significant results. Among the models tested, GPT-4o-mini achieved performance comparable to larger models at 94% lower cost, supporting scalable deployment. The framework also incorporates model confidence ratings and human-readable rationale explanations (xAI), improving interpretability, transparency, and trust while supporting practical use in product satisfaction assessment. In general, using the PDT tool as a survey method along with a cost efficient LLM for sentiment analysis has the potential to provide for product evaluation with results that are rich in terms of sentiment scores (both numerical and classified sentiment) and in terms of the high-level user impressions of the product that can be used to identify ideas for product development and improvement, as well as marketing ideas for target audiences.


ACM classes: I.2.7; D.2.8; I.2.6; H.5.2

90. 【2606.23700】Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

Link: https://arxiv.org/abs/2606.23700

Authors: Arush Tagade, Shaoheng Zhou, Jiaxin Wen, Shi Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: evil character traits, harmful content, Emergent misalignment, vectors and evil, operates through disruption

Comments: 18 pages, 11 figures

Click to view abstract

Abstract:Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model's aligned character rather than direct learning of harmful content. Motivated by this connection, we study self-generated text recognition (SGTR) finetuning as a character-targeted intervention that is distinct from existing in-training defenses. We conduct two-stage finetuning experiments across three models (GPT-4.1, Qwen2.5-32B-Instruct, Seed-OSS-36B-Instruct) and multiple EM datasets to compare SGTR finetuning against benign finetuning baselines (correct domain-specific data, general knowledge, and word counting) to find it an effective defense in both reversal and prevention settings. We find that all interventions produce comparable EM reversal, but only when restoring capabilities that EM had degraded. For prevention, only SGTR finetuning consistently reduces misalignment without exacerbating any individual metric, suggesting that character fortification specifically drives prevention. We provide further evidence for EM's relation to the LLM's default character by showing that EM finetuning induces diversity into the LLM's identity self-reports, artificially corrupting self-recognition exacerbates misalignment caused by EM finetuning, and that removing the model's identity-bearing system prompt substantially reduces the effect of EM finetuning. Together, these findings reframe EM not as the adoption of a coherent misaligned persona but as the destabilization of aligned character.

91. 【2606.23695】Quantifying Prior Dominance in RAG Systems

Link: https://arxiv.org/abs/2606.23695

Authors: Barak Or

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: grounds Large Language, Large Language Models, Retrieval-Augmented Generation, current evaluations rely, grounds Large

Comments: 15 pages, Preprint

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from "epistemic blindness", failing to distinguish genuine contextual information extraction from parametric memory recall. To address this, we introduce the Normalized Context Utilization (NCU) metric, leveraging continuous token log-probabilities across zero-shot, oracle, and adversarial conditions to strictly quantify contextual information gain. Evaluating architectures ranging from 1.5B to 72B parameters alongside a proprietary commercial API reveals that for strict factual extraction (without Chain-of-Thought reasoning), traditional scaling laws exhibit extreme diminishing returns: highly efficient Small Language Models (SLMs) match or outperform high-capacity architectures. Furthermore, we demonstrate that "Prior Dominance" correlates with model scale and proprietary alignments. The evaluated commercial API not only overrode explicit external evidence in nearly half of adversarial conflicts, but also frequently suffered from systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted. Our findings highlight the structural epistemic advantage and superior contextual adherence of SLMs in strict extraction workflows.

92. 【2606.23694】ModTGCN: Modularity-aware Graph Neural Networks for Text Classification

Link: https://arxiv.org/abs/2606.23694

Authors: Rajarshi Misra, Aditya Sharma, Vinti Agarwal, Hari Om Aggrawal

Categories: Computation and Language (cs.CL)

Keywords: global community structure, strong class-consistent clustering, models typically rely, local neighborhood aggregation, overlook global community

Comments: PAKDD2026

Click to view abstract

Abstract:Graph-based text classification models typically rely on local neighborhood aggregation and overlook global community structure, despite semantic document graphs exhibiting strong class-consistent clustering. Ignoring this can blur class boundaries and lead to over-smoothing. We propose ModTGCN, a modularity-aware graph neural network for text classification that jointly optimizes cross-entropy and a modularity-based auxiliary objective to promote class-coherent document communities while preserving discriminative representations. The modularity term is computed on a document-document similarity graph derived from transformer embeddings (pretrained or fine-tuned). To improve scalability, we decouple the original heterogeneous TextGCN graph into separate document-word and word-word components, achieving 2x-10x faster training. We further study graph construction strategies, label-aware edge reweighting, and supervision choices for modularity optimization. Experiments on five benchmarks show consistent gains, with larger improvements on complex, low homophily datasets such as Ohsumed and 20NG.
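
The abstract does not spell out the modularity term, so the following is only a minimal sketch of the standard Newman modularity score on a toy document graph, with class labels standing in for communities. The function name, the toy edges, and the labels are illustrative assumptions; the paper's differentiable auxiliary objective may be formulated differently.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal edge fraction - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy document-document graph (hypothetical): two pairs of similar documents.
edges = [("d0", "d1"), ("d2", "d3")]
aligned = {"d0": "sports", "d1": "sports", "d2": "politics", "d3": "politics"}
mixed = {"d0": "a", "d1": "b", "d2": "a", "d3": "b"}

q_aligned = modularity(edges, aligned)  # labels agree with graph structure
q_mixed = modularity(edges, mixed)      # labels cut across every edge
```

A higher score for the aligned labeling than for the mixed one is the kind of signal a modularity-based objective would reward when document communities are class-coherent.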

93. 【2606.23693】EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

Link: https://arxiv.org/abs/2606.23693

Authors: Jaehoon Lee, CheolWon Na, Suyoung Bae, Jin-Seop Lee, Jihyung Lee, YunSeok Choi, Jee-Hyong Lee

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: executable SQL queries, Large Language Models, generating executable SQL, Language Models based, adopted Large Language

Comments: 20 pages, 8 figures

Click to view abstract

Abstract:Text-to-SQL enables users to query databases using natural language by generating executable SQL queries. Recent methods have increasingly adopted Large Language Model-based reinforcement learning (RL) to leverage execution feedback for training. However, existing RL methods assign uniform query-level rewards to all clauses in a SQL query, treating correct and incorrect clauses equally. This coarse-grained reward design leads to insufficient learning signals for correct SQL generation. To address this issue, we propose EXPO-SQL (EXecution-based clause-level Policy Optimization for Text-to-SQL), which provides fine-grained supervision through clause-level rewards. To assign clause-level rewards, our method identifies erroneous clauses by analyzing execution results, including error messages and clause-wise incremental execution. Experiments on widely-used Text-to-SQL benchmarks demonstrate that EXPO-SQL significantly outperforms existing supervised fine-tuning, prompting, and RL-based methods through fine-grained clause-level learning. Our code is available at https://github.com/jhn25/EXPO-SQL.

94. 【2606.24147】Progressive Alignment Objectives for Aligner-Encoder based ASR

Link: https://arxiv.org/abs/2606.24147

Authors: Jaeyong Lee, Masato Mimura, Takafumi Moriya

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: u-th encoder position, replace decoder attention, uth token directly, ASR models, encoder position

Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice, this alignment often forms abruptly in the upper layers, making training sensitive and brittle on long utterances. We propose InterAligner, which adds an intermediate Aligner objective so alignment can form progressively across depth, together with an intermediate CTC loss (InterCTC) to stabilize optimization. On LibriSpeech with a 17-layer Conformer, a final-only Aligner reaches 5.0/7.8 WER (test-clean/other). InterCTC improves this to 3.4/6.0, and InterAligner further reduces WER to 3.1/5.6, with the largest gains on long utterances.

Information Retrieval

1. 【2606.24775】Are We Ready For An Agent-Native Memory System?

Link: https://arxiv.org/abs/2606.24775

Authors: Wei Zhou, Xuanhe Zhou, Shaokun Han, Hongming Xu, Guoliang Li, Zhiyu Li, Feiyu Xiong, Fan Wu

Categories: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: large language model, simple retrieval-augmented mechanisms, supports persistent information, dynamic lifecycle governance, persistent information storage

Comments: Paper list available at: [this https URL](https://github.com/OpenDataBox/awesome-agent-memory). Source code available at: [this https URL](https://github.com/OpenDataBox/MemoryData)

Click to view abstract

2. 【2606.24346】PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation

Link: https://arxiv.org/abs/2606.24346

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Petroleum-engineering search exposes, strong general retrievers, relevant evidence exists, Petroleum Engineering Text, Petroleum-engineering search

Comments:

Click to view abstract

3. 【2606.24204】Unified Dominance Graph for Interval-Predicate Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2606.24204

Authors: Kwun Hang Lau, Ruiyuan Zhang, Elton Chun-Chai Li, Wun Yu Chan, Xiaojun Cheng, Xiaofang Zhou

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: Approximate Nearest Neighbor, Nearest Neighbor Search, Approximate Nearest, Nearest Neighbor, unstructured data retrieval

Comments:

Click to view abstract

Abstract:Approximate Nearest Neighbor Search (ANNS) is a core primitive for unstructured data retrieval. Real-world applications--such as temporal databases, financial data analysis, and retrieval-augmented generation--often require hybrid queries whose valid objects are constrained by continuous interval attributes, such as lifespans or price ranges. We study Interval-Predicate ANNS (IPANNS), where validity is determined by a predicate between an object interval and a query interval. Existing range-filtering ANNS (RFANNS) methods are designed for single-dimensional scalar filters, but interval predicates such as containment and overlap rely on two coupled endpoint constraints. Treating endpoints as independent scalar attributes can incur large intersection overhead, while containment-specific methods lack a generalized indexing abstraction. In this paper, we propose the Unified Dominance Graph (UDG), a graph-indexing framework for the closed two-bound conjunctive fragment of IPANNS. For a chosen interval predicate, UDG maps object and query endpoints into a normalized two-dimensional dominance space and builds a dominance-labeled graph over the transformed coordinates. Containment, overlap, and other supported endpoint-bound predicates therefore reuse the same construction and search algorithms after semantic mapping, while each UDG instance remains tied to its selected predicate. UDG compresses query-state-specific proximity graphs into one compact index. To improve graph search under restrictive interval filters, we add validity-preserving patch edges that provide routing choices when few objects remain valid. Extensive evaluations on standard benchmarks and real-world datasets show that UDG achieves stable query performance across multiple interval relations and workloads, significantly outperforming existing hybrid search baselines while maintaining low indexing overhead.
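
To make the idea of mapping interval predicates into a two-dimensional dominance space concrete, here is a minimal sketch. The specific coordinate transforms, predicate names, and helper functions are assumptions chosen for illustration; the abstract does not give the paper's normalization, and this sketch covers only the validity check, not the graph index.

```python
def to_point(interval, predicate, is_query):
    """Map a closed interval [lo, hi] to a 2-D point for a chosen predicate.

    Hypothetical mapping: an object is valid for a query exactly when the
    query point dominates the object point in both coordinates.
    """
    lo, hi = interval
    if predicate == "contained_in":
        # Object [a, b] inside query [l, r]  <=>  a >= l and b <= r.
        return (-lo, hi)
    if predicate == "overlaps":
        # Object [a, b] overlaps query [l, r]  <=>  a <= r and b >= l.
        return (hi, -lo) if is_query else (lo, -hi)
    raise ValueError(f"unsupported predicate: {predicate}")

def dominates(q, p):
    """True if point q is at least as large as point p in both coordinates."""
    return p[0] <= q[0] and p[1] <= q[1]

def is_valid(obj, query, predicate):
    """Check the interval predicate through the shared dominance test."""
    return dominates(to_point(query, predicate, is_query=True),
                     to_point(obj, predicate, is_query=False))

# Example: price ranges as object intervals, one query interval.
query = (1, 10)
contained = is_valid((3, 5), query, "contained_in")
sticks_out = is_valid((0, 5), query, "contained_in")
touches = is_valid((0, 2), query, "overlaps")
disjoint = is_valid((11, 12), query, "overlaps")
```

Both predicates reduce to the same `dominates` test after the mapping, which is the property that would let one construction and search procedure be reused across predicates.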

4. 【2606.24200】MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

Link: https://arxiv.org/abs/2606.24200

Authors: Junhyeok Lee, Han Jang, Hyeonjin Goh, Kyu Sung Choi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation, clinical settings increasingly, settings increasingly requires, English evidence corpora, increasingly requires multilingual

Comments: Under review. 15 pages, 3 figures

Click to view abstract

5. 【2606.24194】Dialogue to Discovery: Attribute-Aware Preference Elicitation for Conversational Product Search Assistants

Link: https://arxiv.org/abs/2606.24194

Authors: Sarthak Harne, Natwar Modani, Debabrata Mahapatra, Shubham Agarwal

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Conversational product search, search assistants offer, product search assistants, keyword-based product search, traditional keyword-based product

Comments:

Click to view abstract

6. 【2606.24188】Aspect-Based Sentiment Evolution and its Correlation with Review Rounds in Multi-Round Peer Reviews: A Deep Learning Approach

Link: https://arxiv.org/abs/2606.24188

Authors: Ruxue Hana, Haomin Zhoua, Jiangtao Zhong, Chengzhi Zhang

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Mining sentiment information, scientific evaluation process, offers valuable insights, comments offers valuable, Mining sentiment

Comments:

Click to view abstract

7. 【2606.24099】Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

Link: https://arxiv.org/abs/2606.24099

Authors: Yuzhuo Wang, Chengzhi Zhang, Min Song, Seong Deok Kim, Youngsoo Ko, Juhee Lee

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: artificial intelligence, central to scientific, era of artificial, algorithm, influence

Comments:

Click to view abstract

8. 【2606.24098】Is Higher Team Gender Diversity Correlated with Better Scientific Impact?

Link: https://arxiv.org/abs/2606.24098

Authors: Chengzhi Zhang, Jiaqi Zeng, Yi Zhao

Categories: Digital Libraries (cs.DL); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Keywords: garnered substantial attention, Natural Language Processing, gender diversity, gender, scientific impact

Comments:

Click to view abstract

Abstract:Collaborative research involving scholars of various genders constitutes a prominent theme in scientific research that has garnered substantial attention. While several studies have investigated the connection between gender-specific collaboration patterns and the scientific impact of papers, the specific gender diversity factors that contribute to enhanced scientific impact remain largely unexplored. In this study, we analyze the correlation between gender diversity and the scientific impact of papers using the examples of the Natural Language Processing (NLP) and Library and Information Science (LIS) domains. Our findings reveal three key observations. First, significant gender disparities exist in both the NLP and LIS domains, with underrepresentation of female scholars. The gender disparity is more pronounced in the NLP domain than in the LIS domain. Second, based on papers from the NLP and LIS domains, we find that papers with different gender compositions achieve varying numbers of citations, with mixed-gender collaborations gradually obtaining higher average citation counts compared to same-gender collaborations. Lastly, there is an inverted U-shaped relationship between the gender diversity of paper collaborations and the number of citations received by those papers. Based on the most impactful gender diversity calculations, the ideal gender ratio for NLP and LIS teams falls within a range where one gender constitutes 5% to 15% of the total number of authors. This paper contributes to the exploration of the most impactful gender diversity in collaborative research and offers insights to guide more effective scientific paper collaboration.

9. 【2606.23997】ChartWalker: Benchmarking the Cross-Chart RAG Task

Link: https://arxiv.org/abs/2606.23997

Authors: Ning Tang, Chenghan Xie, Hanyang Yuan, Yi Li, Renhong Huang, Qian Kou, Xiaofeng Shi, Hua Zhou, Jiarong Xu

Categories: Information Retrieval (cs.IR)

Keywords: complex multi-modal analytical, Cross-Chart Retrieval-Augmented Generation, multi-modal analytical tasks, critical for complex, complex multi-modal

Comments:

Click to view abstract

Abstract:Cross-Chart Retrieval-Augmented Generation (RAG) is critical for complex multi-modal analytical tasks in scientific, business, and political domains. However, existing benchmarks either focus on tables, which are well-structured and textualized, or generate cross-chart questions by simply extracting key points, which often induces lexical overlap between queries and evidence and yields logically inconsistent reasoning chains. To address this, we introduce ChartWalker, a novel framework for constructing challenging cross-chart RAG tasks. ChartWalker features a hierarchical knowledge graph construction method tailored to charts, which organizes entities and relations by granularity to preserve analytical structure. We then propose a structure-aware sampling algorithm that synthesizes semantically coherent, multi-hop reasoning paths, enabling explicit control over query difficulty and granularity for QA generation. Built with this framework, we release ChartWalker-Bench, a comprehensive benchmark spanning diverse domains and cross-chart query types. Extensive evaluations across major RAG paradigms reveal significant performance gaps, underscoring the benchmark's difficulty and utility. Furthermore, we provide ChartWalker-Agent, an agentic baseline to facilitate analysis and inspire future system design.

10. 【2606.23919】Unified Multi-Task Relevance Modeling for E-Commerce: Comparing Task Routing Architectures Across LLMs and Cross-Encoders

Link: https://arxiv.org/abs/2606.23919

Authors: Md Omar Faruk Rokon, Jhalak Nilesh Acharya, Shasvat Desai, Hong Yao, Kuang-chih Lee

Categories: Information Retrieval (cs.IR)

Keywords: product type similarity, query product matching, potentially conflicting learning, conflicting learning signals, pair relationship types

Comments: Accepted at E-commerce workshop, SIGIR 2026

Click to view abstract

Abstract:How can we build a single relevance model that handles six different entity-pair relationship types in e-commerce, from query-product matching to product-type similarity, when each task has different data volumes, different semantic requirements, and potentially conflicting learning signals? This question is important because current industry practice relies on separate models for each task, preventing knowledge transfer and producing inconsistent relevance signals. Our work is driven by the following insight: encoder-based and decoder-only models encode task identity through different mechanisms, so the choice of task routing architecture (how task identity is communicated to the shared model) affects these two families in asymmetric ways. As our key novelty, we combine three ideas: (a) a unified multi-task framework that jointly trains on six entity-pair tasks under a shared three-point relevance scale, (b) a systematic comparison of three task routing architectures (text-prefix routing, multi-head classification, and multi-head with private transformer layers) across both LoRA-adapted LLMs and fully fine-tuned cross-encoders, and (c) a majority-vote ensemble that exploits the diversity induced by private-layer routing. First, we show that the MHP Ensemble (multi-head with private layers) achieves 89.96% accuracy on 453K test examples, the highest across all configurations. Second, we show that removing text prefixes without private layers causes severe degradation for decoder-only LLMs while cross-encoders remain robust, suggesting an encoder-decoder asymmetry in task identity encoding. Third, we show that multi-task training yields up to 14% improvement on low-resource tasks over single-task baselines.
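
The majority-vote ensemble mentioned in (c) can be illustrated with a short sketch. This assumes each model outputs one label per example on the three-point relevance scale (0, 1, 2); the tie-breaking rule and the sample predictions are hypothetical, since the abstract does not describe how the ensemble resolves ties.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per example.

    predictions: list of label lists, one per model, all the same length.
    Ties go to the label predicted by the earliest-listed model
    (an assumed rule, not taken from the paper).
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted

# Hypothetical outputs of three differently routed models on three examples.
model_a = [2, 0, 1]
model_b = [2, 1, 1]
model_c = [0, 1, 2]
ensemble = majority_vote([model_a, model_b, model_c])
```

Voting only helps when the member models make different mistakes, which is why the abstract ties the ensemble to the diversity induced by private-layer routing.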

11. 【2606.23915】Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

Link: https://arxiv.org/abs/2606.23915

Authors: Tianyu Ding, Aditya Nannapaneni, Juan Pablo De la Cruz Weinstein

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: LLM retrieval-augmented generation, Practice often treats, treats automatic metrics, generation as interchangeable, retrieval-augmented generation

Comments:

Click to view abstract

12. 【2606.23911】Scaling Dense Retrieval with LLM-Annotated Training Data: Structured Mining and Progressive Curriculum for E-Commerce Sponsored Search

Link: https://arxiv.org/abs/2606.23911

Authors: Md Omar Faruk Rokon, Shasvat Desai, Jhalak Nilesh Acharya, Isha Shah, Kumar Priyam, Brahanyaa Somasundaram, Vamsee Tangirala, Minuteresa Thomas, Vivek Arora, Vijay Manchi, Hong Yao, Kuang-chih Lee

Categories: Information Retrieval (cs.IR)

Keywords: generate high-quality training, high-quality training data, generate high-quality, data for dense, dense retrieval models

Comments: Accepted at E-Commerce Workshop, SIGIR 2026

Click to view abstract

Abstract:How can we generate high-quality training data for dense retrieval models at production scale, without relying on click signals or manual annotation? This question is critical for e-commerce sponsored search, where click-based training suffers from position bias and tail-query sparsity, and manual labeling at the scale of hundreds of millions of query-item pairs is economically infeasible. Our work is driven by the following insight: heterogeneous retrieval systems disagree on most items they retrieve, and this disagreement creates a natural source of structured training signal: easy positives where all systems agree, hard positives that only lexical systems find, and hard negatives that fool exactly one system. As our key novelty, we combine three ideas into an end-to-end pipeline: (a) multi-channel retrieval mining with rank metadata from three production systems, (b) graded-relevance annotation by a calibrated three-model cascade that reaches 89.1% agreement with trained human annotators, and (c) three-stage progressive curriculum training that organizes 240M+ training examples across five difficulty levels. We deploy the trained two-tower BERT model on Walmart's sponsored search and evaluate it against 30K queries labeled by trained third-party human annotators. First, we show that the system achieves +5.1% NDCG@10 over the click-trained production baseline, with the largest gain on tail queries. Second, we show that embarrassing retrievals (rating 0) drop from 8.7% to 3.5%. Third, a two-week online A/B test with tens of millions of ad requests per arm confirms +2.80% ad spend, +1.4% CTR, +2.8% eCPM, and +2.9% click conversion rate. Overall, our work provides a practical and scalable blueprint for replacing click-based training with structured LLM-annotated supervision in production retrieval systems.
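
For readers unfamiliar with the NDCG@10 metric reported above, here is a minimal sketch of one common formulation. It assumes linear gain (the graded rating itself) and a log2 position discount; the abstract does not say which NDCG variant the authors use, and the ratings in the example are made up.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded ratings of retrieved items, in ranked order.
perfect = ndcg_at_k([2, 1, 0])   # already in ideal order
swapped = ndcg_at_k([0, 2])      # best item pushed to rank 2
```

Because the score is normalized per query by the ideal ordering, values are comparable across queries with different numbers of relevant items.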

13. 【2606.23889】INSPIRE: Intent-aware Neural Sponsored Product Retrieval for E-commerce

Link: https://arxiv.org/abs/2606.23889

Authors: Shasvat Desai, Hong Yao, Utkarsh Porwal, Kuang-chih Lee

Categories: Information Retrieval (cs.IR)

Keywords: Walmart holds, highest search traffic, ecommerce grocery market, drive a substantial, sponsored search revenue

Comments: Accepted to ACM SIGIR E-commerce Workshop, 2026

Click to view abstract

Abstract:Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences. From the advertiser's perspective, many products are explicitly designed to target specific intents, such as dietary preferences or size variants, and must be surfaced at the right moment to be effective. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent-aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries. We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations into a lightweight student LLM, using LoRA-based supervised fine-tuning to train it to predict intent attributes. Finally, we introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products.

14. 【2606.23881】Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Link: https://arxiv.org/abs/2606.23881

Authors: Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, Question Answering, requires grounding visual, Visual Question

Comments: Accepted by ACL 2026 Findings. Project page [this https URL](https://github.com/VAN-QIAN/ACL26-IBA/)

Click to view abstract

15. 【2606.23843】HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

Link: https://arxiv.org/abs/2606.23843

Authors: Hoang-Bao Le, Aiden Durrant, Thai Son Mai, Binh T. Nguyen, Liting Zhou, Cathal Gurrin

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: capture semantic correspondences, natural language, typically pre-trained, datasets to capture, correspondences between visual

Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode "what an image is not" alongside "what it is." HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.

16. 【2606.23724】EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering

Link: https://arxiv.org/abs/2606.23724

Authors: Fengchen Gu, Xiaotian Ren, Zhengyong Jiang, Zhilu Zhang, Ángel F. García-Fernández, Angelos Stefanidis, Mian Zhou, Huakang Li, Jionglong Su

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Large language models, outputs remain difficult, high-stakes financial workflows, Large language, earnings decks

Comments:

Click to view abstract

17. 【2606.23693】EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

Link: https://arxiv.org/abs/2606.23693

Authors: Jaehoon Lee, CheolWon Na, Suyoung Bae, Jin-Seop Lee, Jihyung Lee, YunSeok Choi, Jee-Hyong Lee

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: executable SQL queries, Large Language Models, generating executable SQL, Language Models based, adopted Large Language

Comments: 20 pages, 8 figures

Click to view abstract

Computer Vision

1. 【2606.24888】DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Link: https://arxiv.org/abs/2606.24888

Authors: Xingjian Leng, Jaskirat Singh, Zhanhao Liang, Ethan Smith, Martin Bell, Aninda Saha, Yuhui Yuan, Liang Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ImageNet, single evaluation setup, Diffusion transformer, generation, image generation

Comments:

Click to view abstract

Abstract:Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflect real progress in generative modeling. The natural alternative, i.e., text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I setups. Under NanoGen, training T2I requires comparable compute to ImageNet. After training 21 latent diffusion models with NanoGen, we observe that method ranking shows no strong correlation between ImageNet and T2I generation: Pearson correlation is between -0.377 and -0.580 across three metrics. This suggests that a method which improves class-conditional ImageNet FID may show no corresponding improvement on T2I, clearly indicating the necessity of evaluating DiTs on both tasks. To this end, we summarize ImageNet and text-to-image results, which yields DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.
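
The cross-benchmark comparison above rests on the Pearson correlation between per-method results on two tasks. A minimal sketch of that computation follows; the four pairs of scores are invented for illustration and are not taken from the paper, and both axes are assumed to be oriented so that higher is better.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical normalized scores for four methods on each benchmark.
imagenet_scores = [0.9, 0.7, 0.5, 0.3]
t2i_scores = [0.4, 0.6, 0.5, 0.7]
r = pearson(imagenet_scores, t2i_scores)
```

A coefficient near zero or below zero, as in this toy data, means that ranking methods by one benchmark says little about their ranking on the other.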

2. 【2606.24883】BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases

Link: https://arxiv.org/abs/2606.24883

Authors: Qi Chen, Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Ibrahim Ethem Hamamci, Sezgin Er, Ashwin Kumar, Yiwen Ye, Yuhan Wang, Yuyin Zhou, Akshay S. Chaudhari, Curtis Langlotz, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, Artificial intelligence, real-world clinical settings, achieved remarkable, remarkable success

Comments:

Click to view abstract

Abstract:Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlights the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code

3. 【2606.24876】FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Link: https://arxiv.org/abs/2606.24876

Authors: Orest Kupyn,Goutam Bhat,Philipp Henzler,Fabian Manhardt,Christian Rupprecht,Federico Tombari

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image requires strong, requires strong generative, strong generative priors, Generating explorable, single image requires

Comments:

Click to view abstract

Abstract:Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, often leading to poor gradient flow. FLAT solves this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at this https URL

4. 【2606.24874】FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Link: https://arxiv.org/abs/2606.24874

Authors: Haorui Ji,Weizhe Liu,Hongdong Li,Hengkai Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Gaussian Splatting, preserve high-frequency visual, high-frequency visual details, input images due, sparse voxel latents

Comments:

Click to view abstract

Abstract:Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT) and couple it with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.

5. 【2606.24849】IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Link: https://arxiv.org/abs/2606.24849

Authors: Zixuan Li,Haokun Lin,Yicheng Xiao,Zhiwei Li,Xinyang Song,Zelong Zheng,Yong He,Heng Yao,Ke Ding,Chao Yu,Chuan Yuan,Qi Li,Zhenan Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Unified multi-modal large, large language models, multi-modal large language, Unified multi-modal, spatial relations

Comments:

Click to view abstract

Abstract:Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.

6. 【2606.24847】Spherical-to-ERP Epipolar Rectification for Single-Axis Disparity in 360 Stereo

Link: https://arxiv.org/abs/2606.24847

Authors: Sahereh Obeidavi,Dieter Landes

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Comments: 7 Pages, 4 Figures, Conference

Click to view abstract

None

7. 【2606.24844】Bridging the Manifold Gap: Riemannian Residual Line Search for One-Step Image Editing

Link: https://arxiv.org/abs/2606.24844

Authors: Hongzhu Yi,Zhongtian Luo,Tong Li,Yiyan Fan,Jungang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: One-step diffusion editors, fixed update strength, update strength satisfies, single transport update, iterative optimization

Comments:

Click to view abstract

Abstract:One-step diffusion editors are fast because they avoid inversion and iterative optimization, but a single transport update must be aggressive enough to realize the target prompt and conservative enough to preserve the source image--and no fixed update strength satisfies both demands across edit types. We treat this tension as a post-hoc candidate-selection problem on top of energy-field transport rather than as a new editing model. Our proposed method, Riemannian Residual Line Search, first builds a stronger edit by estimating the local time curvature of the prompt-delta field and projecting the corrected direction back onto the update norm of the original first-order energy-field transport estimation. It then forms a small residual path from the source image to this strong edit, retains the original first-order output as one candidate, and picks the final image by maximizing target-prompt CLIP alignment. On a 700-sample PIE-Bench++ evaluation across 10 edit type IDs, our method achieves state-of-the-art (SOTA) performance among current one-step update algorithms.
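The candidate-selection step described in this abstract can be sketched as follows: build a few points on the straight path from the source to the strong edit, keep the first-order output as an extra candidate, and return whichever scores highest. This is only a toy sketch under stated assumptions. Images are stood in for by short lists of floats, `toy_score` replaces the CLIP alignment score, and the curvature-based construction of the strong edit is not modeled. All names here are hypothetical.

```python
def select_candidate(source, strong_edit, first_order, score, steps=4):
    """Pick the highest-scoring candidate from a residual path plus a fallback.

    Candidates are source + t * (strong_edit - source) for t in (0, 1],
    together with the unmodified first-order output.
    """
    candidates = [first_order]
    for k in range(1, steps + 1):
        t = k / steps
        candidates.append([s + t * (e - s) for s, e in zip(source, strong_edit)])
    return max(candidates, key=score)

# Toy stand-in for target-prompt alignment: negative squared distance to a target.
target = [1.0, 0.0]

def toy_score(x):
    return -sum((a - b) ** 2 for a, b in zip(x, target))

best = select_candidate(
    source=[0.0, 0.0],
    strong_edit=[2.0, 0.0],
    first_order=[0.4, 0.1],
    score=toy_score,
)
print(best)
```

Here the strong edit overshoots the toy target, so an intermediate point on the path wins over both the full-strength edit and the first-order output.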

8. 【2606.24829】GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction

Link: https://arxiv.org/abs/2606.24829

Authors: Chenrui Fan,Paolo Favaro

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Comments: 36 pages, 17 figures, 18 tables

Click to view abstract

None

9. 【2606.24817】High-Fidelity Synthetic Transmission Electron Microscopy Image Generation Using Diffusion Probabilistic Models for Data-Limited Semiconductor Metrology

Link: https://arxiv.org/abs/2606.24817

Authors: Johannes Boehm,Bappaditya Dey

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Transmission Electron Microscopy, Advanced semiconductor nodes, Electron Microscopy, Transmission Electron, semiconductor nodes drastically

Comments: To be presented at the 2026 International Symposium ELMAR, published by IEEE in the conference proceedings

Click to view abstract

Abstract:Advanced semiconductor nodes drastically increased demand for Transmission Electron Microscopy (TEM), yet destructive sample preparation, slow imaging and high costs severely limit the availability of diverse datasets needed for downstream machine learning (ML). Synthetic data generation is becoming essential, but current generative models often miss TEM-specific noise, structural detail, and stochastic variability crucial for evaluation. We present a Denoising Diffusion Probabilistic Model (DDPM) framework for synthetic TEM image generation under extreme data scarcity. A progressive patch-based training strategy scales from low-resolution patches to full images, enabling from-scratch training with only 15 samples. We integrate a custom TrivialAugment adaptation, cross-process domain transfer, classifier guidance, and RePaint-style inpainting, culminating in full-image generation that preserves global structural and spatial relationships in compliance with FAB metrology requirements. Beyond synthesis, we repurpose DDPM feature representations for segmentation, partitioning encoder feature maps to obtain coherent region masks. Our synthetic images achieve up to MS-SSIM 0.98 and qualitative expert assessment consistent with structural similarity results, facilitating downstream ML training for defect detection, segmentation, and metrology while preserving statistical and physical realism.

10. 【2606.24805】DDStereo: Efficient Dual Decoder Transformers for Stereo 3D Road Anomaly Detection

Link: https://arxiv.org/abs/2606.24805

Authors: Shiyi Mu,Zichong Gu,Zhiqi Ai,Yilin Gao,Shugong Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical safety challenges, safety challenges, faces two critical, critical safety, open-set

Comments:

Click to view abstract

Abstract:Stereo-based 3D object detection still faces two critical safety challenges: real-time performance and open-set generalization. Existing stereo 3D methods typically achieve twice the accuracy of monocular methods but suffer from significantly lower inference speeds, making them unsuitable for real-time applications. Meanwhile, recent advances in open-world detection have introduced open-set and open-vocabulary algorithms in monocular 2D and 3D settings, yet stereo-based open-set detection remains largely unexplored. To bridge this gap, we propose DDStereo, a novel Dual-Decoder Stereo Transformer for real-time open-set 3D object detection. DDStereo features two lightweight decoder branches: one for open-set foreground 2D detection and the other for 3D attribute regression. These decoders share object-level queries to achieve unified target-level alignment. To enhance inference efficiency, we designed a compact disparity feature extractor and a streamlined decoder architecture. Experiments on public stereo 3D benchmarks demonstrate that DDStereo achieves state-of-the-art accuracy under both closed-set and open-set protocols. Notably, our method surpasses existing stereo 3D detectors in inference speed and, for the first time, achieves real-time performance comparable to monocular approaches.

11. 【2606.24799】OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

Link: https://arxiv.org/abs/2606.24799

Authors: Chenrui Fan,Paolo Favaro

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Gaussian Splatting scene, Gaussian Splatting, rich open-world scene, Deformable Gaussian Splatting, Gaussian Splatting reconstruction

Comments: 40 pages, 33 figures, 19 tables

Click to view abstract

Abstract:Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today's generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames often contain inconsistencies across time. We introduce OrbitForge, an adapter built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. We use 3D reconstruction as an anchor to improve the 3D consistency of the generated video. We obtain a preliminary 3D reconstruction from a first generated video via Deformable Gaussian Splatting with a robust MedianGS proxy. We render views from a prescribed orbit to detect missing viewpoints. OrbitForge uses the text-to-video model to complete only the missing views, and reconstructs the completed orbit into a final Gaussian Splatting scene. This design requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and does not progressively generate views one step at a time. We further argue that this setting demands coverage-aware evaluation: local smoothness alone rewards methods that never attempt a full orbit. On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span, raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV on the coverage-quality.

12. 【2606.24797】EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

Link: https://arxiv.org/abs/2606.24797

Authors: Linpeng Huang,Weixing Chen,Zexin Chen,Yang Liu,Liang Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Video Large Language, Large Language Models, Large Language, Recent advances, video question answering

Comments:

Click to view abstract

Abstract:Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evaluation protocol in which each QA pair is explicitly annotated with supporting temporal evidence, thereby requiring joint reasoning and precise evidence localization. EG-VQA comprises 2,067 videos and 11,838 QA pairs with fine-grained evidence annotations. To evaluate predicted evidence, Evidence-Grounded F1 (EG-F1) is introduced as a unified metric in which temporal alignment and semantic consistency against ground-truth evidence are jointly measured. Experimental evaluation reveals that even strong proprietary models struggle to accurately ground their predictions, exposing a fundamental discrepancy between answer correctness and faithful evidence localization. To bridge this gap, EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision, is proposed. State-of-the-art performance is achieved among open-source models, with results competitive against proprietary systems; particularly pronounced gains are observed on reasoning-intensive tasks such as counterfactual questions. These findings demonstrate that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for the development of more reliable and interpretable VideoQA systems.

13. 【2606.24796】Pocket-SLAM: Rendering-Area-Aware Pruning for Memory-Efficient 3DGS-SLAM

Link: https://arxiv.org/abs/2606.24796

Authors: Leshu Li,Jie Peng,Yang Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: garnered significant attention, capturing fine-grained geometry, fine-grained geometry features, Gaussian Splatting, attention in Simultaneous

Comments: 2026 IEEE International Conference on Robotics and Automation (ICRA)

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation: memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area-aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics such as opacity or gradient magnitude. This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2 times FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area-aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at this https URL.

14. 【2606.24786】Counting Trees from Satellite Imagery with Noisy Supervision

Link: https://arxiv.org/abs/2606.24786

Authors: Dimitri Gominski,Maurice Mugabowindekwe,Qiue Xu,Xiaowei Tong,Martin Brandt,Hieu Le,Rasmus Fensholt,Dimitris Samaras,Loic Landrieu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, environmental monitoring, fundamental task, task for environmental, remains largely

Comments:

Click to view abstract

Abstract:Counting individual trees is a fundamental task for environmental monitoring, yet remains largely unexplored with satellite imagery. At these resolutions, isolated trees may still be identifiable, but crown boundaries become ambiguous in dense forests, making the notion of an individual tree inherently ill-defined. Moreover, large-scale manual annotations of individual trees are prohibitively expensive. While scalable supervision can be derived from airborne LiDAR, the resulting annotations are noisy and difficult to exploit effectively. We address these challenges by formulating tree counting as a spatial density matching problem supervised through Unbalanced Optimal Transport. This formulation naturally accommodates both precise localization of isolated trees and robust density estimation in dense forests. We further introduce a self-correction mechanism that leverages transport residuals to progressively refine noisy supervision during training. We evaluate our approach on TinyTrees, a new benchmark spanning three continents and three satellite sensors, comprising over 215 million tree annotations (including 773K manually verified instances) across 23,000 this http URL. Our method consistently outperforms detection-based, regression-based, and transport-based distribution-matching baselines, demonstrating the effectiveness of unbalanced transport and reliability-aware supervision for large-scale tree counting from satellite imagery. Code, data and models are available at this https URL.
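For readers unfamiliar with unbalanced optimal transport, the following is a minimal sketch of one common variant: entropic OT with KL-relaxed marginals, solved by Sinkhorn-style scaling. It is not the paper's formulation or code. The function name, the 1D toy grid, and the parameters `eps` (entropic regularization) and `rho` (marginal relaxation strength) are assumptions made for illustration.

```python
import math

def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals (Sinkhorn-style scaling)."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    power = rho / (rho + eps)  # power < 1 softens the marginal constraints
    for _ in range(iters):
        u = [(a[i] / sum(K[i][j] * v[j] for j in range(m))) ** power for i in range(n)]
        v = [(b[j] / sum(K[i][j] * u[i] for i in range(n))) ** power for j in range(m)]
    # Transport plan: P[i][j] = u[i] * K[i][j] * v[j]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy 1D grid with three cells; predicted and annotated densities differ in total mass.
positions = [0.0, 1.0, 2.0]
cost = [[(p - q) ** 2 for q in positions] for p in positions]
predicted = [1.0, 2.0, 1.0]
annotated = [1.0, 1.0, 0.5]

plan = unbalanced_sinkhorn(predicted, annotated, cost, eps=0.5, rho=1.0)
mass = sum(sum(row) for row in plan)
print(f"predicted mass {sum(predicted):.2f}, annotated mass {sum(annotated):.2f}, "
      f"transported mass {mass:.2f}")
```

Unlike balanced OT, the two densities do not need equal total mass, so the plan's marginals only approximately match the inputs. That tolerance is what makes this family of methods a natural fit when the annotated counts are noisy.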

15. 【2606.24784】AerialFusionMapNet: Online HD Map Construction with Aerial-Onboard BEV Fusion

Link: https://arxiv.org/abs/2606.24784

Authors: Daniel Lengerer,Mathias Pechinger,Klaus Bogenberger,Carsten Markgraf

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated driving perception, High-resolution aerial imagery, High-resolution aerial, scene understanding, onboard sensors

Comments: Accepted at the IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026

Click to view abstract

Abstract:High-resolution aerial imagery has recently emerged as a complementary modality for automated driving perception and has shown potential to improve bird's-eye-view (BEV) scene understanding when fused with onboard sensors. Prior work demonstrated performance gains for online high-definition (HD) map construction through aerial-onboard fusion; however, conventional end-to-end fusion does not fully exploit the structural information contained in aerial representations. In this work, we introduce AerialFusionMapNet, a fusion-based mapping framework with a structured two-stage training strategy that explicitly enhances the contribution of aerial features within a unified pipeline. The proposed training scheme enables more effective integration of structural aerial priors. On the nuScenes geographic split, AerialFusionMapNet achieves up to 54.7 mAP, improving over prior aerial-onboard fusion baselines from 48.8 mAP by +5.9 absolute and +12.1% relative. The results suggest that structured training design, rather than increased architectural complexity, plays a more decisive role in unlocking the full potential of aerial imagery for online HD map construction. Code and trained models are available at this https URL.
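The absolute and relative gains quoted in this abstract follow directly from the two mAP figures. A short check, using only the numbers stated above (variable names are illustrative):

```python
baseline_map = 48.8  # prior aerial-onboard fusion baseline (mAP), from the abstract
new_map = 54.7       # AerialFusionMapNet (mAP), from the abstract

absolute_gain = new_map - baseline_map
relative_gain = absolute_gain / baseline_map * 100

print(f"absolute: +{absolute_gain:.1f} mAP")  # absolute: +5.9 mAP
print(f"relative: +{relative_gain:.1f}%")     # relative: +12.1%
```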

16. 【2606.24774】Revealing Training Data Exposure in Vision Language Large Models via Parameter Gradients

Link: https://arxiv.org/abs/2606.24774

Authors: Zhihao Zhu,Hongyi Tang,Yi Yang,Ahmed Abbasi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-Language Large Models, massive crawled corpora, crawled corpora raise, corpora raise pressing, raise pressing copyright

Comments:

Click to view abstract

Abstract:Vision-Language Large Models (VLLMs) trained on massive crawled corpora raise pressing copyright and data-provenance concerns. These concerns are particularly acute in healthcare, where patient medical images paired with clinical reports demand rigorous privacy safeguards. However, existing training data detection methods either fail in cross-modal scenarios or rely on superficial output signals with insufficient discriminative power. We introduce GradAudit, a gradient-based auditing framework that examines internal optimization dynamics rather than treating VLLMs as black boxes. Our approach builds on a key observation: model parameters converge to regions where gradients on training samples become stable and well-aligned, whereas gradients on non-training samples remain noisy and inconsistent. By analyzing these gradient signatures, GradAudit achieves strong separability and detects genuine image-text associations learned during training, not merely individual modality membership. Empirically, across both medical and general-domain datasets, GradAudit substantially outperforms state-of-the-art baselines in both pretraining and fine-tuning VLLMs. In a case study employing copyrighted content, we show that existing training data detection methods not only underestimate the extent of unauthorized data usage, but that this underestimation becomes more pronounced as models become more recent and more advanced.

17. 【2606.24767】Compact Object-Level Representations with Open-Vocabulary Understanding for Indoor Visual Relocalization

Link: https://arxiv.org/abs/2606.24767

Authors: Zhaopeng Cui,Jiarui Hu,Jingbo Liu,Boming Zhao,Xiyue Guo,Boyin Feng,Haocheng Peng,Yujun Shen,Hujun Bao,Guofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Indoor visual relocalization, Indoor visual, visual relocalization plays, embodied AI applications, plays a critical

Comments: Accepted by RA-L 2026

Click to view abstract

Abstract:Indoor visual relocalization plays a critical role in emerging spatial and embodied AI applications. However, prior research was predominantly devoted to low-level vision schemes, struggling to perceive scene semantics and compositions, which limits both interpretability and applicability. In this paper, we explore the issue of how to organize rich object information in a scene, including semantics, layout, and geometry, into a structured map representation, thereby utilizing object units exclusively to drive the camera relocalization task. To this end, we propose OpenReLoc, a camera relocalization system designed to provide scene understanding and accurate pose estimation capabilities. Leveraging recent foundation models, we first introduce a multi-modal mechanism to integrate open-vocabulary semantic knowledge for effective 2D-3D object matching. Additionally, we design object-oriented reference frames as position priors, paired with a reference frame selection strategy based on the Distance-IoU (DIOU), enabling extension to scalable scenes. Moreover, to ensure stable and accurate pose optimization, we also propose a dual-path 2D Iterative Closest Pixel loss guided by object shape. Experimental results demonstrate that OpenReLoc achieves superior relocalization recall and accuracy across various datasets. Our source code will be released upon acceptance.
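The Distance-IoU mentioned in this abstract is commonly defined for axis-aligned 2D boxes as `DIoU = IoU - d^2 / c^2`, where `d` is the distance between box centers and `c` is the diagonal of the smallest enclosing box. The sketch below implements that standard 2D definition only. How the paper applies it to reference frame selection is not described in the abstract, and the function name and the assumption of positive-area boxes are ours.

```python
def diou(box_a, box_b):
    """Distance-IoU for axis-aligned boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centers.
    dx = (ax1 + ax2) / 2 - (bx1 + bx2) / 2
    dy = (ay1 + ay2) / 2 - (by1 + by2) / 2
    center_dist2 = dx ** 2 + dy ** 2

    # Squared diagonal of the smallest box enclosing both inputs.
    diag2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - center_dist2 / diag2

print(diou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(diou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike plain IoU, which is zero for every pair of disjoint boxes, DIoU keeps decreasing as the boxes move apart, so it can still rank non-overlapping candidates.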

18. 【2606.24759】UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving

Link: https://arxiv.org/abs/2606.24759

Authors: Xiaowei Gao,Pengxiang Li,Yitai Cheng,Ruihan Xu,James Haworth,Stephen Law,Yun Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent multimodal large, multimodal large language, shown strong potential, large language models, Recent multimodal

Comments:

Click to view abstract

Abstract:Recent multimodal large language models (MLLMs) have shown strong potential for autonomous driving scene understanding, yet existing methods still face a fundamental trade-off between temporal reasoning and spatial precision. Models that rely on single-frame or low-resolution inputs often miss small, distant, or partially occluded hazards, while language-centric driving models frequently provide limited grounded evidence for their explanations. To address this gap, we propose UniDrive, a unified visual-language and grounding framework for interpretable risk understanding in autonomous driving. UniDrive combines a temporal reasoning branch that models scene dynamics from multi-frame visual input with a high-resolution perception branch that preserves fine-grained spatial details from the latest frame. The two branches are integrated through a gated cross-attention fusion module, enabling dynamic context to be aligned with precise spatial evidence. Based on the fused representation, UniDrive jointly generates natural-language risk descriptions and grounded bounding-box outputs for risk objects. Experiments on the DRAMA-Reasoning benchmark show that UniDrive outperforms representative image-based and video-based baselines in both captioning and risk-object grounding. In particular, UniDrive achieves the best overall performance on the validation split and demonstrates clear advantages in small-object localization, zero-shot generalization to NuScenes and BDD100K, and human-rated interpretability and trustworthiness. These results suggest that explicitly combining temporal semantics and high-resolution perception provides a stronger foundation for interpretable and safety-oriented autonomous driving systems. The code is available at this https URL.

19. 【2606.24756】Adaptive Hebbian Memory Routing in Vision Transformers for Few-Shot Learning

Link: https://arxiv.org/abs/2606.24756

Authors: Mohammed Yusuf Mujawar,Noorbakhsh Amiri Golilarz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image recognition requires, Adaptive Hebbian Routing, Few-shot image recognition, Adaptive, Hebbian

Comments:

Click to view abstract

Abstract:Few-shot image recognition requires models to adapt to new classes from a small labeled support set. Hebbian fast-weight memory can provide temporary associative information during an episode, but fixed memory behavior may not be appropriate for every few-shot task. In this work, we propose Adaptive Hebbian Routing for few-shot Vision Transformers. The method uses a lightweight MLP router to control the contribution of Hebbian memory, the strength of memory updates, and the retention of previous memory from support-set features. We study Adaptive Placement, Adaptive Plasticity, and Fully Adaptive Hebbian Routing. Experiments use ViT-Small, DeiT-Small, and Swin-Tiny under 5-way 1-shot evaluation on Omniglot, CIFAR-FS, and cross-domain transfer from CIFAR-FS to Omniglot. In the direct Swin comparison, fixed and adaptive Hebbian variants use the same memory location. Adaptive Plasticity improves the fixed Hebbian result from 96.74% to 96.92%, while Fully Adaptive Routing achieves the best result at 96.94%. The fully adaptive Swin model also reduces inference time from 16.51 ms to 14.05 ms relative to fixed Hebbian Swin. On CIFAR-FS, adaptive variants improve performance across all three backbones, and the multi-shot evaluation shows that these gains remain useful as the number of support examples increases. These results show that adaptive plasticity and adaptive memory activation can improve few-shot Transformer representations beyond fixed Hebbian behavior.
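The three quantities the router is said to control (memory contribution, update strength, retention) map naturally onto a generic gated Hebbian fast-weight update. The sketch below shows that generic form, `M <- retention * M + plasticity * outer(value, key)`, with a gated read. It is an assumption about the general technique, not the paper's architecture. The function names and the fixed scalar values standing in for router outputs are hypothetical.

```python
def hebbian_update(memory, key, value, retention, plasticity):
    """M <- retention * M + plasticity * outer(value, key)."""
    return [
        [retention * memory[i][j] + plasticity * value[i] * key[j]
         for j in range(len(key))]
        for i in range(len(value))
    ]

def read_memory(memory, query, features, gate):
    """Return features + gate * (M @ query)."""
    recalled = [sum(row[j] * query[j] for j in range(len(query))) for row in memory]
    return [f + gate * r for f, r in zip(features, recalled)]

# Hypothetical router outputs; fixed scalars stand in for what an MLP router would produce.
gate, plasticity, retention = 0.5, 1.0, 0.9

memory = [[0.0, 0.0], [0.0, 0.0]]
memory = hebbian_update(memory, key=[1.0, 0.0], value=[0.0, 2.0],
                        retention=retention, plasticity=plasticity)
out = read_memory(memory, query=[1.0, 0.0], features=[1.0, 1.0], gate=gate)
print(out)
```

Setting `gate` to zero disables the memory entirely, while `retention` below one lets older associations decay. An adaptive router would choose these per episode instead of fixing them.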

20. 【2606.24740】BioMedVR: Confusion-Aware Mixture-of-Prompt Experts for Biomedical Visual Reprogramming

Link: https://arxiv.org/abs/2606.24740

Authors: Jiaxiang Liu,Tianxiang Hu,Juwei Guan,Yujie Wu,Yusong Wang,Yao Mu,Zuozhu Liu,Mingkun Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, CLIP have demonstrated, demonstrated strong generalization, advances in vision-language, demonstrated strong

Comments: Accepted at ECCV 2026. 19 pages, 6 figures. Project page: https://jxliu-ai.github.io/biomedvr-page/

Click to view abstract

Abstract:Recent advances in vision-language models (VLMs) such as CLIP have demonstrated strong generalization across natural-image domains. However, adapting these models to biomedical imaging is non-trivial: full-model fine-tuning is computationally expensive, while medical data are often scarce and exhibit subtle, fine-grained inter-class differences, making parameter-efficient adaptation particularly critical. Visual Reprogramming (VR) offers a parameter-efficient alternative by injecting learnable perturbations into the input space, but existing VR approaches for VLMs mainly focus on positive class prompts and overlook confusing negatives, leading to miscalibrated predictions in fine-grained medical scenarios. We present BioMedVR, the first VR-based framework for biomedical imaging, enabling few-shot adaptation of pretrained VLMs through compact learnable VR modules. To mitigate class confusion, we introduce a Confusion Minimization Mechanism that leverages LLM-generated confusion-aware attributes together with a Confusion-Suppression Loss to explicitly reduce false-positive alignment. Moreover, the designed Mixture-of-Prompt Experts combines a positive expert for main-class discrimination and a negative expert for confusion suppression, balanced via adaptive gating. Extensive experiments on 18 datasets, including 11 biomedical datasets and 7 natural image benchmarks, demonstrate that BioMedVR achieves superior accuracy and generalization, effectively bridging VR and VLMs in biomedical domains.

21. 【2606.24737】VSANet: View-aware Sparse Attention Network for Light Field Image Denoising

Link: https://arxiv.org/abs/2606.24737

Authors: Gargi Panda,Soumitra Kundu,Saumik Bhattacharya,Aurobinda Routray

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Light field, challenging due, high-dimensional structure, Light, view-aware sparse attention

Comments:

Click to view abstract

Abstract:Light field (LF) image denoising is challenging due to the high-dimensional structure of LF data. While noise is independent across sub-aperture images, scene content exhibits strong cross-view correlations. We introduce VSANet, a view-aware sparse attention network for LF denoising. Specifically, we propose a view-aware sparse attention (VSA) block that represents the 4D LF feature map as a unified spatial-angular token space and performs cross-view aggregation via locality-sensitive hashing-based sparse attention. This enables global feature interactions with linear complexity, effectively exploiting LF correlations across views and spatial locations. In addition, we design a feature refinement (FR) block to emphasize informative features in spatial, angular, and epipolar subspaces. The VSA and FR blocks are integrated within a sequential attention refinement module, forming the core of VSANet. Experiments demonstrate that VSANet outperforms state-of-the-art LF denoising methods.

22. 【2606.24726】SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards

Link: https://arxiv.org/abs/2606.24726

Authors: Sheng Xia, Zhengqin Lai, Tianxiang Jiang, Kanghui Tian, Shoujun Zhou, Bin Li, Yi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating correct answers, correct answers based, fine-grained spatio-temporal reasoning, frames or objects, MLLMs often struggle

Comments:

Abstract:Video MLLMs often struggle with fine-grained spatio-temporal reasoning, sometimes generating correct answers based on irrelevant frames or objects. Although outputting spatio-temporal evidence during reasoning is a promising direction, existing RL frameworks typically rely on geometry-only (IoU) rewards, which can be sensitive to boundary perturbations and overlook semantic alignment. To address this, we propose Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task. Instead of computing pixel-level overlap, SER uses a referee VLM as a local checker to evaluate model-generated evidence claims across two dimensions: relevance and localization quality, combined with a temporal penalty. This design reduces the reliance on dense box annotations and enables training directly on standard video QA data. On the V-STAR benchmark, SER achieves 49.6% mLGM, improving by 3.0 points over the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential in enhancing both answer accuracy and evidence grounding.
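
To make the boundary-sensitivity point concrete, here is a minimal sketch of a geometry-only reward, assuming the evidence is a temporal segment given as `(start, end)` in seconds. The function name and the example numbers are illustrative and not taken from the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

gt = (10.0, 12.0)                       # annotated 2-second evidence segment
print(temporal_iou((10.0, 12.0), gt))   # exact match
print(temporal_iou((10.5, 12.5), gt))   # same length, shifted by 0.5 s
print(temporal_iou((11.0, 13.0), gt))   # same length, shifted by 1.0 s
```

A half-second shift on a two-second segment already drops the IoU to 0.6, although the predicted segment may show the same event. This is the kind of sensitivity a semantic check is meant to avoid.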

23. 【2606.24716】Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

Link: https://arxiv.org/abs/2606.24716

Authors: Jonas Klotz, Cassio F. Dantas, Pallavi Jain, Diego Marcos, Begüm Demir

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: measuring semantic correspondence, methods largely rely, Sparse autoencoders, vision language models, existing evaluation methods

Comments: Accepted at ECCV 2026

Abstract:Sparse autoencoders (SAEs) are increasingly used to extract interpretable concepts from vision and vision language models, yet existing evaluation methods largely rely on proxy metrics or qualitative inspection rather than measuring semantic correspondence. We present a human-grounded evaluation framework that quantifies alignment between SAE latents and human-annotated concepts, without requiring user studies, and validate this matching through targeted attribute perturbations. To enable this intervention-style evaluation in vision, we construct synCUB and synCOCO, synthetic benchmarks of paired images that differ in exactly one attribute. We introduce Fully-Binary Matching Pursuit (FBMP), a coalition-based matching procedure that supports many-to-one mappings between SAE latents and annotated concepts, and consistently outperforms one-to-one baselines. For functional validation, we propose a Targeted Attribute Perturbation Alignment Score (TAPAScore), which tests whether matched concepts respond selectively and in the expected direction under targeted image-level attribute perturbations. Under sanity checks, our matching and TAPAScore are the only evaluated metrics that reliably distinguish trained SAEs from untrained ones. Across SAEs trained on CLIP and DINOv2 embeddings, we find that increased overcompleteness can reduce perturbation alignment, indicating a reduction in interpretability. Our evaluation framework suggests that moderate dictionary sizes provide the best trade-off, yielding the most interpretable SAEs. Code and datasets are available at this https URL.

24. 【2606.24649】Agentic Collaborative Cognition for Zero-Shot 3D Understanding

Link: https://arxiv.org/abs/2606.24649

Authors: Wenxin Wang, Bo Zhang, Feng Chen, Zixuan Wang, Wen Li, Changsheng Li, Yinjie Lei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

Comments: Accepted by ECCV 2026. Project page: [this https URL](https://zhangbo135.github.io/agentic-collaborative-cognition/)

Abstract:Recent advancements have explored agentic zero-shot 3D understanding by reformulating it as video keyframe understanding with Multimodal Large Language Models (MLLMs). However, existing methods face an intrinsic bottleneck due to the finite observation perspectives inherent in videos and the implicit perception of 3D scenes. In this paper, we propose a collaborative multi-agent framework that assigns a Planning Agent to handle high-level viewpoint planning and supplement novel perspectives, and a Perception Agent to explicitly summarize the 3D scene into a structured holistic cognitive map. Specifically, the Planning Agent first analyzes this cognitive map to determine query-relevant viewpoints and supplements missing critical perspectives to ensure comprehensive observation. Subsequently, the Perception Agent documents object-level attributes from these views by assigning consistent instance identifiers across viewpoints, thereby integrating fragmented observations into the holistic cognitive map. In parallel, it provides feedback to filter out mismatched candidate objects and guide subsequent viewpoint planning. Through this closed-loop iterative process, the two agents collaboratively identify candidates until the Perception Agent determines that sufficient information has been captured to complete the task. Extensive experiments demonstrate that our method achieves state-of-the-art performance on 6 benchmarks, with improvements of 11.1% Acc@0.5 on ScanRefer, 14.6 BLEU-1 on 3D-assisted dialog, and 2.1 EM on SQA3D.

25. 【2606.24628】ArtiTwinSplat: Interactable Digital Twin Reconstruction via Gaussian Splatting from RGB-D videos

Link: https://arxiv.org/abs/2606.24628

Authors: Pranjal Mishra, René Zurbrügg, Max Wilder-Smith, Marco Hutter, Marc Pollefeys, Zuria Bauer, Hermann Blum

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: unstructured real-world environments, environments needs accurate, Deploying robots, Deploying, unstructured real-world

Comments: Presented at the ICRA 2026 Workshop on Advances and Challenges in AI-Driven Automation and Robotic System Integration with Digital Twins, Vienna, June 2026

Abstract:Deploying robots in unstructured real-world environments requires accurate, interactive models of the objects. Constructing these models at scale remains a critical bottleneck for robotic system integration. We present ArtiTwinSplat, a framework that automatically constructs articulated, photo-realistic digital twins of objects directly from RGB-D videos, requiring no CAD models, simulation assets, or manual annotations. Our method is built on 3D Gaussian Splatting, which preserves geometric fidelity and photometric realism, coupled with an unsupervised articulation discovery pipeline that recovers part structure and joint kinematics from observed motion alone. With tracking and optimization stages, our method provides stable, queryable digital twins that support real-time rendering, viewpoint control, and interactive manipulation. Unlike prior methods confined to simulation, ArtiTwinSplat operates directly on real-world observations and produces twins that are immediately usable by downstream robot planning and learning systems. This method offers a practical, scalable pathway toward digital twin construction, lowering the integration barrier for articulated object manipulation in embodied AI and human-robot collaboration contexts.

26. 【2606.24602】ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering

Link: https://arxiv.org/abs/2606.24602

Authors: Zhentao Guo, Chen Duan, Tongkun Guan, Zining Wang, Kai Zhou, Pengfei Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: temporally distributed textual, distributed textual cues, video text understanding, current MLLMs, text understanding

Comments: Accepted by ECCV 2026

Abstract:Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across multiple frames. This perception challenge fundamentally differs from static image text understanding, yet existing datasets fail to capture it: the vast majority of questions remain answerable from single frames, inadequately reflecting real-world video text comprehension demands. To address this, we present ViTexQA, a large-scale video-text QA dataset, and FrameThinker for robust multi-frame temporal reasoning. We build ViTexQA via a quality-controlled Chain-of-Thought (CoT) annotation pipeline boosted with temporal constraints; all its QA pairs demand cross-frame text fusion to solve, enforcing true temporal reliance. FrameThinker adopts two-stage training for explicit temporal modeling: CoT-Guided Supervised Fine-Tuning (SFT) generates frame-aware reasoning chains, followed by Temporally-grounded Reinforcement Learning (RL) optimized with multi-frame coherence rewards. Evaluations show our method outperforms SOTA baselines on ViTexQA, lifting ROUGE-L by 6.3%.

27. 【2606.24586】EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics

Link: https://arxiv.org/abs/2606.24586

Authors: Nahuel Gonzalez, Marta Robledo-Moreno, Ivan DeAndres-Tame, Ruben Vera-Rodriguez, Ruben Tolosana

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Equal Error Rate, Error Rate, primary evaluation metric, Equal Error, Deep learning approaches

Comments:

Abstract:Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate (EER). This paper introduces EERLoss: a subdifferentiable, arbitrarily accurate approximation to EER for training deep biometric models. Furthermore, this framework has the potential to be adapted to optimize any specific operating point on the DET curve, enhancing its generalizability. To validate this approach, EERLoss is evaluated on a particularly demanding behavioral biometric modality: keystroke dynamics verification. This task is characterized by its high intra-class and low inter-class variability. Experiments are conducted on the large-scale KVC-onGoing benchmark, incorporating data from over 185,000 subjects across different scenarios. A comprehensive ablation study initially demonstrates the superiority of EERLoss in comparison to existing state-of-the-art loss functions. It also converges substantially faster than other losses, reducing the overall training cost. Additionally, a comparison is made between the proposed loss and the KVC-winning architecture by re-training it with EERLoss, demonstrating that the proposed approach significantly outperforms the original SoTA, achieving a relative EER reduction of up to approximately 30%. This improvement on a challenging, large-scale benchmark validates the effectiveness of EERLoss as a task-aligned training objective specifically suited for high-variance biometric traits.
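
For readers unfamiliar with the metric, the sketch below computes the empirical EER from lists of genuine and impostor similarity scores by sweeping a threshold. It shows the evaluation metric only, not the EERLoss surrogate, whose form the abstract does not give. The function name and the toy scores are assumptions.

```python
def equal_error_rate(genuine, impostor):
    """Empirical EER: sweep thresholds and return the error rate at the
    point where false acceptance and false rejection are closest."""
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine users rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

genuine = [0.9, 0.8, 0.7, 0.4]    # higher score = more similar
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))  # 0.25
```

Both error rates are counts over hard threshold comparisons, so they are piecewise constant in the scores and give no useful gradient. That is why training on EER directly requires a differentiable approximation.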

28. 【2606.24570】Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning

Link: https://arxiv.org/abs/2606.24570

Authors: Julien Khlaut, Charles Corbière, Baptiste Callard, Amaury Prat, Leo Butsanets, Antoine Saporta, Théo Danielou, Leo Machado, Korentin Le Floch, Tom Boeken, Pierre Manceron, Corentin Dancette

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language contrastive pretraining, Vision-language contrastive, leveraging the large, clinical practice, dominant recipe

Comments:

Abstract:Vision-language contrastive pretraining has become the dominant recipe for 3D medical foundation models, leveraging the large volumes of paired scans and reports produced in clinical practice. However, medical images usually span dozens of organs, and radiological reports are much longer than typical natural image captions and are composed of multiple structured sections. CLIP-style pretraining compresses this structure by encoding each modality into a single global token, at the risk of losing important details. We introduce ConQuer (Concept Queries), an image-text pretraining method that augments CLIP's global alignment with a set of localized alignments, one per concept. ConQuer splits the report into concept-specific sections and learns cross-attention queries that pool the matching image features without using any segmentation mask or spatial supervision. Contrastive learning is then applied independently for each concept. Concepts can be any unit of semantic localization; here, they are anatomical regions, one query per organ or gross body region. As a byproduct, each query learns attention maps focused on its concept, providing built-in spatial interpretability. We use ConQuer to train Jolia, a 3D CT foundation model on chest and abdominal CT. Jolia consistently outperforms a CLIP baseline on findings classification, report generation, and cross-center transfer, and sets a new state of the art across multiple public benchmarks. Jolia's weights will be released upon acceptance.

29. 【2606.24567】Multilevel Stochastic Plug-and-Play for Sparse-View CT Reconstruction

Link: https://arxiv.org/abs/2606.24567

Authors: Antoine De Paepe, Alexandre Bousse, Dimitris Visvikis

Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Sparse-view computed tomography, reduces radiation exposure, projection views makes, problem severely ill-posed, Sparse-view computed

Comments: 12 pages, 6 figures, 3 tables

Abstract:Sparse-view computed tomography (SVCT) reduces radiation exposure and acquisition time, but the limited number of projection views makes the reconstruction problem severely ill-posed and leads to streak artifacts when analytical methods are used. Plug-and-Play (PnP) methods provide an effective way to combine data fidelity with learned image priors, while stochastic PnP methods further improve robustness by matching the denoiser input distribution through re-noising. However, these methods often require many iterations to converge, which limits their practical efficiency. In this work, we propose a multilevel (ML) stochastic PnP method for SVCT that accelerates stochastic PnP reconstruction. We highlight that, in the stochastic setting, directly enforcing prior coherence across levels would require accurately estimating fine-level prior gradients through multiple denoiser function evaluations, which substantially increases the computational cost. Motivated by this observation, we perform the multilevel steps in multiresolution analysis (MRA) approximation spaces. This choice is supported by the structure of the wavelet decomposition, which causes the prior-coherence correction to vanish in expectation, thereby avoiding costly estimation of fine-level stochastic prior gradients for the coarse-level corrections. Experiments on SVCT reconstruction show that our method, called Multilevel Stochastic Plug-and-Play (ML-SPnP), achieves reconstruction quality comparable to state-of-the-art methods while substantially reducing runtime.

30. 【2606.24564】PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments

Link: https://arxiv.org/abs/2606.24564

Authors: Zhenyang Li, Lutao Jiang, Yizhou Zhao, Ying-Cong Chen, Xin Wang, Weikai Chen, Yifan Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing realistic, physically plausible garments, physically plausible, single image remains, fundamental challenge

Comments: 11 pages, 6 figures

Abstract:Reconstructing realistic, physically plausible garments from a single image remains a fundamental challenge. Template-free methods capture surface geometry but lack explicit sewing structure for simulation, while programmatic systems are simulation-ready but constrained by predefined templates. This reveals a fundamental representation gap between geometric reconstruction and structured garment construction. We present PatternGSL, a structured garment representation in the form of a template-free and learnable specification language that encodes complete sewing patterns, including panel boundaries, parameterized seams, and explicit stitch topology, in a compact and standardized form. PatternGSL preserves the physical rigor of pattern-based models while removing template dependence, elevating sewing structure as a first-class target for generative modeling. We further propose a vision-language framework that predicts PatternGSL specifications directly from a single image and decodes them into garments using lightweight deterministic validity handling, without optimization-based refinement or manual cleanup. In addition, we introduce PatternGSLData, the first large-scale image-to-GSL paired dataset comprising 300K samples with complete sewing pattern annotations, enabling supervised VLM training for structured garment reconstruction. Experiments demonstrate improved pattern accuracy over prior baselines, explicit sewing-structure recovery, reliable cloth simulation, and pattern-level editing through the same deterministic decoding pipeline. Code and data-processing scripts will be released at this https URL.

31. 【2606.24561】Quantum CT via Dynamic Interval Encoding and Prior-Balanced QUBO Reconstruction

Link: https://arxiv.org/abs/2606.24561

Authors: Ao Wang, Yikuang Yuluo, Yujie Liu, Shuangyang Zhong, Yuwen Zhang, Zihao Wang, Fenglin Liu, Andreas Maier, Haijun Yu, Yixing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quantum computed tomography, based quantum computed, computed tomography, encodings increase QUBO, Quadratic unconstrained binary

Comments: 10 pages, 10 figures

Abstract:Quadratic unconstrained binary optimization (QUBO)-based quantum computed tomography (CT) casts reconstruction as a binary quadratic problem for quantum annealing and hybrid quantum-classical solvers. For grayscale CT, however, image encoding is constrained by the binary-variable budget: fixed global bit-plane encodings increase QUBO size and coupling complexity as gray-level precision improves, whereas low-bit encodings introduce quantization error. We propose a QUBO-based grayscale CT reconstruction framework that combines dynamic interval encoding with prior-balanced optimization. Each refinement round encodes active pixels only within local gray-level intervals around the current estimate, and a boundary-hit-guided update rule adaptively switches between search expansion and local refinement. To improve optimization stability, the method balances projection-domain data consistency and an edge-preserving quadratic prior before forming the final QUBO. Sparse-view and limited-angle fan-beam CT experiments show that the proposed method recovers structures and gray-level distributions more faithfully than the evaluated analytic, iterative, variational, and representation-based baselines. Expressivity analysis and ablation studies further indicate that the improvement mainly arises from effective gray-level representation through dynamic local encoding and more stable coupling between data fidelity and the prior. Experiments on the D-Wave hybrid binary quadratic model (BQM) solver further demonstrate that the formulation is executable on a hardware-backed hybrid quantum-classical backend.
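
The trade-off between bit budget and quantization error can be illustrated with a small sketch. It assumes a plain linear binary expansion of a gray level inside an interval `[lo, hi]`; the paper's actual encoding and update rule are not given in the abstract, and the function names and intervals are hypothetical.

```python
def decode(bits, lo, hi):
    """Map binary variables (least significant first) to a gray level in [lo, hi]."""
    n = len(bits)
    level = sum(b << k for k, b in enumerate(bits))  # integer in [0, 2**n - 1]
    return lo + (hi - lo) * level / (2 ** n - 1)

def max_quantization_error(lo, hi, n_bits):
    """Worst-case rounding error: half the spacing between adjacent levels."""
    return (hi - lo) / (2 ** n_bits - 1) / 2

# Same 3-bit budget per pixel, two different intervals.
print(max_quantization_error(0, 255, 3))   # global range: about 18.2 gray levels
print(max_quantization_error(96, 110, 3))  # local interval: 1.0 gray level
print(decode((1, 0, 0), 96, 110))          # 98.0
```

With the same number of binary variables, narrowing the interval around the current estimate shrinks the representable step size, which is the intuition behind encoding pixels locally rather than over the full gray range.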

32. 【2606.24557】Heterogeneous Knowledge Distillation via Geometry Decoupling and Momentum-Aware Gradient Regulation

Link: https://arxiv.org/abs/2606.24557

Authors: Wuming Yang, Xiang Zhang, Hongmin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Heterogeneous Knowledge Distillation, Transformer to CNN, Heterogeneous Knowledge, transfer knowledge, severe training instability

Comments: Preprint. Under review

Abstract:Heterogeneous Knowledge Distillation (HKD) aims to transfer knowledge across varying architectures (e.g., from Transformer to CNN) but inherently suffers from severe training instability. We reveal that this instability stems from two highly coupled challenges: massive feature norm discrepancies that cause optimization drag, and severe gradient conflicts between the primary and distillation objectives arising from distinct inductive biases. To achieve stable distillation, we propose SPOFA, a framework built upon a novel Feature and Gradient Dual Stabilization mechanism. Specifically, at the feature level, we introduce a LayerNorm-based decoupling projector that explicitly decouples feature magnitude from direction, creating a bounded and stable space for semantic alignment. At the gradient level, we propose a momentum-driven Exponential Moving Average (MEMA) dynamic scaler. By establishing a robust historical baseline of the optimization trajectory, MEMA actively evaluates instantaneous gradient conflicts and adaptively penalizes harmful distillation signals, guaranteeing stable convergence. Importantly, SPOFA achieves this dual stabilization with an extremely lightweight parameter footprint. Extensive experiments on two mainstream benchmarks demonstrate that SPOFA achieves state-of-the-art accuracy, significantly outperforming computationally expensive methods while introducing only minimal computational overhead compared to standard baselines.
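
The abstract does not describe the projector's design. The sketch below only demonstrates the property that makes layer normalization a natural tool for removing norm discrepancies: without affine terms, its output is unchanged when the input is rescaled. The function and the toy feature vectors are illustrative assumptions, not SPOFA's implementation.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a feature vector to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

student = [1.0, 2.0, 3.0]
teacher = [100.0, 200.0, 300.0]   # same direction, 100x larger norm

print(layer_norm(student))
print(layer_norm(teacher))        # numerically the same vector as above
```

After normalization, the two features differ only if their directions differ, so an alignment loss computed on them is bounded regardless of how far apart the raw norms are.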

33. 【2606.24548】Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Link: https://arxiv.org/abs/2606.24548

Authors: Jiayi Lei, Yuandong Pu, Xingyu Han, Rongpeng Zhu, Jing Xu, Jinyao Wang, Zijian Zhou, Bin Fu, Yuewen Cao, Yihao Liu, Yongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, producing visually realistic, natural language prompts, visually realistic images, achieved remarkable

Comments: 10 pages, 7 figures. Project page: [this https URL](https://github.com/jylei16/CF-World.github.io)

Abstract:Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell's inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark designed to investigate whether text-to-image models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation requiring causal deduction from altered rules. We evaluate both open-source and closed-source T2I models using a Vision Language Model (VLM)-based evaluator (CF-Eval). Furthermore, we introduce two metrics: Prior Resistance Rate (PRR), which measures a model's ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether models can maintain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models exhibit sharp degradation from factual to counterfactual settings. Further analyses suggest that these failures arise because current T2I models encode world knowledge and visual appearances as tightly coupled patterns. Consequently, their heavy reliance on frequent visual co-occurrences within the training data forces them to default to familiar commonsense priors when tasked with rendering counterfactual worlds.

34. 【2606.24539】PointVG-R: Internalizing Geometric Reasoning in MLLMs for Precise Pointing Localization via Visual Chain of Thought

Link: https://arxiv.org/abs/2606.24539

Authors: Ling Li, Bowen Liu, Zinuo Zhan, Jianhui Zhong, Ziyu Zhu, Bingcai Wei, Kenglun Chang, Zhidong Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precisely locate target, locate target objects, complex spatial relationships, deciphering complex spatial, grounding requires models

Comments:

Abstract:Pointing-based visual grounding requires models to precisely locate target objects by deciphering complex spatial relationships between the visual scene and pointing gestures. Traditional methods typically encode input images into static feature representations and perform reasoning primarily within the linguistic domain, often overlooking the rich perceptual cues and explicit spatial geometry inherent in images. In this study, we aim to mitigate the cognitive vulnerability of models in interpreting gestural spatial relations by proposing PointVG-R, a reasoning-guided Multi-modal Large Language Model (MLLM). PointVG-R introduces geometric-aware reasoning for pointing-based grounding, enabling the model to think with images through the strategic integration of Reinforcement Learning (RL) and cold-start data. Specifically, we design a novel geometric reasoning pipeline that simulates the iterative cognitive process humans employ when interpreting pointing gestures. Furthermore, we construct EgoPoint-CoT, a high-quality visual Chain-of-Thought (CoT) dataset featuring detailed reasoning trajectories to guide the model via Supervised Fine-Tuning (SFT) and RL. To address the varying quality of learning signals encountered during training, we further propose an Adaptive Importance Weighting strategy based on Group Variance, which dynamically adjusts reward signals to optimize the learning process. Experimental results demonstrate that PointVG-R achieves SOTA performance, outperforming the baseline by 15.86 points in mIoU. Extensive ablation studies further validate the efficacy of our proposed modules. Code: this https URL.

35. 【2606.24538】ForensicsTok: Forensics-Guided Tokenized Modeling for Image Tampering Localization

Link: https://arxiv.org/abs/2606.24538

Authors: Lei Xu, Haowei Wang, Shen Chen, Taiping Yao, Bin Li, Changsheng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, offer powerful reasoning

Comments: 16 pages, 4 figures, 8 tables

Abstract:Multi-modal Large Language Models (MLLMs) offer powerful reasoning for forensic tasks, yet existing approaches utilizing exogenous segmentation decoders often suffer from suboptimal localization. The reliance on stitched pipelines introduces information bottlenecks during backpropagation, which dilutes spatial signals and is limited by semantic priors of the segmentor. To address these limitations, we propose ForensicsTok, which reformulates image manipulation localization as an autoregressive sequence generation task. ForensicsTok directly generates spatially grounded token sequences, enabling precise mask prediction without intermediary supervision. Specifically, we introduce a Token Splatting Decoder (TSD) to map tokens to binary masks via codebook-aware code smoothing, which mitigates sharp gradients from deterministic detokenizers. Furthermore, to capture diverse tampering clues, we propose a Hierarchical Expert Fusion (HEF) module that injects multi-scale features from a forensic expert model. This unified architecture effectively compensates for the lack of forensic priors in standard MLLMs. Extensive experiments on six benchmarks show that ForensicsTok substantially improves over existing MLLM-based baselines and slightly improves over strong forensic expert baselines, while exhibiting stronger robustness to perturbations.

36. 【2606.24525】VisCritic: Visual State Comparison as Process Reward for GUI Agents

Link: https://arxiv.org/abs/2606.24525

Authors: Jiachen Qian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: show strong potential, long-horizon scenarios due, vision-language models show, models show strong, automating digital tasks

Comments: 17 pages, 4 figures; ECCV 2026 submission; supplementary material uploaded as ancillary file

Abstract:GUI agents powered by vision-language models show strong potential for automating digital tasks, yet frequently fail in long-horizon scenarios due to the absence of step-level verification. Existing process reward models verify actions through textual reasoning alone, missing the visual nature of GUI state changes. We introduce VisCritic, a visual process reward framework that verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space. VisCritic employs a Siamese vision transformer to extract change-aware representations, coupled with an Action-Aware Critic Head that jointly evaluates action success, task progress, and error type. A data construction pipeline generates weakly supervised critic-training samples from existing trajectories without additional human labels. Experiments and offline analyses across five benchmarks demonstrate that VisCritic serves as a plug-and-play enhancement for diverse GUI agents, generally improving benchmark metrics while providing visual diagnostic cues.

37. 【2606.24516】What Do Flow-Based Inverse Solvers Approximate? A Posterior-Transport View

Link: https://arxiv.org/abs/2606.24516

Authors: Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deterministic probability-flow ODE, solve imaging inverse, probability-flow ODE, pretrained flow-matching prior, imaging inverse problems

Comments:

Abstract:A growing family of training-free solvers (FlowDPS, FLOWER, PnP-Flow, and their diffusion ancestors DPS and DAPS) repurposes a pretrained flow-matching prior to solve imaging inverse problems by adding a measurement-guidance term to the deterministic probability-flow ODE. Despite strong empirical results, what these per-step corrections actually approximate, and how far the resulting samples are from the true posterior $p(x\mid y)$, has not been characterized. We give a posterior-transport account of flow-based inverse problem solving. Our starting point is a simple but consequential fact: for a *deterministic* flow prior, Bayesian conditioning is realized entirely by a *reweighting of the source distribution*, not by a drift correction; pushing the reweighted source through the *unmodified* velocity field yields exact posterior samples. From this we show that trajectory-guidance solvers can be read as the minimum-kinetic-energy *correction* field needed to morph the unconditional source into the posterior, and that FlowDPS / FLOWER / PnP-Flow correspond to distinct zeroth-order / Gaussian / proximal approximations of this single object; we bound the resulting posterior bias in Wasserstein distance. A controlled 2D study with a closed-form posterior confirms the theory decisively: source reweighting matches the true posterior to the Monte-Carlo floor on every metric, whereas trajectory guidance incurs 200-800× larger error and collapses posterior modes, *regardless of guidance strength*. Guided by the analysis, we propose a cheap, principled velocity-correction solver that is competitive across two in-domain priors (AFHQ, CelebA) and two out-of-distribution settings while, unlike point-estimate source-space optimizers, producing diverse posterior samples with uncertainty that correlates with reconstruction error.
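
The source-reweighting fact can be checked on a toy problem. The sketch below is a 1D illustration under stated assumptions: the "flow" is a fixed affine map standing in for a pretrained model, the measurement is the signal plus Gaussian noise, and reweighting is done by self-normalized importance sampling. None of these choices come from the paper, and its proposed velocity-correction solver is not shown.

```python
import math
import random

def flow(z):
    """Stand-in for a deterministic pretrained flow: maps N(0, 1) to the prior N(1, 4)."""
    return 2.0 * z + 1.0

def posterior_mean_by_source_reweighting(y, sigma=1.0, n=50_000, seed=0):
    """Weight source samples by the likelihood of their pushed-forward image."""
    rng = random.Random(seed)
    xs, ws = [], []
    for _ in range(n):
        x = flow(rng.gauss(0.0, 1.0))                       # unmodified flow
        ws.append(math.exp(-0.5 * ((y - x) / sigma) ** 2))  # p(y | x) up to a constant
        xs.append(x)
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def exact_posterior_mean(y, sigma=1.0):
    """Closed-form Gaussian posterior mean for the prior N(1, 4)."""
    prior_mean, prior_var = 1.0, 4.0
    post_var = 1.0 / (1.0 / prior_var + 1.0 / sigma ** 2)
    return post_var * (prior_mean / prior_var + y / sigma ** 2)

print(exact_posterior_mean(3.0))                  # 2.6
print(posterior_mean_by_source_reweighting(3.0))  # close to 2.6
```

The flow itself is never modified: only the weights on the source samples change, and the weighted push-forward matches the closed-form posterior up to Monte Carlo error.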

38. 【2606.24499】GeoIMO: Geometry-Driven Independent Motion Classification for Event Cameras

Link: https://arxiv.org/abs/2606.24499

Authors: Anil Bayram Gogebakan, Filippo Marostica, Alessio Caviglia, Alessandro Savino, Stefano Di Carlo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing automotive event, motion-aware event perception, Existing automotive, frame pipelines, making them poorly

Comments:

Abstract:Existing automotive event datasets rely on appearance-based annotations from frame pipelines, making them poorly suited for motion-aware event perception. We present a geometry-driven, annotation-free framework that classifies detected objects as static or independently moving by exploiting ego-motion structure directly from the event stream. A Focus of Expansion model with yaw compensation estimates global background motion, while objects are labeled as moving when local motion deviates from this prediction, as quantified by a scale-invariant residual. Temporal stabilization improves robustness across consecutive event windows. The method requires no learning, no manual motion labels, and works with any input bounding boxes. Experiments on MVSEC and the Prophesee 1 Megapixel Automotive Detection dataset demonstrate consistent performance across diverse driving scenarios, with yaw compensation improving results during turns and a simple translational local model offering a favorable accuracy-efficiency trade-off.
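
As a rough picture of the geometric test, the sketch below assumes pure forward translation, so static background flow points radially away from the Focus of Expansion (FoE). It scores an object by one minus the cosine between its observed flow and that radial direction. This particular residual, the function name, and the pixel values are hypothetical; the abstract does not define the paper's residual, and yaw compensation is omitted.

```python
import math

def foe_residual(pixel, flow, foe):
    """1 - cosine between the observed flow and the radial direction from the FoE.

    Depends only on flow direction, so it is invariant to flow magnitude.
    """
    rx, ry = pixel[0] - foe[0], pixel[1] - foe[1]
    r_norm = math.hypot(rx, ry)
    f_norm = math.hypot(flow[0], flow[1])
    if r_norm == 0.0 or f_norm == 0.0:
        return 0.0  # direction undefined; treat as consistent with background
    return 1.0 - (rx * flow[0] + ry * flow[1]) / (r_norm * f_norm)

foe = (320.0, 240.0)
pixel = (420.0, 240.0)                        # object centre, right of the FoE
print(foe_residual(pixel, (5.0, 0.0), foe))   # 0.0: moves like static background
print(foe_residual(pixel, (50.0, 0.0), foe))  # 0.0: same direction, larger magnitude
print(foe_residual(pixel, (0.0, 3.0), foe))   # 1.0: perpendicular, likely independent motion
```

An object would be labeled as independently moving when this residual exceeds a chosen threshold. The threshold value is left open here because the abstract does not give one.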

39. 【2606.24498】VistaRef: Boosting Visual Spatial Orientation Awareness for Pointing-to-Object Detection

链接：https://arxiv.org/abs/2606.24498

作者：Ling Li,Zhizhen Cai,Xinkun Wu,Ziyu Zhu,Jiaqing Lyu,Bowen Liu,Zhidong Deng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Grounding deictic gestures, seamless spatial interaction, human-robot collaboration, providing a basis, deictic gestures

备注：

点击查看摘要

Abstract:Grounding deictic gestures in natural images is fundamental to AR and human-robot collaboration, providing a basis for seamless spatial interaction. While Transformer-based visual models have achieved significant progress in general object detection, their global attention mechanisms often neglect micro-geometric relationships, degrading orientation accuracy. In pointing tasks, this deficiency manifests as an inability to accurately capture the pointing ray implied by finger poses, which results in pointing drift and localization ambiguity when dealing with distant or densely packed objects. To address this, we propose VistaRef, a framework designed to explicitly enhance spatial orientation awareness. First, we develop the Local Hand Entity Modeling (LHEM) module, which incorporates hand-pose embeddings to strengthen the model's capability to capture subtle finger deviations. Second, drawing inspiration from multi-view geometry, we construct the Geometric Ray Modeling (GRM) module to transform implicit orientation information into explicit spatial geometric features, guiding feature aggregation and deep fusion via attention mechanisms. Furthermore, we introduce a novel Orientation-Consistent Alignment Loss (OCAL) to synergistically supervise hand presence and pointing consistency, ensuring that all architectural improvements collectively serve the core objective of spatial localization. Experimental results demonstrate that VistaRef significantly outperforms the baseline, achieving a 14-point absolute gain in grounding accuracy. Qualitative analysis further confirms that VistaRef effectively models the geometric correlation from hand to target, bridging the spatial perception gap inherent in traditional Transformers for complex scenarios. Code: this https URL.

40. 【2606.24488】RetiSEM: Generalising Causal Models for Fragmented Biomedical Data

链接：https://arxiv.org/abs/2606.24488

作者：Inam Ullah,Imran Razzak,Shoaib Jameel

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Methodology (stat.ME)

关键词：Learning causal models, jointly observed, Learning causal, data is challenging, fragmented biomedical data

备注：

点击查看摘要

Abstract:Learning causal models from fragmented biomedical data is challenging because clinical, molecular, and imaging variables are often incomplete or not jointly observed. We propose RetiSEM, a domain-constrained structural equation modelling (SEM) framework for causal graph recovery and mediation analysis under limited multimodal resources. This proposed work organises variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes pathway-level effects into TE, NDE, and NIE components. We evaluate RetiSEM across ten synthetic benchmark scenarios that vary in dimensionality, nonlinearity, causal depth, and pathway structure, together with a fragmented real-world setting that combines NHANES clinical variables with externally derived retinal representations. This approach achieves lower structural error and higher causal accuracy than unconstrained baselines across the synthetic benchmarks. In the real-data analysis, retinal variables behave mainly as downstream biomarker-like indicators, with smaller but detectable indirect effects. These findings support our strategy as an interpretable framework for testing structured causal hypotheses in limited-resource biomedical AI. The code and resources for this work are publicly available at: this https URL.
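
The TE / NDE / NIE decomposition mentioned above can be made concrete with a toy linear structural equation model. This sketch is an illustration only, assuming a single exposure X, a single mediator M, and an outcome Y with no interaction terms; the coefficients `a`, `b`, `c` and the function names are hypothetical and do not come from RetiSEM.

```python
def mediator(x, a):
    """Structural equation M = a * X (noise omitted for clarity)."""
    return a * x


def outcome(x, m, b, c):
    """Structural equation Y = c * X + b * M (noise omitted for clarity)."""
    return c * x + b * m


def effects_by_intervention(a, b, c, x0=0.0, x1=1.0):
    """Compute total, natural direct, and natural indirect effects by contrasting interventions on X."""
    m0, m1 = mediator(x0, a), mediator(x1, a)
    te = outcome(x1, m1, b, c) - outcome(x0, m0, b, c)   # change X and let M follow
    nde = outcome(x1, m0, b, c) - outcome(x0, m0, b, c)  # change X, hold M at its x0 value
    nie = outcome(x1, m1, b, c) - outcome(x1, m0, b, c)  # hold X at x1, let M change
    return {"TE": te, "NDE": nde, "NIE": nie}
```

In this linear setting the result reduces to NDE = c, NIE = a * b, and TE = c + a * b for a unit change in X, so the direct and indirect pathways add up to the total effect.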

41. 【2606.24484】Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

链接：https://arxiv.org/abs/2606.24484

作者：Xingsong Ye,Yongkun Du,Jiaxin Zhang,Haojie Zhang,Chong Sun,Chen Li,Jing Lyu,Zhineng Chen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：scene TExt Recognition, TExt Recognition, making WordArt-oriented scene, WordArt-oriented scene TExt, general Scene Text

备注： Accepted by ECCV 2026

点击查看摘要

Abstract:WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, with the scale improved by hundreds of times compared to existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at this https URL.

42. 【2606.24479】MambaRaw: Selective State Space Modeling for Efficient 4K Raw Image Reconstruction

链接：https://arxiv.org/abs/2606.24479

作者：Peize Li,Fanhu Zeng,Tongda Xu,Xingguo Xu,Xinjie Zhang,Xingtong Ge,Haotian Zhang,Yan Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：In-camera JPEG previews, In-camera JPEG, negligible storage cost, JPEG previews, storage cost

备注： Accepted by ECCV 2026

点击查看摘要

Abstract:In-camera JPEG previews are ubiquitous in raw image formats and provide an sRGB reference at negligible storage cost. Although existing metadata-based reconstruction frameworks can exploit this side information when recovering raw images, their context models often become computationally expensive, especially at high resolutions (e.g., 4K raw images), because attention mechanisms scale quadratically with feature-map size, which hinders their practical application. To address these limitations, we propose MambaRaw, a JPEG-conditioned metadata-based raw image reconstruction framework that uses State Space Models (SSMs) to estimate entropy parameters efficiently. Our key contribution comprises a Spatial-Energy Coupled Context Modeling mechanism with two lightweight modules: (1) TileMambaBlock, which performs Mamba-style selective scanning only on information-dense tiles to improve efficiency; and (2) Energy-Aware Refinement (EAR), an identity-initialized residual module that enhances feature representation to match the long-tail energy distribution of raw signals. Extensive experiments on three camera datasets (Sony, Olympus, Samsung) show consistent improvements over strong metadata-based baselines and set a new state of the art for JPEG-guided raw reconstruction with great efficiency. Notably, at low metadata bitrates, MambaRaw increases PSNR by 1.2--1.4 dB and reduces end-to-end coding latency by about 9%. Code is released at this https URL.

43. 【2606.24477】video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

链接：https://arxiv.org/abs/2606.24477

作者：Yixuan Li,Guangzhi Sun,Yudong Yang,Wei Li,Zejun MA,Chao Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)

关键词：reduced frame rates, miss critical information, Video large language, large language models, memory budgets

备注：

点击查看摘要

Abstract:Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize relevant segments, and then re-watch these segments at higher temporal or spatial fidelity. In this paper, we present video-SALMONN-R$^3$, the first end-to-end video-LLM that enables re-watch through reinforcement learning without relying on chain-of-thought (CoT) cold-start. This design removes the need for costly CoT data annotations and avoids CoT-based supervised fine-tuning (SFT), which can otherwise degrade the pretrained video understanding abilities. To address the mismatch between the reasoning-first behavior induced by re-watch and the answer-first tendency of pretrained video-LLMs, we propose a re-answer strategy, in which the model first produces a direct answer in the first watch and then refines it after re-watching. Finally, to improve question adherence during re-watching, we propose a re-ask mechanism that re-injects the query when revisiting localized segments. Experimental results show that video-SALMONN-R$^3$ consistently outperforms both the base model and the QA-SFT baseline, while surpassing prior re-watch-based approaches with significantly lower computational cost. Code, models, and data will be publicly released upon acceptance.

44. 【2606.24464】Boosting Text-Driven Video Segmentation via Geometry-Aware Distillation

链接：https://arxiv.org/abs/2606.24464

作者：Tianyu Zhu,Yingping Liang,Hesong Li,Ying Fu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Referring Video Object, segment target objects, Text-driven Referring Video, Text-driven Referring, target objects

备注： Accepted by ECCV 2026

点击查看摘要

Abstract:Text-driven Referring Video Object Segmentation (RVOS) aims to locate and segment target objects in videos given natural language. However, existing models are typically trained on 2D image or video datasets with naive segmentation losses, which overlooks the geometric consistency across frames and leads to weak spatial understanding. In this paper, we propose Geometry-enhanced Language-guided Video segmentation (GeoLaV), a two-stage framework that distills 3D geometric knowledge from images to enhance text-driven video segmentation. In the first stage, we perform monocular geometry pretraining with monocular novel-view synthesis, enabling the model to acquire geometry-consistent visual representations via spatial alignment on large-scale single-image datasets. In the second stage, we introduce geometry-aware distillation and fine-tune the model on video segmentation datasets, transferring 3D structural knowledge from a general 3D prior model. This process reinforces 3D awareness and improves both spatiotemporal coherence and language grounding in segmentation. Extensive experiments show that our method using only image segmentation data already provides notable zero-shot generalization in RVOS. When combined with geometry-aware distillation for fine-tuning on videos, our method achieves state-of-the-art performance across multiple RVOS benchmarks. The code is available at this https URL.

45. 【2606.24457】Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching

链接：https://arxiv.org/abs/2606.24457

作者：Junpeng Jing,Ronglai Zuo,Zhelun Shen,Shangchen Zhou,Rolandos Alexandros Potamias,Stefanos Zafeiriou,Krystian Mikolajczyk,Jiankang Deng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：additional foundation-model priors, Recent advances, heavy computation, achieved remarkable accuracy, foundation-model priors

备注：

点击查看摘要

Abstract:Recent advances in stereo matching have achieved remarkable accuracy, but often rely on large models, heavy computation, or additional foundation-model priors, making them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both architecture and training perspectives. Architecturally, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework, optimized for real inference latency rather than theoretical MACs alone. For training, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, enabling smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at this https URL.

46. 【2606.24449】SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking

链接：https://arxiv.org/abs/2606.24449

作者：Mohamad Alansari,Yonathan Michael,Hasan AlMarzouqi,Muzammal Naseer,Naoufel Werghi,Sajid Javed

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：memory update mechanism, identify confidence-only mask, confidence-only mask selection, visual object tracking, rapid motion

备注： Accepted for publication at the European Conference on Computer Vision (ECCV 2026)

点击查看摘要

Abstract:We revisit the memory update mechanism in SAM2-based visual object tracking and identify confidence-only mask selection as the dominant cause of drift under occlusion, rapid motion, and distractors. We introduce SENTRY, a training-free, plug-and-play, refine-before-write module that validates each memory update for short-horizon temporal consistency before committing it. SENTRY aggregates diverse segmentation hypotheses per frame, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching against recent trajectories to favor temporally and geometrically consistent masks. It leaves the base architecture untouched, replacing confidence-driven writes with consistency-validated ones. For fair evaluation, we re-evaluate major open-source SAM2-based trackers across all available scales and datasets, filling gaps in prior reports. Integrated into five strong baselines, SENTRY delivers consistent gains across nine benchmarks, achieving new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. Despite these checks, the SAM2-L version runs at 32.8 FPS on an A100, and across compatible hosts adds only about 0.4--0.6 GB VRAM. Our results provide the first unified all-scale evaluation of SAM2-based trackers and show that enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining.

47. 【2606.24447】P-MTP: Efficient Document Parsing via Multi-Token Prediction with Progressive Depth Scaling

链接：https://arxiv.org/abs/2606.24447

作者：Le Xiang,Chenxi Zhai,Shu Wei,Jingjing Wu,Qunyi Xie,Xiao Tan,Kunbin Chen,Wei He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：significant latency bottleneck, Progressive Multi-Token Prediction, mapping from images, structured text, imposing a significant

备注：

点击查看摘要

Abstract:Vision-Language Models (VLMs) have revolutionized document parsing by enabling end-to-end mapping from images to structured text, but they impose a significant latency bottleneck, particularly for token-dense documents. While Multi-Token Prediction (MTP) has emerged as a promising approach for accelerating inference, its potential is constrained by optimization instability when scaling to deeper look-ahead depth. In this paper, we propose P-MTP, a framework that leverages Progressive Multi-Token Prediction with a lightweight MTP module to scale the look-ahead depth for high-throughput document parsing. Specifically, we introduce Progressive Curriculum Loss that adaptively re-weights different look-ahead depths using cumulative path reliability and retrospective target consistency. By effectively suppressing gradient noise in long-range predictions, P-MTP facilitates an automated easy-to-hard optimization transition, enabling the model to master increasingly distant look-ahead depths. Furthermore, we propose Confidence-Gated Dynamic Drafting to maximize the effective look-ahead depth and acceptance rate by adaptively calibrating speculative length during inference, thereby minimizing computational waste and further pushing the boundaries of inference speedup. Experimental results across multiple benchmarks and architectures demonstrate that P-MTP achieves up to a $5\times$ speedup with negligible loss in accuracy, providing the first successful validation of extensive look-ahead MTP in the document parsing domain.
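
One way to picture confidence-gated drafting is as a rule that stops speculating once the draft is unlikely to be accepted. The abstract does not specify the gating rule, so the sketch below is purely an assumption: it multiplies per-depth confidences into a running path reliability and stops at the first depth where that product falls below a hypothetical threshold. The function name and default threshold are illustrative.

```python
def gated_draft_length(confidences, threshold=0.5):
    """Return how many drafted tokens to keep, given per-depth confidences in (0, 1].

    Assumed rule: keep extending the draft while the cumulative product of
    confidences (a proxy for the chance that the whole prefix is accepted)
    stays at or above `threshold`.
    """
    reliability = 1.0
    length = 0
    for p in confidences:
        reliability *= p
        if reliability < threshold:
            break
        length += 1
    return length
```

For example, `gated_draft_length([0.9, 0.9, 0.9, 0.5], threshold=0.6)` keeps the first three drafted tokens and drops the fourth, since the running product falls from about 0.73 to about 0.36 at the last step.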

48. 【2606.24441】S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing

链接：https://arxiv.org/abs/2606.24441

作者：Qingxiao Li,Zikai Wang,Qingli Wang,Nan Xu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：scientific image understanding, scientific image, image, image generation, image understanding

备注： 32 pages, 15 figures

点击查看摘要

Abstract:We present S1-Omni-Image, an open-weight unified multimodal model for scientific image understanding, generation, and editing. Unlike general-purpose image generation models, scientific image tasks require not only high-fidelity synthesis, but also robust understanding of scientific semantics, structural relations, domain knowledge, and task intent. To this end, S1-Omni-Image builds on the scientific multimodal reasoning backbone S1-VL-32B and couples its understanding capability with an image generation module under a unified think-before-generate paradigm. Given a user instruction, the model first produces a task-oriented reasoning trace, a textual answer, and a task special token; their hidden states are then injected into the generation module to condition image generation or editing. S1-Omni-Image supports scientific image understanding, generation, and editing in a unified framework. For generation, it focuses on scientific illustrations and text rendering, including logical diagrams, relational comparisons, data charts, and realistic scientific visualizations. For editing, it casts segmentation and other domain-specific vision tasks as native image editing problems, enabling multi-turn illustration editing, medical and geographic image segmentation, medical image translation, and scientific image super-resolution. We construct SciGenEdit, a 314K-sample training dataset, and release the model weights, inference code, and SciGenEdit-10K. Experiments show that S1-Omni-Image substantially improves scientific image generation and editing while preserving the scientific image understanding capability inherited from S1-VL-32B. It outperforms open-source models on GenExam and TechImage-Bench, achieves state-of-the-art results on four editing benchmarks including MSD, cigRockSEM, SynthRAD2025, and IXI, and maintains stable performance on scientific image understanding evaluations.

49. 【2606.24433】MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching

链接：https://arxiv.org/abs/2606.24433

作者：Kamil Kwarciak,Marek Wodzinski

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：downstream clinical workflows, remains insufficiently studied, setting remains insufficiently, Medical point cloud, point cloud completion

备注： 25 pages, 9 figures

点击查看摘要

Abstract:Medical point cloud completion is important for anatomical reconstruction and downstream clinical workflows, yet generative modeling in this setting remains insufficiently studied. We investigate completion through continuous-time generative modeling and introduce PCFM, a PTv3-backed flow matching approach for medical point cloud completion. We evaluate on SkullFix and SkullBreak, and additionally on the more recent Mandibular Defect dataset. We build strong baselines by adapting PTv3 to a deterministic encoder-decoder completion model and by instantiating diffusion completion (PCDiff) with both PVCNN and PTv3 denoisers. PCFM with PTv3 is competitive with the deterministic PTv3 baseline and achieves state-of-the-art generative performance across datasets, while requiring substantially fewer sampling steps than diffusion. At the best operating points, PTv3 also yields clear throughput gains, providing up to a 7$\times$ speed-up for PCFM compared to a PVCNN backbone. Finally, we study empirical scaling trends by varying model size and point cardinality, showing consistent gains with higher point resolution and informative trade-offs across model scales.

50. 【2606.24430】Transformation Behavior of Images in Latent Space

链接：https://arxiv.org/abs/2606.24430

作者：Christian Zöllner(1),Mozzam Motiwala(1),Aysel Ahadova(1),Gerrit Anders(4),Robert Hüneburg(2 and 3),Jacob Nattermann(2 and 3),Matthias Kloor(1) ((1) Department of Applied Tumor Biology Institute of Pathology Heidelberg University Hospital, (2) National Center for Hereditary Tumor Syndromes University Hospital Bonn, (3) Department of Internal Medicine I University Hospital Bonn, (4) Leibniz Institut für Wissensmedien)

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：histopathology classification tasks, classification tasks typically, tasks typically relies, Training of neural, Meta Research Team

备注：

点击查看摘要

Abstract:Training of neural networks for histopathology classification tasks typically relies on data encoding into latent space, which reduces complexity and improves performance. There are several encoder networks available, either pretrained on general image datasets such as ImageNet, or specifically on histopathological images. Training of encoder networks should be adapted to downstream tasks, allowing encoding of biologic/diagnostic content while rendering networks invariant to label-irrelevant transformations. This paper investigates the effect of classical image transformation on the latent space, using networks provided by Lunit Inc. and Bioptimus, both focusing on pathological images, and by Meta Research Team. We assess variance of embeddings resulting from standard data transformations by comparing original and transformed image embeddings and by contrasting them with random, unrelated embeddings, using image tiles from hematoxylin/eosin-stained sections available in a colorectal tissue dataset and the publicly accessible TCGA dataset. Our findings show that embeddings of original and transformed images are closer to each other than to random embeddings, indicating robustness to transformations. However, they are not fully invariant, revealing that the encoder networks do not completely neutralize transformation effects in latent space, explaining why transformation-mediated augmentation of datasets can improve performance. Significant differences were observed between general and histopathology-specific encoder networks.
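
The comparison described above (original versus transformed embeddings, contrasted with unrelated embeddings) can be sketched as follows. The abstract does not state which distance or similarity measure is used; cosine similarity is an assumption here, and the function names and inputs are hypothetical. Embeddings are plain lists of floats, as would be produced by any encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(embeddings_a, embeddings_b):
    """Average pairwise similarity between embeddings_a[i] and embeddings_b[i]."""
    sims = [cosine_similarity(a, b) for a, b in zip(embeddings_a, embeddings_b)]
    return sum(sims) / len(sims)


def invariance_report(original, transformed, unrelated):
    """Compare original-vs-transformed similarity against an original-vs-unrelated baseline."""
    return {
        "transformed": mean_similarity(original, transformed),
        "unrelated": mean_similarity(original, unrelated),
    }
```

Under this reading, an encoder is robust when the `transformed` score clearly exceeds the `unrelated` baseline, and fully invariant only if the `transformed` score reaches 1.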

51. 【2606.24422】EgoSAT: A Comprehensive Benchmark of Egocentric Streaming Interaction Understanding

链接：https://arxiv.org/abs/2606.24422

作者：Yijia Lei,Jinzhao Li,Yichi Zhang,Jiacheng Hua,Yin Li,Miao Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：modern vision-language models, capabilities of modern, modern vision-language, streaming interaction understanding, comprehensive benchmark

备注： Accepted to ECCV 2026. Project page: [this https URL](https://leiyj23.github.io/EgoSAT/)

点击查看摘要

Abstract:We introduce EgoSAT, the first comprehensive benchmark for egocentric video reasoning in streaming settings, designed to evaluate the capabilities of modern vision-language models (VLMs). The benchmark targets streaming interaction understanding, where video frames arrive sequentially and models must continuously interpret evolving visual context. EgoSAT unifies several previously distinct tasks within a single streaming framework. In this formulation, queries about completed events correspond to retrospective reasoning, queries about ongoing activities require online understanding, and queries about future actions involve prospective anticipation. This unified setting requires models to reason about the past, present, and future while operating under the constraint that only previously observed frames are available. EgoSAT contains 1,997 unique videos spanning 165 hours of egocentric footage and around 4,800 high-quality question-answer pairs, carefully designed to probe reasoning across varying temporal contexts. Using this benchmark, we evaluate a diverse set of both open-weight and closed-weight VLMs, providing a systematic assessment of their ability for streaming interaction understanding. By distinguishing answerability and conducting diagnostics on confidence of models, we find existing models not only struggle with prospective and retrospective modeling, but also exhibit severe mis-calibration: confidence often fails to track inherent answerability, leading to dangerous "confidently wrong" behaviors. Project page: this https URL

52. 【2606.24404】Modality-Aware Out-of-Distribution Detection for Multi-Modal Action Recognition

链接：https://arxiv.org/abs/2606.24404

作者：Lars Doorenbos,Duc Manh Vu,Serdar Ozsoy,Juergen Gall

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：action recognition models, recognition models increases, multi-modal OOD detection, OOD detection, action recognition

备注： Accepted at ECCV '26

点击查看摘要

Abstract:The incorporation of additional modalities into action recognition models increases their performance across a wide range of settings. However, how this additional information can contribute to making the models more robust remains underexplored, particularly for the case of multi-modal out-of-distribution (OOD) detection. While methods exist that regularize the multi-modal training process with OOD detection in mind, they still apply off-the-shelf OOD detectors designed for the uni-modal case during inference, discarding important information. Based on an interesting relationship we find between the multi-modal and uni-modal predictions, we propose to use this signal to build a post-hoc detector explicitly designed for the multi-modal scenario. We combine this new source of information with a feature-space score, which detects off-manifold samples in the multi-modal space, and normalize them by the multi-modal logits. In doing so, the proposed hybrid detector is compatible with existing training-time approaches and consistently improves performance. Experiments on a wide range of established datasets from the MultiOOD benchmark show that, on average, our approach outperforms the state of the art. Our results show the importance of explicitly considering the different modalities at inference time for multi-modal OOD detection.

53. 【2606.24375】MATCH: Flow Matching for Multi-View Anomaly Detection

链接：https://arxiv.org/abs/2606.24375

作者：Mathis Kruse,Melissa Schween,Bodo Rosenhahn

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：increasing production efficiency, Detecting anomalies, multi-view anomaly detection, Flow Matching, anomaly detection

备注： Accepted at ECCV 2026

点击查看摘要

Abstract:Detecting anomalies in industrial objects is an important topic for increasing production efficiency. More complex objects often require the analysis of several viewpoints, which has led to the field of multi-view anomaly detection. We present MATCH, the first multi-view anomaly detection method based on Flow Matching (FM). With the ODE formulation of Flow Matching, we can estimate likelihoods and thereby derive an anomaly score to detect anomalies in multi-view image data at object, image, and pixel-level. The architectural flexibility of FM models allows us to efficiently transform features of different spatial sizes to the normal distribution. We evaluate thoroughly on the already established Real-IAD data set and are also the first to provide a comprehensive evaluation of popular anomaly detection methods for the MANTA-Tiny data set. MATCH achieves state-of-the-art performance in both anomaly detection and segmentation, all while running on consumer-level hardware. By omitting the costly divergence term needed for likelihood estimation, we ensure that MATCH is usable in real-time production scenarios. Lastly, several ablation studies are conducted to validate the methodological choices.

54. 【2606.24371】Structural Kolmogorov-Arnold Convolutions: Learnable Function on the Values or the Filter Shape as Parameter-Efficient Alternative to Per-Edge Convolutional KANs

链接：https://arxiv.org/abs/2606.24371

作者：Stefano Mereu,Oleksandr Kuznetsov,Gabriele Marchello,Alessandro Galdelli,Emanuele Frontoni,Adriano Mancini,Ferdinando Cannella

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Arnold Networks, Convolutional Kolmogorov, convolutional kernel, replace the fixed, fixed weights

备注：

点击查看摘要

Abstract:Convolutional Kolmogorov--Arnold Networks (KANs) replace the fixed weights of a convolutional kernel with learnable univariate functions. The dominant formulation attaches one such function to every kernel entry and lets it act on pixel values, expressive but parameter-heavy and prone to overfitting. We argue that the learnable functions are better placed in the structure of the convolution than on each edge, and we organise the design space along a single axis: whether the function acts on the pixel values or on the filter shape. We study three realisations. SV-KAN applies one shared univariate function to the values and leaves the spatial filter free and static, a classical convolution with a single learnable shared activation. AG-KAN keeps the shared value function but supplies the spatial structure through a content-adaptive Gaussian gate. RF-KAN instead moves the learnable functions onto the filter shape, building each filter from oriented ridge profiles expanded in a localised oscillatory (Morlet) wavelet basis with content-adaptive amplitudes. Under a matched four-layer protocol with in-run references and three seeds, RF-KAN and SV-KAN reach $88.47\pm0.10\%$ and $88.20\pm0.31\%$ on CIFAR-10 and $64.40\pm0.19\%$ and $64.57\pm0.30\%$ on CIFAR-100, at about $0.4$M parameters. At this matched scale the shape model and the simplest value model meet at the top, both above a plain convolution and every per-edge KAN we tested, including the official Gram variant, at roughly a fifth of the parameters. A controlled study attributes the RF-KAN gain to an intrinsically localised oscillatory basis and to content adaptivity, and an ablation that removes the learned shape entirely, leaving only the shared value function, collapses accuracy by over forty points, identifying the learned shape as the load-bearing ingredient at this scale.
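
The SV-KAN description (one shared univariate function on the values, followed by an ordinary static filter) can be sketched in 1D. This is an illustrative reading, not the paper's implementation: it assumes the shared function is applied before the convolution and parameterises it as a piecewise-linear curve whose `values` at fixed `knots` would be the learnable parameters. All names are hypothetical.

```python
def shared_activation(x, knots, values):
    """Piecewise-linear univariate function; `values` at `knots` stand in for learnable parameters."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    for i in range(len(knots) - 1):
        if knots[i] <= x <= knots[i + 1]:
            t = (x - knots[i]) / (knots[i + 1] - knots[i])
            return (1.0 - t) * values[i] + t * values[i + 1]
    raise ValueError("knots must be sorted in increasing order")


def sv_conv1d(signal, kernel, knots, values):
    """Apply the single shared function to every input value, then a plain 'valid' 1D convolution."""
    activated = [shared_activation(x, knots, values) for x in signal]
    k = len(kernel)
    return [
        sum(kernel[j] * activated[i + j] for j in range(k))
        for i in range(len(activated) - k + 1)
    ]
```

The parameter count in this reading is the kernel size plus the number of knot values, rather than one function per kernel entry as in the per-edge formulation the abstract contrasts against.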

55. 【2606.24361】SignNet-1M: Large-Scale Multilingual Sign Language Video Dataset with Downstream Benchmarks

Link: https://arxiv.org/abs/2606.24361

Authors: Zhewen He, Junyi Hu, Haomian Huang, Zhenhua Li, Yu-Shen Liu, Yi Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: German Sign Language, Sign language models, real-world distribution shifts, Sign language, dataset spanning ASL

Comments: 25 pages. Accepted to ECCV 2026

Abstract:Sign language models are typically trained on datasets captured under constrained conditions, with limited viewpoint, background, and signer-identity diversity, leading to poor robustness under real-world distribution shifts. We introduce SignNet-1M, a large-scale augmented dataset spanning ASL, CSL, and German Sign Language (DGS). SignNet-1M synthesizes realistic variations along three axes: (i) novel-view rendering (rotation and zoom) via 3D Gaussian Splatting (3DGS), (ii) scene/identity editing via diffusion models for background replacement and signer substitution while preserving sign motion and linguistic content, and (iii) post-rendering augmentations that emulate capture and compression artifacts (e.g., pose/temporal perturbations and video-level corruptions) to better match in-the-wild recordings. Beyond data release, we provide a unified benchmark suite across downstream tasks (e.g., translation and recognition) and ablations that isolate each augmentation component. Experiments across backbones show that training with SignNet-1M consistently improves generalization under cross-view, cross-background, cross-identity, and post-rendering shifts, while maintaining strong in-distribution performance. The dataset, full augmentation pipeline, and benchmark are available at this https URL.

56. 【2606.24353】Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints

Link: https://arxiv.org/abs/2606.24353

Authors: Hojun Choi, Seulbin Hwang, Dae Jung Kim, Kisung Kim, Hyunjung Shim, Jinhan Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: fuses multi-camera images, unified top-down representation, perception fuses multi-camera, autonomous driving, fuses multi-camera

Comments: This paper has been accepted by ECCV 2026

Abstract:Bird's-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them vulnerable to unpredictable real-world environments. In this work, we introduce open-vocabulary BEV segmentation (OVBS), which leverages vision-language models (VLMs) to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. A key challenge in OVBS lies in the 3D geometric inconsistency inherent in the ill-posed lifting of 2D VLM semantics into BEV. To address this, we propose OVBEVSeg, a geometry-aware OVBS framework that enhances efficient Gaussian splatting (GS)-based unprojection by leveraging robust 3D geometric constraints across three progressive stages: (1) 2D-to-BEV pseudo-labeling via reliable 3D projection for OV generalization; (2) joint 2D-BEV per-scene optimization with BEV structural constraints for 3D geometric consistency; and (3) 3D geometric distillation for online efficiency. On the nuScenes dataset, OVBEVSeg achieves state-of-the-art performance, outperforming closed-set methods by 15.3 mIoU on unseen categories. Remarkably, even with no novel-class ground-truth labels, it remains competitive with self- and semi-supervised baselines trained with up to 40% of ground-truth annotations. Furthermore, it achieves 2.5x faster inference with only 0.22x the memory consumption of projection-based methods. Project page: this https URL.
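
The mIoU figure quoted above is a mean of per-class intersection-over-union scores. A minimal sketch of that metric on flattened BEV label grids follows, assuming integer class ids per cell; the function name `mean_iou` is hypothetical, and the paper's exact evaluation protocol (for example, per-class binary masks or thresholds) is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: flat sequences of integer class ids, one per BEV cell.
    Classes absent from both are skipped (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    # Guard against the degenerate case where no class is present at all.
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(round(100 * mean_iou(pred, target, num_classes=2), 2))
```

Reported mIoU values are usually this quantity multiplied by 100, so a gain of 15.3 mIoU corresponds to 0.153 on the 0 to 1 scale used here.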

57. 【2606.24336】TIGER: Taming Identity, Geometry, and Generative Priors for High-Quality Face Video Restoration

Link: https://arxiv.org/abs/2606.24336

Authors: Yang Zhou, Wenxue Li, Peng Zhang, Yifei Chen, Fei Wang, Daiguo Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Face Video Restoration, recover high-fidelity facial, high-fidelity facial videos, Face Video, Video Restoration

Comments:

Abstract:Face Video Restoration (FVR) aims to recover high-fidelity facial videos from degraded input while preserving identity and semantic consistency across frames. Existing methods often struggle to simultaneously address three key challenges: identity shift, viewpoint-entangled guidance, and perceptual realism. To tackle these issues, we propose TIGER, a structured tri-prior fusion framework that Tames Identity, Geometry, and gEnerative pRiors for high-quality FVR. Specifically, an Identity Prior is first established by injecting subject-discriminative embeddings into the latent space, effectively anchoring the subject's identity against severe degradations. Then, to provide temporally consistent structural guidance for dynamic videos, TIGER constructs a Geometry Prior by lifting 2D reference cues into a disentangled 3D parameter space, creating a geometric anchor through cross-source parameter fusion. Moreover, to achieve maximum efficiency without compromising realism, we harness the video generation model's Generative Prior through a one-step rectified flow. We further design a progressive three-stage training optimization strategy that refines structural fidelity, textural reconstruction, and distribution-level realism to ensure robust optimization. We also construct a large-scale FVR dataset to facilitate robust training and standardized evaluation. Extensive experiments demonstrate that TIGER achieves state-of-the-art performance in both identity fidelity and temporal stability, delivering a high-quality, efficient and identity-consistent FVR. Project page: this https URL.

58. 【2606.24335】Ill-Posed by Design: Probing Evidence Use in VLMs

Link: https://arxiv.org/abs/2606.24335

Authors: Boaz Meivar, Shaked Perek, Shani Shvartzman, Eli Schwartz, Shai Avidan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cues independently support, limited on well-posed, independently support, Counterfactual analysis, scene geometry

Comments:

Abstract:Counterfactual analysis is widely used to study evidence use in vision-language models, but its diagnostic value is limited on well-posed tasks: when several cues independently support the same answer, removing one may not change the prediction. We propose monocular metric object-size estimation as an ill-posed diagnostic setting for evidence selection: because physical size cannot be determined from a single uncalibrated image, models must rely on imperfect cues: category priors, target appearance, local context, apparent image size, and scene geometry. We assemble Metric VQA ($10{,}813$ dimension queries from Objectron and $331$ tape-measured in-the-wild scenes) and evaluate $12$ open-weight VLMs ($3$--$397$\,B parameters) with counterfactual analysis decomposing six visual and language evidence channels. Even the largest VLMs tested (Qwen3-VL-235B, Qwen3.5-397B, InternVL3.5-241B) trail a text-only frontier LLM on the in-the-wild split. The diagnostic analysis shows: target identity is the most load-bearing cue, target pixels and local context help only some models, apparent size shifts predictions without a directional readout, and global scene geometry is largely unused. We analyze LoRA fine-tuning as an actionable intervention specific to metric estimation: while the task is learnable, the models do not learn to leverage scene geometry.

59. 【2606.24333】UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation

Link: https://arxiv.org/abs/2606.24333

Authors: Jiahao Lyu, Pei Fu, Zhenhang Li, Shaojie Zhang, Jiahui Yang, Yu Zhou, Can Ma, Zhenbo Luo, Jian Luan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: In-Image Machine Translation, In-Image Machine, translate scene text, Machine Translation, aims to translate

Comments: Accepted by ECCV 2026

Abstract:In-Image Machine Translation (IIMT) aims to translate scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models provide a promising solution by combining visual-text understanding and image generation within a single framework. However, directly adapting such models to IIMT remains challenging. In particular, they often suffer from understanding-generation conflicts, where the translation inferred during understanding is inconsistent with the text supervision used in generation, and spatial position misalignment, where the rendered text does not accurately match the target text regions. To address these issues, we present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing. Specifically, we introduce an Understand-Generation Alignment Module (UGAM) to bridge the representation gap between understanding and generation, encouraging semantic consistency between translated content prediction and text rendering. We further propose a Spatial Mask Decoder (SMD) with pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability during generation. Extensive experiments on multiple benchmarks demonstrate that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts. Moreover, our results reveal a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified translation multimodal learning. Code is available at this https URL.

60. 【2606.24330】REDI-Match: Rotation-Equivariant Distillation for Efficient and Robust Dense Matching

Link: https://arxiv.org/abs/2606.24330

Authors: Yinji Ge, Guixu Zheng, Wulong Guo, Qian Feng, Xu Wu, Kai Zhou, Xinyuan Liu, Fei Xing

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Foundation, Foundation Models, significantly advanced dense, dense feature matching

Comments:

Abstract:Vision Foundation Models (VFMs) have significantly advanced dense feature matching, yet severe in-plane rotation remains a critical challenge. Existing solutions face a fundamental dilemma: data-driven methods require inefficient parameter scaling to implicitly learn rotations, whereas strictly equivariant networks lack the semantic capacity of modern VFMs. Consequently, current frameworks typically freeze VFMs and shift the entire burden of rotation generalization to the downstream decoder. To break this architectural bottleneck, we propose REDI-Match, an efficient framework driven by a novel Rotation-Equivariant Distillation (REDI) paradigm. Instead of relying on rotation data augmentation to establish rotational correspondences, REDI distills the non-equivariant semantic representations of a VFM into a lightweight, strictly rotation-equivariant encoder, leveraging an equivariant geometric architecture to constrain robust high-dimensional semantics. To fully exploit these features, we equip the decoder with an entropy-driven spatial alignment module. By evaluating discrete rotation hypotheses, this mechanism explicitly locks onto the canonical coordinate system, eliminating global ambiguity before continuous refinement. Extensive experiments demonstrate that REDI-Match establishes a new state-of-the-art (SOTA) across multiple benchmarks. Notably, it achieves a 13.89% absolute pose accuracy improvement on the highly challenging SatAst dataset while operating 1.9x faster than the current SOTA (RoMa v2), enabling real-time inference (~41 FPS) on a single RTX 4090 GPU. Code: this https URL.
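
The abstract describes evaluating discrete rotation hypotheses with an entropy criterion but does not give the procedure. One plausible reading is to pick the hypothesis whose match distribution is most peaked, that is, has the lowest entropy. The sketch below illustrates only that generic idea; the names `pick_rotation` and `scores_by_angle`, the softmax normalisation, and the example scores are all assumptions and not the paper's method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw match scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_rotation(scores_by_angle):
    """Return the candidate angle whose match distribution has minimum entropy.

    scores_by_angle: dict mapping a candidate rotation (degrees) to a list of
    raw similarity scores for one query feature (hypothetical input format).
    """
    return min(scores_by_angle, key=lambda a: entropy(softmax(scores_by_angle[a])))


if __name__ == "__main__":
    hypotheses = {
        0: [1.0, 1.0, 1.0, 1.0],    # ambiguous: uniform matches
        90: [5.0, 0.0, 0.0, 0.0],   # confident: one dominant match
        180: [2.0, 1.0, 1.0, 0.0],  # in between
    }
    print(pick_rotation(hypotheses))
```

A low-entropy distribution means the matches agree on a single correspondence, which is why it is a reasonable proxy for the correct coarse alignment before any continuous refinement.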

61. 【2606.24302】TrOCR for Medieval HTR: A Systematic Ablation Study with Cross-Dataset Validation

Link: https://arxiv.org/abs/2606.24302

Authors: Sachin Sharma, Michele Flammini, Federico Simonetta

Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: transformer-based handwritten text, Fine-tuning transformer-based handwritten, handwritten text recognition, visual domain, handwritten text

Comments: Accepted at Document Analysis Systems Workshop 2026 (ICDAR Satellite event)

Abstract:Fine-tuning transformer-based handwritten text recognition (HTR) models on medieval manuscripts is challenging because these models are pre-trained on modern text and must adapt to a very different visual domain. This paper studies how three controllable fine-tuning choices (contrast normalization, data augmentation, and layer freezing) affect recognition accuracy when adapting TrOCR to small historical datasets. We run controlled experiments on a 13th-century Italian manuscript (I-CT 91 "Cortonese") and replicate the same experimental grid on the public READ-16 benchmark as robustness evidence. On Cortonese, our best configuration achieves 8.03% character error rate (CER). Statistical comparisons across 13 configurations show that freezing up to three encoder layers or six decoder layers does not significantly harm accuracy, while deeper freezing becomes progressively detrimental. Removing contrast normalization (CLAHE) yields 7.84% CER, comparable to a domain-specialized baseline, suggesting strong optimization can reduce reliance on image preprocessing. Cross-dataset validation on READ-16 shows that decoder freezing thresholds transfer more robustly than encoder thresholds, and combined freezing strategies require dataset-specific re-validation. Finally, we use Grad-CAM gradient attributions and decoder cross-attention maps to diagnose error patterns and failure modes revealed by the ablations. Source code is available at this https URL
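
Character error rate (CER), the metric reported above, is conventionally the character-level edit distance between the recognised text and the reference, divided by the reference length. A minimal sketch follows, assuming corpus-level aggregation (total edits over total reference characters); the paper's exact normalisation is not stated in the abstract, and the sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical transcription lines, not taken from the datasets.
    refs = ["laudate dominum", "ave maria"]
    hyps = ["laudate dominvm", "aue maria"]
    print(f"CER = {100 * cer(refs, hyps):.2f}%")
```

Under this definition, a CER of 8.03% means roughly eight character edits per hundred reference characters.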

62. 【2606.24301】MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving

Link: https://arxiv.org/abs/2606.24301

Authors: Hongli Xiao, Youjian Zhang, Yucai Bai, Chaoyue Wang, Yaohui Jin, Xiaoguang Ren, Wenjing Yang, Long Lan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: building simulation environment, synthesizing training data, Recovering realistic, simulation environment, autonomous driving scenes

Comments:

Abstract:Recovering realistic 3D vehicle models from autonomous driving scenes is crucial for synthesizing training data and building simulation environments. However, most existing vehicle generation methods fail to fully exploit multimodal sensors (i.e., multi-view images and LiDAR point clouds) and rely on neural-rendering-based reconstruction, leading to low-quality meshes. Recently, native 3D generative models have made significant progress, yet they are not built for arbitrary multi-view inputs and often struggle with in-the-wild driving images. In this work, we present MM-TRELLIS, a multi-modal version of TRELLIS for in-the-wild 3D vehicle generation that integrates LiDAR and image sensors from autonomous driving datasets into native 3D generative models. Specifically, multi-view images are cycled as conditioning inputs, while LiDAR point clouds provide test-time guidance to ensure geometric accuracy and cross-view consistency. During denoising, we first align the guidance point cloud with the model priors, then enforce consistency between the generated geometry and the guidance point cloud. Finally, we introduce a voxel filtering strategy based on the opacity of 3D Gaussian Splatting to suppress floaters and produce clean meshes. Comprehensive experiments on the Waymo dataset demonstrate that our method outperforms existing methods in high-fidelity 3D vehicle generation. Code is available at this https URL.

63. 【2606.24297】Training-free Cross-domain Few-shot Segmentation via Robust Semantic Representation and Matching

Link: https://arxiv.org/abs/2606.24297

Authors: Sujun Sun, Mingwu Ren, Haofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cross-domain Few-shot Segmentation, Few-shot Segmentation, transfer knowledge learned, segmenting unseen target, unseen target classes

Comments: Accepted by ECCV 2026

Abstract:Cross-domain Few-shot Segmentation (CD-FSS) aims to transfer knowledge learned from a source domain to distinct target domains, segmenting unseen target classes with only a few annotated samples. Although existing methods have made significant progress, they still rely on training or fine-tuning processes, which incur high computational costs and risk overfitting. We observe that when powerful and general-purpose vision foundation models are incorporated into these methods, their performance shows only marginal improvement or even degrades due to overfitting. To address this, we eliminate trainable parameters and propose a training-free framework to avoid both training overhead and overfitting. Built upon the self-supervised vision encoder DINOv3, our framework addresses cross-domain challenges through three core modules. First, the Semantic-aware Feature Re-fusion (SAFR) module identifies and re-fuses features that emphasize semantic patterns, generating representations with enhanced semantic discriminability. Additionally, the Adaptive Support Enhancement (ASE) module narrows semantic gaps between support and query through robust query information aggregation. Finally, the Hybrid Prototype Matching (HPM) module integrates matching results from diverse prototypes to adapt to varying semantic complexity across domains. Extensive experiments on four target domain datasets demonstrate that our method achieves state-of-the-art performance in CD-FSS without any training.

64. 【2606.24296】Hierarchical Spatial and Channel Aggregation for Cross-domain Few-shot Segmentation

Link: https://arxiv.org/abs/2606.24296

Authors: Sujun Sun, Mingwu Ren, Haofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cross-domain Few-shot Segmentation, enabling accurate segmentation, Cross-domain Few-shot, learn generalizable segmentation, generalizable segmentation capability

Comments: Accepted by ECCV 2026

Abstract:Cross-domain Few-shot Segmentation (CD-FSS) aims to learn generalizable segmentation capability from abundant annotated samples in the source domain, enabling accurate segmentation of novel classes in the target domain with only a few annotated samples. Existing CD-FSS methods mainly focus on mitigating feature distribution shifts caused by style gaps while ignoring significant differences in class semantic granularity and discriminative attributes across domains, leading to two key degradations in support-query matching: semantic over-alignment and attribute over-alignment. To this end, we propose the Dual Hierarchical Aggregation Network (DHANet), which comprises three key modules. First, the Hierarchical Spatial Aggregation (HSA) module performs multi-scale region aggregation of pixel features along the spatial dimension, generating hierarchical semantic-enhanced features to alleviate semantic over-alignment. Additionally, the Hierarchical Channel Aggregation (HCA) module conducts multi-scale attribute aggregation along the channel dimension, generating hierarchical attribute-enhanced features to mitigate attribute over-alignment. Finally, we propose the Online Probabilistic Semantic Bank (OPSB), which progressively constructs and updates class probability distributions from query predictions during inference, and samples multiple pseudo-prototypes as additional support information to mitigate insufficient support. Extensive experiments on four target-domain datasets demonstrate that our method achieves state-of-the-art performance.

65. 【2606.24292】ActiveScope: Actively Seeking and Correcting Perception for MLLMs

Link: https://arxiv.org/abs/2606.24292

Authors: Yajing Wang, Chao Bi, Junshu Sun, Shufan Shen, Zhaobo Qi, Shuhui Wang, Qingming Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated impressive vision-language

Comments: ICML 2026

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive vision-language understanding, yet still struggle with fine-grained perception in high-resolution images. While existing training-free methods typically rely on attention-based localization or coarse-to-fine search, they are often misled by distractors and fail to locate multiple targets. Our investigation attributes these failures to Contextual Dominance, where salient distractors overwhelm target attention and cause inaccurate localization, and Semantic Bias, where global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization in multi-object scenarios. Built on these insights, we propose ActiveScope, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The Semantic Anchor Localization (SAL) utilizes fine-grained semantic anchors to independently localize key targets, thereby mitigating semantic bias. The Interference-Suppressed Refinement (ISR) refines localization by suppressing attention on salient distractions to overcome contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods (e.g., 96.34 percent accuracy on $V^{*}$ Bench), validating the superiority of the active search and self-correction paradigm. Our code is available at this https URL.

66. 【2606.24286】AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

Link: https://arxiv.org/abs/2606.24286

Authors: Yijing Chen, Wenhui Tan, Xiaoyi Yu, Yuyue Wang, Xin Cheng, Kaisi Guan, Hao Jiang, Xiangyang Li, Guojie Zhu, Ruihua Song

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Omni-modal Large Language, Multimodal Large Language, achieved remarkable progress, comprehension remains challenged

Comments:

67. 【2606.24282】UniRED: Unified RGB-D Video Frame Interpolation with Event Guidance

Link: https://arxiv.org/abs/2606.24282

Authors: Yinuo Zhang, Guangshun Wei, Yuanfeng Zhou, Yiran Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: High frame-rate RGB-D, High frame-rate, including motion analysis, frame-rate RGB-D videos, dynamic scene understanding

Comments:

Abstract:High frame-rate RGB-D videos are crucial for a variety of downstream tasks, including motion analysis, dynamic scene understanding, and 3D reconstruction. However, due to hardware and sensing constraints, practical RGB-D cameras are typically limited to low frame rates, making it difficult to capture rapid scene dynamics. Existing video interpolation methods have achieved strong performance on RGB data, but they are not readily applicable to RGB-D scenarios, where they often yield blurry boundaries, visible artifacts, and degraded geometric consistency. Furthermore, motion estimation from only two boundary frames is inherently under-constrained in complex dynamic scenes. Event cameras, by contrast, provide asynchronous measurements with ultra-high temporal resolution, offering dense motion cues. In this paper, we propose a unified multimodal framework for RGB-D video interpolation that jointly exploits RGB appearance, depth geometry, and event-based temporal cues. Specifically, it first extracts and fuses RGB, depth and event cues, then estimates bidirectional flow with motion basis refinement for RGB and Z-axial refinement for depth, and finally synthesizes the target RGB-D frame via bidirectional warping and soft blending. In addition, we construct a new RGB-D-Event dataset to alleviate the scarcity of tri-modal training data. Extensive experiments on a public benchmark and the proposed dataset demonstrate that our method achieves superior photometric fidelity for RGB interpolation and stronger geometric accuracy for depth interpolation than existing approaches.

68. 【2606.24263】MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling, in an application to tropical cyclones

Link: https://arxiv.org/abs/2606.24263

Authors: Clément Dauvilliers (Inria), Claire Monteleoni (Inria)

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: potentially missing rapid, storm evolution phases, satellite imagery plays, missing rapid storm, rapid storm evolution

Comments:

Abstract:Microwave satellite imagery plays a crucial role in monitoring tropical cyclone precipitation and intensity worldwide, but suffers from long revisit times, potentially missing rapid storm evolution phases. While this raises the need for an interpolation method, it is made challenging by the high level of heterogeneity of microwave data coming from different instruments. In this work, we introduce the first generative model that can be applied to multiple geospatial sources that change across samples, occur at irregular time intervals, are misaligned geographically, and come from instruments with varying characteristics. We apply this model to the case of spatio-temporal interpolation of tropical cyclone microwave images from other microwave and infrared instruments. We train using a self-supervised task in which a random source is masked and reconstructed, and show that it leads to a significant decrease in Continuous Ranked Probability Score over supervised training. We show a further improvement by combining infrared and microwave data compared to microwave only. Using these improvements, the generative model produces an ensemble mean on par with that of a deterministic model, while generating a power spectrum significantly closer to that of true observations. To the best of our knowledge, this is the first generative model that interpolates microwave images of cyclones by combining multiple microwave instruments and infrared observations at irregular time intervals.
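
The Continuous Ranked Probability Score (CRPS) mentioned above scores a probabilistic forecast against a single observation, with lower values being better. For an ensemble of samples it is commonly estimated as the mean absolute error of the members minus half the mean absolute difference between member pairs. The sketch below shows that standard estimator for one scalar (for example, one pixel); the function name is hypothetical, and how the paper aggregates over pixels and samples is not stated in the abstract.

```python
def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble forecast for one scalar observation.

    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|
    """
    m = len(members)
    # Accuracy term: how far the members are from the observation.
    term_obs = sum(abs(x - obs) for x in members) / m
    # Spread term: rewards ensembles that are not collapsed to a point.
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


if __name__ == "__main__":
    # Hypothetical pixel values from four generated samples.
    samples = [210.0, 215.0, 221.0, 230.0]
    print(crps_ensemble(samples, obs=218.0))
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why CRPS allows a generative ensemble and a deterministic model to be compared on the same scale.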

69. 【2606.24257】3DCarGen: Scalable 3D Car Generation via 3D-consistent Multi-view Synthesis

Link: https://arxiv.org/abs/2606.24257

Authors: Hongli Xiao, Youjian Zhang, Yaohui Jin, Xiaoguang Ren, Wenjing Yang, Long Lan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving simulation, driving simulation, assets are essential, essential for autonomous, autonomous driving

Comments:

Abstract:High-quality 3D vehicle assets are essential for autonomous driving simulation. Although multi-view diffusion-based paradigms enable controllable single-image reconstruction, they typically produce limited viewpoints and exhibit cross-view geometric inconsistencies, thereby reducing reconstruction fidelity in real-world scenarios. In this work, we introduce 3DCarGen, a scalable single-view 3D car generation framework designed for real-world images by synthesizing an arbitrary number of 3D-consistent multi-view images. Specifically, given a single image as input, we first synthesize a set of images from fixed viewpoints. These images are then fed into a feed-forward reconstruction model, resulting in a coarse 3D representation based on 3D Gaussian Splatting. Conditioned on this explicit 3D prior, our multi-view diffusion model generates 3D-consistent images from arbitrary camera viewpoints. We further extend a fast mesh reconstruction algorithm by incorporating color-normal joint optimization to recover detailed and coherent 3D vehicle models from the synthesized dense views. Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves robust geometric consistency and reconstruction fidelity compared to existing methods. Code and models will be released.

70. 【2606.24256】Trimming the Long-Tail of Visual World Modeling Evaluation

Link: https://arxiv.org/abs/2606.24256

Authors: Bingxuan Li, Yining Hong, Cheng Qian, Hyeonjeong Ha, Jiateng Liu, Zhenhailong Wang, Yue Guo, Yunzhu Li, Heng Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dominates human experience, interactions remains underrepresented, interactions dominates human, long-tailed distribution, remains underrepresented

Comments:

Abstract:Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.

71. 【2606.24255】Social Structure Matters in 3D Human-Human Interaction Generation

Link: https://arxiv.org/abs/2606.24255

Authors: Zhongju Wang, Beier Wang, Yatao Bian, Pichao Wang, Zhi Wang, Daoyi Dong, Hongdong Li, Huadong Mo, Zhenhong Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved strong progress, synthesizing realistic single-person, HHI requires modeling, governs phase progression, realistic single-person motions

Comments:

Abstract:Although text-to-motion generation has achieved strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human-human interaction (HHI) remains non-trivial, as HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can "think" by recovering phase decompositions and partner-aware roles, but cannot directly "move", as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner-executor paradigm, Think with LLM, Move with Motion Skill. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with the motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.

72. 【2606.24253】TuringViT: Making SOTA Vision Transformers Accessible to All

Link: https://arxiv.org/abs/2606.24253

Authors: Qiman Wu, Hanlin Chen, Lyujie Chen, Rui Xin, Jianlei Zheng, Mingyuan Wang, Jiahui Hu, Da Zhu, Yuecheng Ma, Yuhua Wei, Yizhao Wang, Hua Zhou, Yuheng Zhang, Anhua Liu, Shaman Tang, Yue He, Pengfei Diao, Shuang Su, Haotong Xin, Weichao Huang, Hang Zhang, Xianming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VLA systems commonly, diverse downstream requirements, systems commonly adopt, Modern VLMs, Turing Linear Attention

Comments:

Abstract:Modern VLMs and VLA systems commonly adopt off-the-shelf ViTs such as SigLIP2 as visual encoders, but diverse downstream requirements in latency, temporal modeling, and VLM integration often call for customized SOTA-level ViTs. Training such encoders remains beyond the reach of much of the community, as it requires massive image-text data, while standard softmax attention makes high-resolution or dynamic-resolution pretraining prohibitively costly and often forces low-resolution pretraining followed by post-hoc adaptation. TuringViT addresses these challenges with three key designs: Turing Linear Attention (TLA) for efficient sequence modeling, VISTA-Curation to construct supervision-rich image-video training data, and native dynamic-resolution pretraining that supports flexible inputs from the start and transfers seamlessly to downstream VLMs. As a result, TuringViT outperforms leading open-source ViT baselines with only 10% of the data, achieves stronger downstream VLM performance, and delivers substantially better latency scaling on high-resolution inputs. Our scaling-law analysis further shows that TuringViT continues to improve predictably with curated data scale, far from saturation. Its fast adaptation, hardware-friendly design, and efficient deployment have made it a unified visual foundation across XPeng's AI systems. More broadly, TuringViT provides a reproducible pipeline that dramatically lowers the cost for the community to train, customize, and deploy SOTA-level ViTs, moving toward making such Vision Transformers accessible to all.

73. 【2606.24248】M^2C-EvDet: Multi-Domain Multi-Order Cross-Modal Knowledge Distillation for Event-based Object Detection

Link: https://arxiv.org/abs/2606.24248

Authors: Wei Bao, Siqi Li, Shouan Pan, Yi Xie, Yue Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Event-based object Detection, wide dynamic range, demanding high temporal, high temporal resolution, visual perception paradigm

Comments:

Abstract:Event-based object Detection (EvDet), as a biologically inspired visual perception paradigm, demonstrates superior performance in scenarios demanding high temporal resolution and a wide dynamic range. Nevertheless, the inherent sparse representations and inadequate visual semantics of event data result in a considerable performance disparity between EvDet and frame-based object detection. Previous works attempt to alleviate this cross-modal discrepancy through knowledge distillation, yet they only focus on spatial visual semantics or pair-wise relational information, thus limiting performance in more complex scenarios. To address this challenge, this paper proposes M^2C-EvDet, a Multi-domain and Multi-order Cross-modal knowledge distillation framework for EvDet. Built upon frequency learning and hypergraph computation, M^2C-EvDet integrates two specialized modules: Adaptive Frequency-Decoupled Feature Distillation (AF^2D^2) and Multi-Order Relational Distillation (MORD).

74. 【2606.24234】From Open Waters to Enclosed Cabins: ProteusVPR for Cross-Scene Visual Place Recognition in Maritime Perception and Cabin Inspection

Link: https://arxiv.org/abs/2606.24234

Authors: Zexi Chena, Zitai Huang, Qiwen Gu, Zhiqi Li, Shengli Dong, Chenlei Wang, Junqiao Zhao, Hongdong Wang, Bing Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Visual Place Recognition, Autonomous robotic inspection, Place Recognition, presents unique challenges, Autonomous robotic

Comments:

Abstract:Autonomous robotic inspection in maritime environments presents unique challenges for Visual Place Recognition (VPR) due to cross-scene perceptual shifts. Robots navigating ship-borne environments must transition between visually distinct domains: open decks with sparse textures and severe illumination changes, and enclosed cabins with repetitive structures and high visual ambiguity. Existing VPR methods, designed primarily for urban or indoor scenes, fail to generalize reliably across these starkly different scenarios. To address this, we propose ProteusVPR, a two-stage retrieval-refinement framework. The first stage employs any standard VPR model for initial image retrieval. The second stage introduces a geometric-visual estimation network that fuses the retrieved image with two temporally preceding frames, incorporating geometric descriptors, a local affine coordinate system, and camera azimuth encoding to achieve precise localization. To support this task, we introduce the XHZ dataset, an 8K-panoramic ship-borne dataset collected from an operational vessel, featuring multi-floor cabin structures, deck transition zones, and strict query-database separation for rigorous evaluation. Extensive experiments on the XHZ dataset demonstrate that ProteusVPR consistently improves localization accuracy across multiple VPR backbones, reducing mean localization error by over 60% on average, and that it offers an effective and robust solution for precise visual localization in challenging, cross-scene maritime environments.

75. 【2606.24233】Latent Visual States for Efficient Multimodal Reasoning

Link: https://arxiv.org/abs/2606.24233

Authors: Xiuwei Chen, Wentao Hu, Yongxin Wang, Zisheng Chen, Likui Zhang, Kun Xiang, Jianhua Han, Hui-Ling Zhen, Jingyuan Zou, Hang Xu, Xiaodan Liang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large multimodal models, multimodal models, evidence has significantly, significantly enhanced, enhanced the capabilities

Comments:

Abstract:The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke external tools, a process that introduces rigid dependencies and substantial latency. To overcome these limitations, we propose EVA (LatEnt Visual StAtes), a novel framework that natively generates continuous latent visual representations. These internal representations manifest as an adaptive sequence of Latent_slot tokens, serving as intermediate visual thoughts during the reasoning process. These Latent_slot tokens are then trained end-to-end with the discrete text tokens. This co-optimization, notably, causes extreme policy deviation in the 'transition window' following the Latent_slot tokens. We develop D-GSPO (Decouple-GSPO) to target this root cause by decoupling the optimization of latent and discrete components. To support SFT, we construct EVA-230K, a high-quality text-image interleaved CoT dataset encompassing a diverse range of real-world scenes, documents, charts and OCR tasks. Extensive experiments across multiple benchmarks confirm that EVA achieves significant performance gains while enhancing inference efficiency.

76. 【2606.24232】FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image

Link: https://arxiv.org/abs/2606.24232

Authors: Kim Youwang, Zhengyu Yang, Liuhao Ge, Yu Rong, Timur Bagautdinov, Su Zhaoen, Nir Sopher, Jovan Popović, Teng Deng, Tae-Hyun Oh, Chen Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: instant Gaussian Codec, Codec Avatar generation, Gaussian Codec Avatar, single portrait image, Avatar generation pipeline

Comments: Project page: [this https URL](https://kim-youwang.github.io/FiCA)

Abstract:We introduce FiCA, a Feed-forward, instant Gaussian Codec Avatar generation pipeline that creates lifelike avatars from a single portrait image. Generating a photorealistic and drivable avatar from just a single image is significantly challenging due to the limited visual information available to accurately infer the 3D appearance and geometry of human heads. To address this, we develop a novel system that combines human-centric vision foundation models with a diffusion model. This system is designed to fully exploit partial visual observations to generate lifelike human avatars. Our proposed diffusion model learns a generative mapping from these partial observations to complete and authentic 3D mesh reconstruction. Additionally, we introduce a feed-forward mesh refinement network that enhances the fidelity and identity preservation of the generated avatars, eliminating the need for person-specific test-time optimization. By leveraging a universal prior model that decodes a generated mesh into a set of 3D Gaussians, we generate a photorealistic 3D Gaussian avatar, capable of being driven with novel expressions in real-time. Our experiments demonstrate that the avatars generated by our feed-forward approach faithfully represent diverse identities and surpass the visual quality of avatars produced by recent competing methods.

77. 【2606.24225】Geometry-Instructed Video Editing

Link: https://arxiv.org/abs/2606.24225

Authors: Chirui Chang, Xiaoyang Lyu, Yi-Hua Huang, Haoru Tan, Shizhen Zhao, Yikang Ding, Jianmin Bao, Xin Tao, Pengfei Wan, Xiaojuan Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital content creation, including translating, content creation, routine operations, operations in digital

Comments:

Abstract:Object-level geometric edits, including translating, rotating, scaling, duplicating, or removing an object, are routine operations in digital content creation (DCC) workflows, yet they remain unreliable in generative video editing. The key challenge lies in specifying the target object's 3D state change unambiguously across viewpoint and time, while consistently updating geometry-dependent secondary effects such as shadows and reflections. We introduce GIVE, a geometry-instructed video editing framework that represents edits through a unified object-state formulation. Two video-aligned geometry streams describe the target object before and after editing: a depth-box encoding coarse 3D placement and extent, and an orientation-box providing an appearance-agnostic orientation cue. Together, these streams provide a compact pre/post geometric specification for object-state transitions. To provide paired supervision for learning these edits, we build a scalable graphics-engine pipeline that executes object-level edit programs and renders controlled before/after pairs, isolating the intended geometric edit while keeping secondary effects consistent with the transformation. Experimental results demonstrate that GIVE produces faithful geometric edits with temporal coherence and consistent secondary effects across operators in a unified framework, and shows promising transfer to in-the-wild videos. Project page: this https URL

78. 【2606.24214】MorVess: Morphology-Aware Pulmonary Vessel Segmentation Network

Link: https://arxiv.org/abs/2606.24214

Authors: Fuyou Mao, Yifei Chen, Beining Wu, Lixin Lin, Jinnan Dai, Zhiling Li, Yilei Chen, Yaqi Wang, Hao Zhang, Yan Tang, Huiyu Zhou, Feiwei Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, Accurate pulmonary vessel, vessel segmentation remains, segmentation remains challenging, Accurate pulmonary

Comments:

Abstract:Accurate pulmonary vessel segmentation remains challenging due to the sparse, tortuous, and multi-scale nature of vascular structures, where small branches are easily lost and topology integrity is difficult to preserve under voxel-wise supervision. Existing deep segmentation models primarily optimize binary masks, lacking explicit geometric constraints, thus struggling to recover continuous tubular morphology and fine vascular connectivity. In this study, we introduce MorVess, a morphology-aware segmentation framework that integrates differentiable geometric priors with large-scale foundation model adaptation to achieve fine-grained vascular parsing. MorVess jointly predicts vessel masks, distance maps, and thickness maps, providing explicit supervision for vascular boundaries, centerline consistency, and smooth diameter transitions. A lightweight 2.5D adapter bridges 3D spatial context and 2D SAM representations, while a global-local fusion block aggregates multi-level semantics and geometric cues for high-fidelity topology reconstruction. Across two challenging pulmonary CT benchmarks, MorVess delivers superior Dice, clDice, and HD95 scores, substantially improving small-vessel recovery and global connectivity. These results demonstrate that embedding geometric intelligence into pretrained vision models offers a principled and scalable pathway toward precise vessel analysis and clinically reliable structural quantification. Our source code is available at this https URL.

79. 【2606.24206】Inclusive Interactive Collisions for Multi-View Consistent Compositional 3D Generation

Link: https://arxiv.org/abs/2606.24206

Authors: Chang Liu, Mingwen Shao, Xiang Lv, Xinyuan Chen, Lingzhuang Meng, Qiao Zhang, Zhengyi Gong, Jinghao Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent breakthroughs, Score Distillation Sampling, Adaptive Score Distillation, Score Distillation, Distillation Sampling

Comments:

Abstract:Recent breakthroughs in 3D generation have been driven by the development of text-to-image diffusion models. However, existing methods still face two practical challenges: (1) They primarily generate a single 3D object and struggle to generate multi-object compositional 3D assets due to the lack of modeling of reasonable interactions between Gaussian primitives. (2) They often suffer from cross-view inconsistency during 3D optimization, as Score Distillation Sampling inherently operates on each single view, inevitably resulting in cross-view hallucinations. To address these issues, we propose I2C-3D, a novel optimization-based method for generating multi-view consistent compositional 3D assets with reasonable interactions. Specifically, we propose an Inclusive Interactive Collisions strategy that guides Gaussian primitives to appear naturally in reasonable interaction regions, thereby ensuring that objects in the compositional scene interact in a physically plausible and visually coherent way. Additionally, to enhance multi-view consistency, we devise Multi-View Adaptive Score Distillation Sampling, which distills a multi-view consistency prior and a layout prior from a pre-trained diffusion model by modulating the attention maps of instance tokens and spatial tokens across viewpoints. With these designs, I2C-3D not only generates high-fidelity, multi-view consistent compositional 3D assets but also supports flexible 3D editing, facilitating complex scene generation. Extensive experiments demonstrate that I2C-3D outperforms existing methods in generation quality and multi-view consistency.

80. 【2606.24192】Co-occurring associated retained concepts in Diffusion Unlearning

Link: https://arxiv.org/abs/2606.24192

Authors: Miso Kim, Georu Lee, Yunji Kim, Hoki Kim, Jinseong Park, Woojin Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: mitigate harmful content, harmful content generation, key technique, technique to mitigate, mitigate harmful

Comments: Accepted as a poster at ICLR 2026. Code available at [this https URL](https://github.com/damilab/CARE)

81. 【2606.24187】Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling

Link: https://arxiv.org/abs/2606.24187

Authors: Kun Zhang, Chenxin Fang, Tao Chen, Baiyang Song, Yunhang Shen, Yiyi Zhou, Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Long video understanding

Comments: NeurIPS 2026 submission. 15 pages, 8 figures

Abstract:Long video understanding remains a daunting challenge for Multimodal Large Language Models (MLLMs) due to the excessive computation and memory footprint. Keyframe selection is therefore often adopted to mitigate this shortcoming, but it still suffers from low flexibility and high noise due to its hard sampling principle. In this paper, we define video frame selection as a problem of Quasi-Gaussian Sampling, and propose an adaptive and training-free approach termed AdaQ. Inspired by the 3-σ rule of the Gaussian distribution, the objective of AdaQ is to achieve the optimal 3-σ interval for different examples, i.e., a smaller 3-σ interval for a local query and a larger one for a global query, thereby facilitating robust and adaptive frame sampling. To validate AdaQ, we apply it to four MLLMs with three embedding models. The extensive experimental results not only show clear performance gains over the default MLLMs and the SOTA keyframe selection methods, e.g., helping Qwen3-VL-8B outperform GPT4o by 15.8% on average using only 64 frames, but also confirm its superior robustness and high efficiency for long-video understanding, e.g., only 1 hyper-parameter needs to be set. Our code is available at this https URL.

82. 【2606.24180】Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms

Link: https://arxiv.org/abs/2606.24180

Authors: Afifa Khaled, Said Jadid Abdulkadir, Majdy Mohamed Eltayeb Eltahir

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Three-dimensional scene completion, including autonomous navigation, Three-dimensional scene, research contributions made, contributions made

Comments:

Abstract:Three-dimensional scene completion has evolved into a major problem in computer vision and robotics, with diverse applications including autonomous navigation and augmented reality. This study presents a systematic review of the research contributions made over the last ten years (2016 to 2026), during which the field has moved from the voxel-based semantic completion paradigm represented by SSCNet to the latest paradigm, which combines generative diffusion priors with real-time rendering using Gaussian splatting. The study discusses the evolution of representation paradigms, including voxel grids, point learning, implicit neural fields, transformer networks, diffusion networks, and the latest paradigm based on rendering-aware 3D Gaussian primitives. It provides a comprehensive analysis of these contributions and develops a taxonomy to organize them, and it discusses the challenges that still need to be addressed. Finally, the study presents a research agenda outlining directions for the development of next-generation systems.

83. 【2606.24178】Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring

Link: https://arxiv.org/abs/2606.24178

Authors: Dominik Lindner, Johann Schmidt, Tom Siegl, Martin Becker, Sebastian Stober

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Pretrained vision models, object class unchanged, Pretrained vision, class unchanged, object class

Comments:

Abstract:Pretrained vision models often misclassify inputs that are rotated, scaled, or sheared, even though these affine transformations leave the object class unchanged. Robustness is usually restored either by building equivariance into the architecture or by retraining with augmentation, both of which require changing or retraining the model. Test-time canonicalization instead leaves the classifier untouched. It undoes the transformation of each input, mapping it to a canonical form near the training distribution before classification. Existing canonicalizers, however, rely on a narrow set of logit-based energy scores and bespoke search procedures, leaving the design space of scoring functions and optimizers unexplored. We reframe canonicalization as out-of-distribution (OOD) detection, which lets any OOD score serve as the energy minimized over transformations. Across benchmarks ranging from handwritten characters and sketches to natural images and 3D point clouds, we systematically evaluate around twenty OOD scores and nine search algorithms, finding that distance-based scores paired with random search and local refinement perform best overall. Because canonicalizing an already-aligned input can hurt accuracy, we add a gated mechanism that transforms an input only when its OOD score indicates this is needed, preserving most in-distribution accuracy while retaining the robustness gains on transformed inputs. Code is available at this http URL.
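
The idea in this abstract can be illustrated with a toy sketch. This is not the authors' code: the 2D point data, the distance-based `ood_score`, the `canonicalize` helper, and the `gate_threshold` value are all assumptions made for illustration. It treats canonicalization as random search over rotation angles that minimizes an OOD score, and it gates the search so that inputs that already look in-distribution are left untouched. The local refinement step mentioned in the abstract is omitted.

```python
import math
import random

def rotate(points, angle):
    """Rotate a list of 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def ood_score(points, reference):
    """Toy distance-based OOD score: mean squared distance to a reference shape."""
    return sum((px - rx) ** 2 + (py - ry) ** 2
               for (px, py), (rx, ry) in zip(points, reference)) / len(points)

def canonicalize(points, reference, n_samples=2000, gate_threshold=1e-3, seed=0):
    """Random search over rotations; gated so aligned inputs are not transformed."""
    base = ood_score(points, reference)
    if base <= gate_threshold:  # gate: input already looks in-distribution
        return points, 0.0
    rng = random.Random(seed)
    best_angle, best_score = 0.0, base
    for _ in range(n_samples):
        angle = rng.uniform(-math.pi, math.pi)
        score = ood_score(rotate(points, angle), reference)
        if score < best_score:
            best_angle, best_score = angle, score
    return rotate(points, best_angle), best_angle

# Hypothetical stand-in for the training distribution: one reference shape.
reference = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0)]
query = rotate(reference, 0.8)  # a rotated test input
aligned, angle = canonicalize(query, reference)
```

In this toy setting the recovered angle should land close to -0.8 radians, undoing the rotation applied to `query`, while calling `canonicalize(reference, reference)` returns the input unchanged because the gate fires.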

84. 【2606.24175】Tri-Efficient Transfer Learning for Point Cloud Videos

Link: https://arxiv.org/abs/2606.24175

Authors: Yiding Sun, Dongxu Zhang, Jihua Zhu, Haozhe Cheng, Zhengqiao Li, Pengcheng Li, Chaowei Fang, Yonghao Dong, Lin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prohibitive annotation costs, cloud video understanding, cloud foundation models, significantly advanced point, severe memory bottlenecks

Comments:

Abstract:While point cloud foundation models have significantly advanced point cloud video understanding, existing parameter-efficient fine-tuning (PEFT) methods still suffer from two critical limitations: prohibitive annotation costs for large-scale point cloud datasets and severe memory bottlenecks. In this paper, we aim to mine richer supervision signals from existing data rather than blindly scaling datasets. A further key principle is that the memory footprint of fine-tuning must be drastically reduced compared to full fine-tuning, which remains elusive for current PEFT techniques. Driven by these challenges, we identify three core desiderata: data-, parameter-, and memory efficiency, and present PoinTriE, a unified framework that excels along all three dimensions. For pre-training, pseudo-motion trajectories are synthesized via rigid transformations, paired with text corpora and 2D projections derived from raw point clouds. We then propose a Geometric-Motion Duality Network optimized via multimodal contrastive learning, rigid rotation prediction, and motion distribution divergence to produce dense self-supervision. During fine-tuning, we freeze the pretrained backbone and only update a lightweight Spatio-temporal Side Network built with LoRA units. Equipped with a gradient flow masking strategy, PoinTriE simultaneously reduces memory consumption and parameter overhead. Extensive experiments confirm that PoinTriE establishes new state-of-the-art results on action recognition and semantic segmentation tasks.

85. 【2606.24165】Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.24165

Authors: Bin Chen, Yuxiang Cai, Yadan Luo, Yi Zhang, Jianwei Yin, Zhi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, accelerating Multimodal Large, Language Models, Large Language

Comments: Accepted to ECCV 2026

Abstract:Reducing visual token redundancy is critical for accelerating Multimodal Large Language Models (MLLMs) without degrading cross-modal reasoning performance. Existing token pruning methods typically rely on single-layer signals, such as attention scores or token similarities, which overlook the cross-layer transformation of visual representations and may exhibit positional bias in multimodal token sequences. To address this limitation, we propose a training-free token pruning framework based on Cross-Layer Spectral Evolution (CLSE). Instead of measuring token importance from single-layer feature magnitudes, CLSE quantifies how token representations evolve across Transformer layers in the frequency domain. This evolution reflects the transition from high-frequency structural details to low-frequency semantic abstractions. We observe that tokens with stronger spectral redistribution across layers are more likely to be semantically active and should therefore be preserved. By modeling cross-layer token dynamics, CLSE provides a stable importance criterion that mitigates positional bias. Extensive experiments on both image and video benchmarks demonstrate that CLSE achieves a superior trade-off between efficiency and accuracy under aggressive token reduction. Across multiple MLLMs, CLSE reduces FLOPs, KV cache memory, and latency while maintaining competitive or improved performance.
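
To make the notion of cross-layer spectral redistribution concrete, here is a minimal sketch under strong assumptions. The abstract does not give the CLSE formula, so the choices below are hypothetical: a naive DFT over the feature dimension, a normalized power spectrum, and an L1 distance between the spectra of the same token at two layers (`spectral_shift`). Tokens whose spectrum changes more are treated as more important.

```python
import cmath

def power_spectrum(vec):
    """Normalized power spectrum of a real vector via a naive DFT."""
    n = len(vec)
    power = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(vec))
        power.append(abs(coeff) ** 2)
    total = sum(power) or 1.0  # avoid division by zero for all-zero input
    return [p / total for p in power]

def spectral_shift(early, late):
    """L1 distance between one token's normalized spectra at two layers."""
    return sum(abs(a - b)
               for a, b in zip(power_spectrum(early), power_spectrum(late)))

# Hypothetical features for two tokens at an early and a late layer.
# Token 0 stays high-frequency (only rescaled); token 1 moves from a
# high-frequency pattern to a constant, low-frequency one.
early_layer = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]
late_layer = [[2.0, -2.0, 2.0, -2.0], [1.0, 1.0, 1.0, 1.0]]

scores = [spectral_shift(e, l) for e, l in zip(early_layer, late_layer)]
keep = max(range(len(scores)), key=lambda i: scores[i])
```

Because the spectra are normalized, rescaling a token does not count as redistribution, so token 1 is the one selected in `keep`.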

86. 【2606.24161】Dual-Branch Cross-Projection Debiasing through Diffusion-based Disentanglement

Link: https://arxiv.org/abs/2606.24161

Authors: Xiangqian Zhao, Xinyang Jiang, Zhipeng Xu, Lingfeng He, Zilong Wang, Dongsheng Li, De Cheng, Nannan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Foundation models trained, Foundation models, resulting in poor, trained on biased, poor generalization

Comments:

Abstract:Foundation models trained on biased datasets often rely on spurious correlations between target labels and non-causal attributes, resulting in poor generalization on minority groups. Bias mitigation remains challenging due to two fundamental issues. First, when group labels are unavailable, existing group-unsupervised methods typically infer spurious attributes implicitly from model behavior, making it difficult to identify spurious factors that are semantically aligned with real-world biases. Second, even with pseudo spurious supervision, most existing debiasing methods follow a single-branch design that operates within a single shared feature space, where target and spurious attributes are intrinsically entangled. To address the first challenge, we introduce Confidence-guided Bias Concept Mining (CBCM), which leverages diffusion-disentangled, semantically grounded concept representations to identify reliable spurious attributes without attribute annotations. To address the second challenge, we propose Dual-branch Cross-projection Debiasing (DCD), a prompt-tuning framework that separates target and spurious representations into two branches and explicitly removes spurious information through cross null-space projection while preserving target-relevant semantics. Extensive experiments on four benchmark datasets show that our method achieves state-of-the-art worst group accuracy among group-unsupervised approaches, while tuning at most 0.22% of the model parameters. The source code is available in the supplementary materials.
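
The null-space projection mentioned in this abstract can be sketched with plain linear algebra. This is a generic illustration, not the DCD implementation: the vectors are made up, `project_out` is a hypothetical helper, and only one direction of the cross projection (removing spurious directions from a target feature) is shown. The feature is projected onto the orthogonal complement of the span of the spurious directions, which is the null space of the matrix whose rows are those directions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip linearly dependent directions
            basis.append([wi / norm for wi in w])
    return basis

def project_out(feature, spurious_directions):
    """Remove every component of `feature` that lies in the spurious span."""
    cleaned = list(feature)
    for b in orthonormal_basis(spurious_directions):
        coeff = dot(cleaned, b)
        cleaned = [ci - coeff * bi for ci, bi in zip(cleaned, b)]
    return cleaned

# Hypothetical spurious-attribute directions and a target-branch feature.
spurious = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
feature = [3.0, -2.0, 5.0]
cleaned = project_out(feature, spurious)
```

After projection, `cleaned` has zero inner product with every spurious direction, while the component outside the spurious span (the third coordinate here) is preserved.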

87. 【2606.24156】Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction

Link: https://arxiv.org/abs/2606.24156

Authors: Zengjie Chen, Yuxiang Cai, Jingcai Guo, Taotao Cai, Jianwei Yin, Zhi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, accelerating Multimodal Large

Comments: Accepted to ECCV 2026

Abstract:Visual token reduction has emerged as an effective strategy for accelerating Multimodal Large Language Models (MLLMs). Many existing methods prune tokens by ranking text-visual attention scores. However, we show that attention is often dominated by a model-induced prior: even without textual instruction, MLLMs tend to focus on certain task-agnostic regions. Consequently, the attention scores of instruction-conditioned tokens are suppressed, increasing the risk that these tokens are discarded during pruning. To address this issue, we propose Prior-Corrected Token Reduction (PriorTR), a training-free token reduction method that explicitly separates task-conditioned attention from the model-induced prior. PriorTR estimates the attention map of the prior, and contrasts it with the task-conditioned attention distribution to measure the additional usable information contributed by each visual token. Importantly, PriorTR computes both the model-induced prior and the task-conditioned posterior within a single forward pass by introducing a null token that serves as an instruction-agnostic probe in the attention block. This design avoids duplicated propagation. Extensive experiments across multiple multimodal benchmarks and MLLMs demonstrate that PriorTR consistently improves the trade-off between accuracy and efficiency over strong training-free baselines, particularly under aggressive token budgets.
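
The contrast between prior and task-conditioned attention can be illustrated with a small sketch. The abstract does not state the exact scoring rule, so the log-ratio score in `prior_corrected_scores` is an assumption, as are the attention values and helper names. The single-forward-pass null token is not modeled; both attention distributions are simply given as inputs.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def prior_corrected_scores(task_attn, prior_attn, eps=1e-8):
    """Task attention weighted by its log-ratio to the prior attention."""
    p, q = normalize(task_attn), normalize(prior_attn)
    return [pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)]

def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical attention over five visual tokens.
prior_attn = [0.50, 0.30, 0.05, 0.05, 0.10]  # instruction-agnostic prior
task_attn = [0.40, 0.25, 0.20, 0.10, 0.05]   # instruction-conditioned

kept_raw = keep_top_k(task_attn, 2)
kept_corrected = keep_top_k(prior_corrected_scores(task_attn, prior_attn), 2)
```

With these numbers, ranking by raw attention keeps tokens 0 and 1, which the prior already favors, whereas the corrected score keeps tokens 2 and 3, whose attention rises once the instruction is present.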

88. 【2606.24153】Differential Unfolding: Efficient Unfolding Reconstruction for Video Snapshot Compressive Imaging

Link: https://arxiv.org/abs/2606.24153

Authors: Muyuan Zhang, Jiancheng Zhang, Haijin Zeng, Yin-ping Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Snapshot Compressive Imaging, video Snapshot Compressive, Deep Unfolding Networks, dominate video Snapshot, Compressive Imaging

Comments:

Abstract:While Deep Unfolding Networks (DUNs) dominate video Snapshot Compressive Imaging (SCI), they remain constrained by a uniform design philosophy. Existing methods repeatedly stack high-complexity priors with identical structures, ignoring the fact that optimization trajectories converge toward static states. This results in representation stagnation, where high-cost computations are wasted on minimal feature updates. To address this inefficiency, we present Differential Unfolding (DU), a heterogeneous framework that replaces uniform repetition with dynamic evolution. Central to DU is the Differential Evolutionary Framework (DEF), which partitions the unfolding process into two complementary roles: structural anchoring and differential evolution. In this scheme, high-parameter general stages are sparsely deployed to generate high-fidelity feature foundations. Complementing these, lightweight differential stages employ a Differential Representation Prior (DRP) to propagate and refine these foundational features through a differential mechanism. By integrating Differential Representation Attention (DRA) for evolving attention maps and a Differential Modulated FFN (DM-FFN) for feature rectification, DRP effectively models cross-stage variations with minimal overhead. By focusing computational resources on dynamic evolution rather than static redundancy, DU achieves a superior trade-off between accuracy and efficiency. Extensive experiments verify that our method establishes new state-of-the-art results while significantly slashing computational overhead.

89. 【2606.24152】Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models

Link: https://arxiv.org/abs/2606.24152

Authors: Xin Wang, Wenxuan Liu, Tongtong Feng, Wenwu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Existing literature claims, Existing literature, video generation essentially, video generation, literature claims

Comments: 5 pages, 1 figure

Abstract:Existing literature claims that video generation is essentially world modelling. On the one hand, the claim is productive because it pushes generative AI beyond static images and toward temporally extended physical scenes. On the other hand, this claim dangerously relies on the belief that scaling visual prediction alone will automatically yield physical agents. We prefer a more accurate statement: video generation models learn a partial, implicit spatiotemporal world model, but not a fully grounded or controllable one. The reason is as follows: a model may generate a plausible video of a drone crossing a forest or a robot arm manipulating a cup, yet still fail to know which variables are controllable, which constraints belong to a particular body, and which futures remain valid under intervention. The frontier, in essence, is not predictive realism alone; rather, it is a self-evolving generative capability whose decisive criterion is counterfactual controllability: the capability to ask what would happen under an action, to test whether the generated future can survive embodiment constraints, and to feed the resulting action knowledge back into future imagination (generation). Therefore, in this paper we present a new perspective: autonomous video generation with counterfactual controllability is one promising way to realize self-evolving world models.

90. 【2606.24144】Geometry-Aware Style Transfer in 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.24144

Authors: Min Hyeok Bang, Jun Hyeong Kim, Seung-Wook Kim, Se-Ho Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simultaneously transfers appearance, transfers appearance attributes, geometric structures, appearance attributes, attributes and geometric

Comments: 14 pages, 7 figures, accepted at ECCV 2026

Abstract:In this paper, we present a novel geometry-aware style transfer framework for 3D Gaussian splatting (3DGS) that simultaneously transfers appearance attributes and geometric structures. Unlike prior works that primarily focus on color-based stylization and often overlook structural adaptation, our method explicitly incorporates geometry adaptation through a decoupled optimization scheme that alternately updates color and geometry parameters. This strategy alleviates potential interference between color and geometry updates, leading to stable and consistent scene-level geometry transformation. The decoupled optimization is enabled by the proposed geometry-aware contrastive feature matching (GCFM). GCFM integrates RGB, depth, and edge cues into a contrastive objective and is employed in both optimization phases to effectively transfer structural characteristics from style images to Gaussian primitives. Extensive experiments show that our approach achieves superior performance in both qualitative fidelity and quantitative metrics, significantly outperforming existing 3DGS-based stylization methods. Our code is available at this https URL.

91. 【2606.24138】Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image

Link: https://arxiv.org/abs/2606.24138

Authors: Tongyan Hua, Dongli Wu, Jinjing Zhu, Yinrui Ren, Zhongcheng Hong, Ying-Cong Chen, Hui Xiong, Wufan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single satellite image, urban simulation, digital twins, geospatial intelligence, Generating explicit

Comments:

Abstract:Generating explicit 3D city assets from a single satellite image is important for digital twins, urban simulation, and geospatial intelligence. Unlike satellite-to-street-view synthesis, the task requires a reusable textured mesh with plausible geometry and controllable appearance rather than a 3D proxy optimized only for rendering a small set of images or videos. The ICCV Sat2City framework made a first step by conditioning cascaded sparse-voxel latent diffusion on satellite-derived height maps, but its appearance was random, its training data were synthetic, and its task-specific VAE did not scale well to noisy real-world reconstructions. We present Sat2City v2, a journal extension that adapts a pretrained native structured-latent 3D foundation model to weakly aligned satellite images and textured meshes. We build a real-world dataset with 16,241 satellite-mesh pairs across 24 regions in 9 cities. Instead of learning a 3D representation from noisy city meshes, Sat2City v2 encodes each mesh into a pretrained native 3D latent space, fine-tunes a satellite-conditioned geometry flow, and uses the decoded shape to anchor satellite-conditioned texturing. This retains Sat2City's geometry-to-appearance cascade while enabling appearance-controllable generation from the satellite input. Experiments on metric-scale DSM reconstruction and generative city-asset benchmarks for geometry and appearance show that Sat2City v2 achieves the best overall performance among evaluated baselines. Overall, Sat2City v2 advances satellite-to-city generation from rendering-oriented 3D proxies to explicit textured mesh assets, supported by, to the best of our knowledge, the first documented satellite-mesh paired dataset collected from matched geographic crops for this asset-level task. Project page: this https URL

92. 【2606.24122】Bengal-HP_RU: A Dataset of Bengal People For Head Pose Estimation

Link: https://arxiv.org/abs/2606.24122

Authors: Md. Ahanaf Arif Khan, Md. Tawhidur Rahman, Sangeeta Biswas, Md. Iqbal Aziz Khan, Subrata Pramanik, Sanjoy Kumar Chakravarty, Bimal Kumar Pramanik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: East Asian origin, leaving South Asian, South Asian populations, Existing head pose, Western or East

Comments:

Abstract:Existing head pose datasets predominantly feature subjects of Western or East Asian origin, leaving South Asian populations, particularly Bengali individuals, largely underrepresented. We introduce Bengal-HP_RU, the first publicly available head pose dataset centred on Bengali subjects, comprising 12,894 labelled head images annotated with continuous yaw, pitch, and roll values. Images were collected from Wikimedia Commons under free licences and processed through an automated pipeline followed by manual label correction. The dataset is partitioned by Wikimedia uploader identity to prevent data contamination, yielding 10,494 training and 2,400 test images across 296 unique uploaders. Bengal-HP_RU exhibits substantial diversity in subject age, gender, occlusion, illumination, and background, reflecting realistic in-the-wild conditions. The dataset is publicly available at this https URL.
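
The uploader-based partition mentioned in the abstract can be illustrated with a small group-wise split. This is only a sketch of the general idea, not the authors' pipeline: the function name `split_by_group`, the 20% test share, and the toy file and uploader names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_group(records, test_fraction=0.2, seed=0):
    """Split (item, group) pairs so that no group lands in both splits.

    Hypothetical helper: `group` plays the role of the uploader identity.
    """
    by_group = defaultdict(list)
    for item, group in records:
        by_group[group].append(item)

    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)

    total = len(records)
    train, test = [], []
    for group in groups:
        # Fill the test split until it reaches the target share,
        # then send every remaining group to the training split.
        target = test if len(test) < test_fraction * total else train
        target.extend(by_group[group])
    return train, test


# Toy data: 70 images spread evenly over 7 made-up uploaders.
records = [(f"img_{i}.jpg", f"uploader_{i % 7}") for i in range(70)]
train, test = split_by_group(records)
```

Because whole groups are assigned at once, the realised test share can overshoot the target slightly; the guarantee is only that the two splits share no uploader.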

93. 【2606.24120】Flood Mapping from RGB imagery using a Vision Foundation Model

Link: https://arxiv.org/abs/2606.24120

Authors: Vladyslav Polushko, Tilman Bucher, Ronald Rösch, Thomas März, Markus Rauhut, Andreas Weinmann

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: damage assessment, Timely, extent around settlements, settlements are essential, essential for emergency

Comments:

Abstract:Timely, high-resolution maps of flood extent around settlements are essential for emergency response and damage assessment. We consider airborne RGB imagery for flood mapping as it can be collected rapidly at low cost. To produce flood maps, deep learning models for water segmentation are often used, typically CNN-based models and small vision transformers. However, they need much data to adapt to a change of scenery, i.e., another flooding event. Vision foundation models or large vision transformers are known to generalize across domains. Recently, foundation models for Earth observation became available. They are pretrained on satellite data, whose spatial resolution, viewing geometry, and radiometry differ from nadir RGB imagery. Thus, adaptation is required. We investigate how a satellite-pretrained Earth observation foundation model can be adapted to centimeter-scale floodwater mapping from RGB imagery. Specifically, we fine-tune a model we call Prithvi-2.0-UPN, consisting of the Prithvi-EO-2.0-600M Vision Transformer combined with a UPerNet decoder, for binary water segmentation on two RGB datasets (BlessemFlood21, NeuenahrFlood). In a first experiment we observe that Prithvi-2.0-UPN reaches state-of-the-art results on BlessemFlood21 and NeuenahrFlood when trained on their datasets. In a second experiment we show that Prithvi-2.0-UPN performs better than state-of-the-art baseline models for transfer to a new flood event (trained on BlessemFlood21, tested on NeuenahrFlood) in a zero-shot setting. However, the performance indicates room for improvement. In this respect, we investigate in a third experiment how performance improves when further fine-tuning the models with small shares of NeuenahrFlood training data: Prithvi-2.0-UPN improves the fastest and almost reaches the performance level obtained when fully trained on NeuenahrFlood, indicating transfer capabilities.

94. 【2606.24118】An LMM for Precisely Grounding Elements in Documents

Link: https://arxiv.org/abs/2606.24118

Authors: Yijian Lu, Chuangxin Zhao, Kai Sun, Lei Hou, Juanzi Li, Ji Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multimodal Models, Large Multimodal, document error detection, ability for Large, Multimodal Models

Comments:

Abstract:Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research, and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM that is specifically designed for precise element grounding and can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data with two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information in CVs. Furthermore, we introduce a training paradigm for visual grounded reasoning where the grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.

95. 【2606.24115】A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

Link: https://arxiv.org/abs/2606.24115

Authors: Aminu Lawal, Niyoj Oli, Sachin Acharya, Prashnna Gyawali, Maria Carmen Romano, Binod Bhattarai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Question Answering, hallucination detection methods, Vision-language models, Semantic Entropy, clinical practice

Comments: Accepted at the Medical Image Understanding and Analysis (MIUA) 2026 conference

Abstract:Vision-language models (VLMs) are prone to hallucination, which remains a major barrier to their safe deployment in clinical practice. To date, most hallucination detection methods have been evaluated on radiology benchmarks such as MIMIC-CXR and VQA-RAD, while gastrointestinal (GI) endoscopy remains largely underexplored. In this paper, we benchmark nine hallucination detection methods on the Gut-VLM dataset, a GI diagnostic Visual Question Answering (VQA) dataset with 4,392 test VQA pairs, across five VLMs (MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B). The methods span three categories: black-box methods (RadFlag, SelfCheckGPT-NLI), gray-box methods (AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, and VASE), and a white-box method (ReXTrust). Our results show that ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average (range: 9.5 to 33.5), with ReXTrust maintaining strong performance even on LLaVA-v1.6-7B (AUC 79.9), where black-box methods and clustering-based gray-box methods collapse to near-chance performance. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives, outperforming both clustering-based gray-box methods (Semantic Entropy, VASE) and black-box approaches on average. We further identify confident confabulation, a failure mode in which models hallucinate with high inter-sample consistency or high token-level probability, as a systemic failure for both consistency and uncertainty-based methods.
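
To make the token-level gray-box statistics and the AUC figures more concrete, here is a minimal sketch. The abstract does not define AvgProb, MaxProb, AvgEnt, or MaxEnt, so the definitions below (mean and maximum of per-token negative log-probability and of per-token entropy) are an assumption, and the function names and toy distributions are hypothetical.

```python
import math


def token_uncertainty(step_dists, chosen):
    """Summarise uncertainty over one generated answer.

    step_dists: one probability distribution (list of floats) per generated token.
    chosen: index of the token actually emitted at each step.
    Assumed definitions, not taken from the paper.
    """
    nll = [-math.log(dist[idx]) for dist, idx in zip(step_dists, chosen)]
    ent = [-sum(p * math.log(p) for p in dist if p > 0) for dist in step_dists]
    return {
        "avg_nll": sum(nll) / len(nll),
        "max_nll": max(nll),
        "avg_ent": sum(ent) / len(ent),
        "max_ent": max(ent),
    }


def auc(scores, labels):
    """Probability that a hallucinated answer (label 1) outscores a correct one (label 0).

    Ties count as half a win; both classes must be present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy two-token answers over a two-word vocabulary.
confident = token_uncertainty([[0.9, 0.1], [0.8, 0.2]], [0, 0])
unsure = token_uncertainty([[0.5, 0.5], [0.6, 0.4]], [0, 1])
```

A detector of this kind flags an answer when its score exceeds a threshold; the AUC summarises how well the score ranks hallucinated answers above correct ones across all thresholds. The "confident confabulation" failure mode corresponds to hallucinated answers whose scores look like `confident` above.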

96. 【2606.24107】DramaDirector: Geometry-Guided Short Drama Generation

Link: https://arxiv.org/abs/2606.24107

Authors: Hengji Zhou, Sijie Liu, Jianrun Chen, Xingchen Zou, Lianghao Xia, Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: dialogue-driven focus shifts, demanding cinematographic grounding, generation pipelines struggle, rapid shot rhythms, video generation pipelines

Comments: 20 pages, 17 figures, 6 tables. Code is available at https://github.com/iLearn-Lab/DramaDirector

Abstract:Short dramas, with their rapid shot rhythms, dialogue-driven focus shifts, and demanding cinematographic grounding, pose challenges that prompt-level or text-only video generation pipelines struggle to meet. We study plot-to-short-drama generation, where a global plot and local context are transformed into visually grounded multi-shot videos. We propose DramaDirector, a geometry-grounded framework that lets the planner borrow cinematographic geometry from a gallery of real short-drama shots indexed by depth and pose. DramaDirector decouples each shot into static visual and dynamic narrative conditions, trains the planner with schema-constrained SFT and GRPO under a learned text-visual alignment reward, and retrieves depth-pose references to guide first-frame generation and image-to-video synthesis. We also introduce DramaBoard, a benchmark built from 35 live-action dramas, 2.8K episodes, and 81K shots, with structured storyboards and multi-dimensional evaluation protocols. Experiments show that DramaDirector improves over representative multi-agent and video generation baselines on faithfulness, consistency, and controllability. Our code is released at: this https URL

97. 【2606.24101】NavWM: A Unified Navigation World Model for Foresight-Driven Planning

Link: https://arxiv.org/abs/2606.24101

Authors: Yanghong Mei, Longteng Guo, Ming-Ming Yu, Guiyu Zhao, Xingjian He, Jing Liu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional visual navigation, complex environments, Conventional visual, struggle with myopic, myopic decision-making

Comments: 13 pages, 5 figures, accepted to ECCV 2026

Abstract:Conventional visual navigation policies often struggle with myopic decision-making and mode collapse in complex environments. While world models offer a promising alternative, existing paradigms typically isolate perception, generation, and control, failing to capture their shared spatio-temporal dynamics. In this paper, we propose NavWM, a unified navigation world model that seamlessly integrates latent world reasoning, multimodal action prediction, and controllable visual generation. At its core, NavWM leverages latent world tokens to distill geometric and semantic priors, endowing the agent with robust structural understanding. To overcome the limitations of deterministic policies, we introduce an anchor-based multimodal trajectory forecasting framework that generates a diverse action space. This inherent diversity explicitly empowers the generative world model to act as a robust closed-loop planner, utilizing visual foresight to evaluate and select the optimal path. Extensive experiments across diverse robotics datasets demonstrate that NavWM significantly advances the state-of-the-art, delivering remarkable improvements in both high-fidelity future state generation and zero-shot navigation success.

98. 【2606.24096】Beyond Bayer: Task-Optimal Sensor Co-Design for Robust Autonomous-Driving Segmentation

Link: https://arxiv.org/abs/2606.24096

Authors: Reeshad Khan, John Gauch

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Robust perception underpins, underpins autonomous driving, perception underpins autonomous, cooperative multi-agent fusion, Robust perception

Comments:

Abstract:Robust perception underpins autonomous driving, and most recent progress comes from scaling the model: larger backbones, foundation models, and cooperative multi-agent fusion. We pursue a complementary, upstream question: what should the camera itself measure? Using a differentiable RAW-to-task pipeline, we decompose which sensor degrees of freedom benefit dense prediction. Learning the spectral colour-filter-array (CFA) weights is the dominant lever, improving mIoU by +0.017 (KITTI-360) and +0.023 (ACDC) over a fixed camera. In contrast, point-spread-function (optics) co-design is net-negative (-0.020 mIoU on KITTI-360), a consequence of the data-processing inequality, which also bounds the task information that any downstream model, however large or cooperative, can recover. Noise co-optimisation is marginal, and, counter to intuition, enlarging the CFA tile beyond 2x2 consistently hurts, as the filters are confined to the rank-three sRGB input. Because the intervention is at the sensor, the gains are model-agnostic; we validate robustness on ACDC's fog, night, rain, and snow, and conclude with a simple recipe: learn the 2x2 CFA weights and keep an identity PSF.
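
As a rough illustration of what a 2x2 colour-filter-array tile does, the sketch below applies per-position spectral weights to an RGB image to produce a single-channel raw measurement. It is not the paper's differentiable pipeline: the function `mosaic`, the RGGB layout, and the `MIXED_TILE` weights are assumptions chosen for illustration (the mixed weights are made up, not learned values from the paper).

```python
def mosaic(rgb, cfa):
    """Simulate a sensor with a repeating 2x2 colour filter array.

    rgb: H x W nested list of (r, g, b) tuples.
    cfa: 2x2 nested list of (wr, wg, wb) weights, one triple per tile position.
    Returns an H x W nested list of raw scalar measurements.
    """
    return [
        [
            sum(w * c for w, c in zip(cfa[y % 2][x % 2], pixel))
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(rgb)
    ]


# Conventional Bayer tile (RGGB): each position passes exactly one channel.
BAYER_RGGB = [[(1, 0, 0), (0, 1, 0)],
              [(0, 1, 0), (0, 0, 1)]]

# Hypothetical non-Bayer tile mixing channels; the numbers are made up.
MIXED_TILE = [[(0.6, 0.3, 0.1), (0.2, 0.7, 0.1)],
              [(0.1, 0.7, 0.2), (0.1, 0.3, 0.6)]]

flat = [[(0.2, 0.5, 0.8)] * 2 for _ in range(2)]
raw = mosaic(flat, BAYER_RGGB)
```

In a co-design setting the twelve weights of the tile would be trainable parameters optimised jointly with the segmentation network, while this sketch only shows the forward measurement.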

99. 【2606.24094】Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent

Link: https://arxiv.org/abs/2606.24094

Authors: Wenliang Zhong, Rob Barton, Lucas Goncalves, Kushal Kumar, Feng Jiang, Hehuan Ma, Yuzhi Guo, Vidit Bansal, Karim Bouyarmane, Junzhou Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unifying image clustering, remains challenging due, Unifying image, Image Clustering Agent, scenarios remains challenging

Comments: CVPR 2026

Abstract:Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce LLM Traversal based on Minimum Spanning Tree that selectively applies LLM reasoning for complex semantic judgments. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. Our framework consistently outperforms specialized methods across diverse clustering tasks.

100. 【2606.24092】Progressive Pixel-Neighborhood Deformable Cross-Attention for Multispectral Object Detection

Link: https://arxiv.org/abs/2606.24092

Authors: Tian Qiu, Jifeng Shen, Xin Zuo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multispectral object detection, Effective cross-modal feature, object detection, Effective cross-modal, central challenges

Comments: Accepted by Sensors

Abstract:Effective cross-modal feature alignment and interaction are central challenges in multispectral object detection. Although global cross-attention provides strong long-range modeling ability, its quadratic complexity with respect to feature size limits deployment on resource-constrained platforms. We therefore propose Progressive Pixel-Neighborhood Deformable Cross-Attention for multispectral feature fusion, termed PNAFusion. The proposed framework is motivated by two observations: weak misalignment between visible and thermal images is usually concentrated around local neighborhoods, and semantic correspondence across modalities often follows non-linear spatial mappings that fixed receptive fields cannot model well. To address these issues, PNAFusion incorporates local spatial priors into its architectural design to concentrate feature interaction and alignment on the most relevant neighborhoods. Specifically, a Pixel-Neighborhood Cross-Attention (PNCA) module is introduced to avoid redundant global feature matching and suppress background noise. Meanwhile, an Adaptive Deformable Alignment (ADA) module captures non-linear spatial correspondences through learned pixel-wise offsets. These components are further integrated through an iterative feedback mechanism to progressively refine cross-modal feature alignment. Experiments on FLIR, M3FD, and DroneVehicle show that PNAFusion achieves 84.2, 90.5, and 85.5 mAP@0.5, respectively, under the YOLOv5 detector, and further reaches 86.8 mAP@0.5 on FLIR and 90.8 mAP@0.5 on M3FD when transferred to Co-DETR. Efficiency analysis indicates that PNAFusion reduces allocated GPU memory by 33.0% compared with ICAFusion and reduces theoretical FLOPs from 194.8 G to 156.4 G, although the deformable sampling and iterative refinement introduce additional latency. Our code will be available at this https URL.

101. 【2606.24075】End-to-End Radar and Communication Modulation Recognition with Neuromorphic Computing

Link: https://arxiv.org/abs/2606.24075

Authors: Xiaohu Li, Chongxiao Qu, Caiyong Lin, Chenxiao Dou, Wei Hua

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: high computational cost, automatic modulation recognition, deep learning-based methods, computational cost makes, high computational

Comments:

Abstract:Although deep learning-based methods can achieve high accuracy in automatic modulation recognition (AMR) tasks, their high computational cost makes it difficult to strike a balance between accuracy and power consumption, thereby limiting their application on resource-constrained platforms. Neuromorphic architectures that perform spike-driven inference with modest energy budgets have recently been explored for vision and time-series tasks. Motivated by these works, we propose EMRFormer, a novel end-to-end spiking neural network (SNN) architecture that adapts the spike-driven transformer to the constraints of neuromorphic hardware for AMR. The model incorporates an adaptive spike encoder and Integer Leaky Integrate-and-Fire neurons to mitigate the degradation of effective information and enhance SNN representational capacity. By integrating spike-separable Convolution Neural Networks (SSCNN) into Spike-Driven Transformers (SpikeFormer), EMRFormer effectively extracts multi-scale temporal features from the raw IQ waveforms. We validate our approach across various mainstream datasets. The experimental results show that EMRFormer achieves state-of-the-art accuracy, outperforming all the baselines. Furthermore, the model maintains strong performance in low signal-to-noise ratio (SNR) environments and reduces theoretical energy consumption by over 90%. Finally, we evaluate our model on a KA200 neuromorphic chip. The results show that our model achieves up to a 5-fold reduction in power compared to running on a 3090 GPU or an Orin NX. This work demonstrates a promising pathway for AMR on resource-constrained devices.

102. 【2606.24072】Fabric Image Demoiréing Benchmark from Synthesis to Restoration

Link: https://arxiv.org/abs/2606.24072

Authors: Pengchao Wei, Xiaojie Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: camera sensor grids, producing structured interference, degrades image quality, severely degrades image, sampling-induced aliasing artifact

Comments: Accepted to ECCV 2026

Abstract:Fabric moiré is a sampling-induced aliasing artifact caused by the interaction between fine textile patterns and camera sensor grids, producing structured interference that severely degrades image quality. Unlike screen-induced moiré, which stems from strictly periodic display lattices, fabric moiré is intrinsically more challenging due to the broadband and semi-periodic nature of textile weaves. The heavy spectral overlap between intrinsic texture and aliasing components renders fabric demoiréing substantially more ill-posed. Consequently, existing models trained on screen moiré datasets generalize poorly to these complex textile patterns. Despite its practical importance, fabric image demoiréing remains underexplored and lacks standardized benchmarks. We present the first comprehensive benchmark for fabric image demoiréing. To address the difficulty of acquiring pixel-aligned real-world pairs, we develop a physically motivated synthesis framework and construct a large-scale dataset comprising 16,050 paired multi-resolution fabric images with controllable aliasing severity. Furthermore, we customize a baseline model, which establishes promising performance on the proposed benchmark dataset with strong generalization ability. Our benchmark provides a standardized platform for advancing research in fabric image demoiréing.

103. 【2606.24068】ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration

Link: https://arxiv.org/abs/2606.24068

Authors: Taekbeom Lee, Youngseok Jang, Jeonghwa Heo, Jeongjun Choi, H. Jin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: increasingly considered crucial, considered crucial abilities, unfamiliar environments, increasingly considered, considered crucial

Comments:

Abstract:Embodied reasoning and exploration are increasingly considered crucial abilities for robots operating in complex and unfamiliar environments. To accomplish tasks in such settings, an agent must identify and acquire the information necessary for the task through exploration. We propose ObsGraph, an observation-centric hierarchical scene graph that unifies scene representation, retrieval, and exploration. It retains visual evidence and organizes it into room-view-object layers: rooms provide coarse semantic anchors, views preserve contextual object covisibility, and objects store fine-grained details. On top of this representation, we perform coarse-to-fine hierarchical retrieval under a bounded budget, and crucially use retrieval outcomes to structure the exploration candidate space--activating room-level exploration, view refinement, or frontier exploration--thereby tightly coupling representation, retrieval, and adaptive multi-scale exploration. Experiments across embodied reasoning and exploration benchmarks demonstrate improved success and efficiency, highlighting the benefits of structured scene representation and more targeted information gathering driven by identified evidence gaps.

104. 【2606.24059】Ingredient-Level Food Image Segmentation for Nutrition Awareness

Link: https://arxiv.org/abs/2606.24059

Authors: Jonesh Shrestha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: important visual structure, entire image hides, image hides important, hides important visual, assigning one dish

Comments: 5 pages, 4 figures, 4 tables

Abstract:Food images often contain several visible ingredients, so assigning one dish label to an entire image hides important visual structure. This work studies ingredient-level semantic segmentation on FoodSeg103, where the model predicts an ingredient class for each pixel. Two SegFormer variants were fine-tuned and evaluated under a controlled setup: SegFormer-B0 as the smaller baseline model and SegFormer-B1 as the larger final model. Both models use ImageNet-pretrained MiT backbones with newly initialized 104-class output layers. On the held-out FoodSeg103 test split of 2,135 images, B0 achieved 0.7709 pixel accuracy and 0.2521 mean IoU, while B1 achieved 0.7929 pixel accuracy and 0.3204 mean IoU. B1 improved every saved test metric, including a +0.0683 absolute gain in mean IoU. The system also converts predicted masks into visible ingredient-area percentages, giving a simple visual composition summary of the predicted meal. This summary can serve as a first-pass nutrition-awareness cue by providing a visual alternative to detailed food tracking similar to plate-based meal guidance, but it is not a direct estimate of calories, macronutrients, food mass, volume, density, or true portion size.
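
The two quantities reported above, mean IoU and visible ingredient-area percentages, can be computed from a predicted label mask roughly as follows. This is a sketch under assumptions: masks are plain nested lists of class ids, class 0 is treated as background and excluded from the percentages, and the helper names are hypothetical rather than the author's code.

```python
from collections import Counter


def area_percentages(mask, background=0):
    """Share of non-background pixels covered by each predicted class, in percent."""
    counts = Counter(c for row in mask for c in row if c != background)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: 100.0 * n / total for c, n in counts.items()}


def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union over classes present in pred or truth."""
    inter, union = Counter(), Counter()
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatched pixel enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c]]
    return sum(ious) / len(ious)


# Toy 2x3 masks with background (0) and two ingredient classes (1, 2).
pred = [[0, 1, 1],
        [2, 2, 1]]
truth = [[0, 1, 1],
         [2, 1, 1]]
shares = area_percentages(pred)
score = mean_iou(pred, truth, num_classes=3)
```

Here `shares` reports pixel area only, which matches the abstract's caveat that such a summary is not an estimate of mass, volume, or calories.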

105. 【2606.24058】VisChronos: Revolutionizing Image Captioning Through Real-Life Events

Link: https://arxiv.org/abs/2606.24058

Authors: Phuc-Tan Nguyen, Hieu Nguyen, Minh-Triet Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leveraging historical events, natural language understanding, paper aims, aims to bridge, bridge the semantic

Comments: SOICT 2024

Abstract:This paper aims to bridge the semantic gap between visual content and natural language understanding by leveraging historical events in the real world as a source of knowledge for caption generation. We propose VisChronos, a novel framework that utilizes large language models and dense captioning models to identify and describe real-life events from a single input image. Our framework can automatically generate detailed and context-aware event descriptions, enhancing the descriptive quality and contextual relevance of generated captions to address the limitations of traditional methods in capturing contextual narratives. Furthermore, we introduce a new dataset, EventCap (this https URL), specifically constructed using the proposed framework, designed to enhance the model's ability to identify and understand complex events. The user study demonstrates the efficacy of our solution in generating accurate, coherent, and event-focused descriptions, paving the way for future research in event-centric image understanding.

106. 【2606.24057】EPEdit: Redefining Image Editing with Generative AI and User-Centric Design

Link: https://arxiv.org/abs/2606.24057

Authors: Hoang-Phuc Nguyen, Dinh-Khoi Vo, Trong-Le Do, Hai-Dang Nguyen, Tan-Cong Nguyen, Vinh-Tiep Nguyen, Tam V. Nguyen, Khanh-Duy Le, Minh-Triet Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant increase recently, increase recently, significant increase, Photoshop and Capture, Stable Diffusion

Comments: SOICT 2024

Abstract:The demand for image manipulation has seen a significant increase recently. Traditional tools like Photoshop and Capture One, while powerful, require considerable expertise to use effectively. Generative AI has introduced alternative platforms, such as Luminar Neo, Pixlr X, and Canva. However, many of these solutions, including resource-heavy models like Stable Diffusion, often require substantial retraining and fine-tuning, leading to high costs for users. To address these challenges, we introduce Efficient Photo Editor (EPEdit), an application that integrates a robust backend framework with a user-friendly front-end interface. EPEdit supports a wide range of creative image editing tasks, including image generation, object replacement, object removal, background modification, changes in object pose or perspective, region-specific editing, and thematic collection design, all guided by masks and prompts. Users can interact with the system through simple text commands or by marking areas for precise adjustments, making it accessible even to those without technical expertise. At its core, EPEdit leverages zero-shot image editing algorithms based on the Stable Diffusion model, removing the need for additional fine-tuning. This approach enables efficient image manipulation and thematic collection creation. User evaluations of image editing, thematic design, and overall system performance demonstrate that EPEdit outperforms existing solutions, offering a user-friendly, cost-effective solution for comprehensive image editing.

107. 【2606.24051】DriveStack-VLA: Render-Teacher Alignment for BEV-Based DeepStack Vision-Language-Action Model

Link: https://arxiv.org/abs/2606.24051

Authors: Jingke Wang, Zhenru Zhao, Shuangming Lei, Hao Su, Yuehao Huang, Yijia Xie, Kai Tang, Guanglin Xu, AiXue Ye, Yukai Ma, Yong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: follow language guidances, pretrained Vision-Language Model, driving models convert, VLA driving models, Large Language Model

Comments:

Abstract:Vision-Language-Action (VLA) driving models convert a pretrained Vision-Language Model into a driving policy, allowing them to use world knowledge and follow language guidance. However, existing VLA driving models still lack driving-oriented spatial intelligence: their policies are mainly grounded on perspective image tokens and language priors, while precise motion planning requires metric geometry, top-down scene structure, and attention to safety-critical perceptual cues. This limitation makes current models vulnerable to weak visual geometry modeling and perceptual coverage in expert demonstrations. In this paper, we present DriveStack-VLA, a framework built upon a large VLM backbone. To strengthen the spatial grounding of VLA driving, we develop dual visual modeling components. We inject a Bird's-Eye-View (BEV) representation into the Large Language Model decoder through a DeepStack-style connection, and propose Render-Teacher Alignment to align the perceptual focus of real images with that of rasterized images. Furthermore, to bridge the gap in multimodal trajectory selection, we introduce a head-based self-critique module that ranks sampled trajectories and conditionally refines the best one. DriveStack-VLA achieves 91.6 PDMS on NAVSIMv1, 91.0 EPDMS on NAVSIMv2 (with the human penalty filter enabled), and a driving score of 79.49 with a success rate of 56.36% on the closed-loop Bench2Drive. More visualizations are available on our project page: this https URL.

108. 【2606.24021】Token-to-Token Alignment of Text Embeddings for Semantic Blending

Link: https://arxiv.org/abs/2606.24021

Authors: Saar Huberman, Ron Mokady, Or Patashnik, Daniel Cohen-Or

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: prompts, semantic, sequences, structure, token

Comments:

Abstract:In modern generative models, images are specified and controlled through text prompts. In practice, images are generated from sequences of tokens derived from these prompts. However, the space of token sequences lacks a consistent accessible structure: semantically similar images may correspond to sequences that differ in wording, ordering, and placement of concepts, while similar token sequences may encode very different semantics. This apparent lack of structure makes it difficult to perform smooth transitions in this space, hindering applications such as image blending and continuous control of edits. We argue that this limitation stems not from the absence of semantic structure, but from misalignment between representations. To address this misalignment, we introduce Token-to-Token alignment, a framework that establishes explicit semantic correspondence between tokens across prompts. Our approach transforms prompts into a structured representation in which semantically corresponding concepts are mapped to consistent positions across prompts, and then aligns their token embeddings based on semantic similarity. Concretely, the method consists of two stages: a structural alignment that rephrases prompts into a shared structured form, followed by an embedding-level alignment that matches token representations across prompts. With this alignment in place, simple linear interpolation becomes a meaningful operation, producing smooth and coherent semantic transitions and enabling applications such as blending and continuous editing. Our results show that text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned, suggesting that semantic control can be achieved by organizing existing representations rather than modifying the generative model.

109. 【2606.24000】Cyclic Denoising Reveals Ultrastable Memories in Diffusion Models

Link: https://arxiv.org/abs/2606.24000

Authors: Rishabh Sharma, Stefano Martiniani

Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: introduce cyclic denoising, cyclic denoising, repeated forward, forward and reverse, cyclic denoising exposes

Comments: 22 pages, 7 main figures; supplementary material included. Supplementary movies available at the project webpage

Abstract:We introduce cyclic denoising -- repeated forward and reverse diffusion at controlled noise amplitudes -- as an extraction attack for image diffusion models. Inspired by random organization in disordered solids, cyclic denoising exposes regions of the learned distribution that are largely inaccessible to standard sampling. The dynamics drive samples toward attractors with a broad stability spectrum. The deepest attractors are ultrastable: they regenerate after near-total corruption and persist through thousands of noising-denoising cycles. Many of these attractors correspond to memorized training images, including stock photographs, brand watermarks, and web-crawl artifacts. The attack requires only sampler-level control, with no gradients, weight inspection, prompts, captions, or prior knowledge of the training data. Unlike generate-and-filter attacks, which rely on large-scale prompted generation and post-hoc similarity or membership-inference filtering, our main protocol is fully unconditioned. We demonstrate the phenomenon in Stable Diffusion v1.4 and in a pixel-space DDPM, showing consistent behavior across latent- and pixel-space diffusion models. Across noise amplitudes, we observe a yielding-like transition: low-amplitude cycling produces trivial absorbing fixed points or limit cycles, while larger amplitudes induce rearrangements, basin hopping, and long-lived trapping in structured memorized attractor basins. We also observe hierarchical partial absorption, prompt-stabilized basins, and cross-initial-condition universality of the recovered attractor set. Our results therefore show that cyclic denoising is both a physics-inspired probe of generative landscapes and a practical tool for memorization auditing, with implications for privacy, copyright compliance, and model fingerprinting.

110. 【2606.23964】3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy

Link: https://arxiv.org/abs/2606.23964

Authors: Amirhossein Kardoost, Lion Gleiter, Tingying Peng, Carsten Marr

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: inherently three-dimensional nature, nature of cells, Self-supervised learning, inherently three-dimensional, three-dimensional nature

Comments: Accepted at MICCAI 2026. Code available at: [this https URL](https://github.com/marrlab/mae3d-opencell)

Abstract:Self-supervised learning in fluorescence microscopy often relies on 2D projections, despite the inherently three-dimensional nature of cells. We present a systematic comparison of 2D and 3D masked autoencoders (MAE-2D vs. MAE-3D) on volumetric microscopy data. Under matched architectures and training protocols, MAE-3D consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks. We further align visual representations with a pretrained protein language model (ESM2) and show that cross-modal supervision yields larger gains for volumetric models. Channel cross-attention and frequency-domain regularization are critical for leveraging 3D spatial context. On a protein-protein interaction task, MAE-3D achieves a ROC-AUC of 0.865, outperforming prior methods by up to +0.025. For protein localization, our best 3D model attains state-of-the-art AUC$_{\text{micro}}$ (0.952) and F1$_{\text{micro}}$ (0.742), improving over previous approaches by +0.003 and +0.010 absolute, respectively. Overall, these results demonstrate the advantages of native 3D modeling and multimodal alignment for representation learning in single-cell microscopy.

111. 【2606.23950】DivRL: Disentangled Self-Similarity Rewards for Diverse Subject-Driven Generation

Link: https://arxiv.org/abs/2606.23950

Authors: Qian Wang, Zhenyu Li, Abdelrahman Eldesokey, Peter Wonka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Subject-driven image generation, Identity-Diversity Paradox, image generation faces, Visual Semantic Matching, low-diversity outputs

Comments: Accepted to ECCV 2026. Project page: [this https URL](https://qianwangx.github.io/DivRL/)

Abstract:Subject-driven image generation faces an "Identity-Diversity Paradox", where strong identity preservation often leads to rigid and low-diversity outputs. We propose a post-training framework called DivRL that jointly optimizes identity consistency and structural diversity by leveraging disentangled visual features from a robust similarity model. Specifically, we introduce a Negative Self-Similarity Measure (nSSM) to quantify structural diversity, and Visual Semantic Matching (VSM) to evaluate identity consistency. We propose an "Explore-and-Suppress" strategy that treats VSM as a gated constraint: the model freely explores structurally diverse configurations, and only samples that violate the identity threshold are penalized via a quadratic hinge loss. This converts identity preservation from a competing objective into a feasibility constraint, allowing nSSM and VSM to improve jointly. Experiments demonstrate that this gated optimization formulation pushes the model to generate images that are both consistent and diverse, improving structural diversity while maintaining comparable identity consistency.
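
The gated constraint described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the threshold of 0.8, the penalty weight of 10.0, and the convention that a higher nSSM means more structural diversity are all assumptions made for illustration.

```python
def gated_reward(nssm: float, vsm: float,
                 identity_threshold: float = 0.8,
                 penalty_weight: float = 10.0) -> float:
    """Diversity reward with a quadratic hinge penalty on identity violations.

    All names and default values here are hypothetical; the abstract only
    states that samples below an identity threshold are penalized with a
    quadratic hinge loss.
    """
    # The hinge is zero whenever identity consistency meets the threshold,
    # so feasible samples are rewarded purely for diversity.
    violation = max(0.0, identity_threshold - vsm)
    return nssm - penalty_weight * violation ** 2


# A sample that keeps identity is scored on diversity alone.
print(gated_reward(nssm=0.5, vsm=0.9))
# A sample that drifts below the threshold is penalized quadratically.
print(gated_reward(nssm=0.5, vsm=0.6))
```

Because the penalty vanishes inside the feasible region, identity acts as a constraint rather than as a term that always trades off against diversity, which is the point the abstract makes.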

112. 【2606.23917】Trustworthy Image Authentication using Forensic Knowledge Graphs

Link: https://arxiv.org/abs/2606.23917

Authors: Tai D. Nguyen, Matthew C. Stamm

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made image falsification, image falsification highly, Advances in generative, demanding trustworthy authentication, trustworthy authentication systems

Comments: Accepted and published at ECCV 2026

Abstract:Advances in generative AI have made image falsification highly realistic, demanding trustworthy authentication systems. Existing forensic detectors can target certain forgery types but lack interpretability, while vision-language models (VLMs) provide explanations but cannot exploit forensic traces for reliable detection. We propose Forensic Knowledge Graphs (FKGs), a unified framework that integrates forensic evidence extraction, structured reasoning, and human-interpretable explanation. Our FKG structure encodes forensic traces along with their causal dependencies and links to scene content. To generate accurate FKGs, we introduce a novel forensic authentication network and an Iterative Context Refinement strategy that guides VLMs to produce faithful, grounded explanations. We also present FKG-50K, a dataset of 50,000 realistic forgeries with ground-truth FKGs. Experiments demonstrate that FKG outperforms both forensic detectors and VLMs in detection, forgery identification and localization, and forensic justification.

113. 【2606.23897】The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

Link: https://arxiv.org/abs/2606.23897

Authors: Ahmad Algadhi, Ahmed Alzuhair, Omar Alkhulaif, Muzammil Behzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large vision-language models, compresses large vision-language, lightweight student models, CLIP into lightweight, matching teacher predictions

Comments:

Abstract:Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.
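
The two ensembling schemes compared in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the abstract does not define how confidence is measured, so the sketch assumes each teacher's weight is proportional to its maximum class probability, and it assumes HM is the harmonic mean of base-class and novel-class accuracy, as is conventional in base-to-novel evaluation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def equal_ensemble(p_a, p_b):
    """Equal-probability ensembling: average the two teachers' distributions."""
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]


def confidence_weighted_ensemble(p_a, p_b):
    """Weight each teacher by its confidence for this sample.

    Assumption: confidence is the maximum class probability.
    """
    c_a, c_b = max(p_a), max(p_b)
    w_a = c_a / (c_a + c_b)
    return [w_a * a + (1 - w_a) * b for a, b in zip(p_a, p_b)]


def harmonic_mean(base_acc, novel_acc):
    """HM of base and novel accuracy (assumed meaning of HM here)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Toy logits: teacher A is confident about class 0, teacher B is unsure.
p_a = softmax([4.0, 0.0, 0.0])
p_b = softmax([0.0, 0.5, 0.0])
print(equal_ensemble(p_a, p_b))
print(confidence_weighted_ensemble(p_a, p_b))
```

In this toy case the confidence-weighted target leans further toward the confident teacher than the plain average does, which is one plausible reason the weighted variant helps most when the two teachers disagree under domain shift.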

114. 【2606.23892】REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs

Link: https://arxiv.org/abs/2606.23892

Authors: Yifei Zhao, Qian Lou, Mengxin Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safety-critical physical systems, Vision-language models, physical systems, decisions or actions, perception-reasoning backbones

Comments: 20 pages, 5 figures. Preprint

Abstract:Vision-language models (VLMs) are increasingly used as perception-reasoning backbones for embodied intelligence in safety-critical physical systems, where perception or reasoning errors can lead to unsafe decisions or actions. Although many red-teaming methods have been developed to probe VLM vulnerabilities, their evaluation remains fragmented across datasets, metrics, and threat models, making direct comparison difficult and obscuring whether observed differences arise from stronger attacks, more vulnerable models, or incompatible evaluation settings. Existing chatbot-centric red-teaming benchmarks mainly standardize jailbreak and content-safety evaluation, but they do not systematically capture physically grounded functional failures or cover red-teaming methods that target physical-world VLMs. This raises the key challenge of comparing diverse attack methods under a unified protocol while targeting the same scenario-specific failures. We introduce REALM, to our knowledge the first unified red-teaming benchmark for physical-world VLMs. REALM integrates 12 red-teaming methods, 3 model-agnostic defenses, and 13 VLMs under a practical black-box threat model with shared datasets and metrics. To align adversarial objectives across attack families, REALM introduces an agentic target-generation pipeline that constructs shared, scenario-specific, and physically grounded attack objectives for each scene, enabling fair comparison of diverse red-teaming methods under aligned adversarial goals. Our evaluation shows that text and typographic injection attacks induce the most failures, multimodal co-optimization yields the strongest visual-perturbation transfer, single-pass attacks approach iterative methods at much lower cost, and model scale alone does not confer adversarial robustness. Code is available at this https URL.

115. 【2606.23885】Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Link: https://arxiv.org/abs/2606.23885

Authors: Davide Caffagni, Alberto Compagnoni, Federico Melis, Sara Sarto, Pier Luigi Dovesi, Mark Granroth-Wilding, Marcella Cornia, Lorenzo Baraldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Large Language Models, Multimodal Large Language, external vision encoder, Large Language, improve Multimodal Large

Comments:

116. 【2606.23881】Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Link: https://arxiv.org/abs/2606.23881

Authors: Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, Question Answering, requires grounding visual, Visual Question

Comments: Accepted by ACL 2026 Findings. Project page: [this https URL](https://github.com/VAN-QIAN/ACL26-IBA/)

117. 【2606.23851】Machine Learning Modeling for Real-Time Melt Pool Monitoring in Laser Powder Bed Fusion Additive Manufacturing: A Hybrid Approach

Link: https://arxiv.org/abs/2606.23851

Authors: Inioluwa Emmanuel, Zhuo Yang, Ho Yeung, Xinyao Zhang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: powder bed fusion, laser powder bed, NIST AMMT platform, bed fusion, work investigates

Comments:

Abstract:This work investigates the implementation of artificial intelligence and machine learning (AI/ML) for real-time monitoring in laser powder bed fusion (LPBF) additive manufacturing. We developed a binary image classification framework for distinguishing normal and abnormal melt pool images using a balanced dataset of 1,200 images collected from nickel superalloy 625 on the NIST AMMT platform. The study evaluates accuracy and inference time based on control requirements and hardware limitations of open-architecture LPBF machines. We benchmark three transfer learning architectures (ResNet50, EfficientNetB0, and MobileNetV2) against two Random Forest approaches: one trained on EfficientNetB0 feature embeddings (hybrid) and one trained on raw pixel features (baseline). Images are stratified into 80/20 train-test splits, with a further 90/10 validation split on the training set, and undergo standardized resizing, normalization, and label-preserving data augmentation to emulate realistic process variability. Each model is evaluated using accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC), along with training time, inference latency, and CPU/GPU usage to capture deployability constraints relevant to factory-floor monitoring. The hybrid EfficientNetB0-plus-Random Forest approach achieves the best performance on the held-out test set, with an F1 score of 0.9451, accuracy of 0.9458, and AUC of 0.9904, while maintaining roughly one-millisecond per-image inference (1.15 ms). In contrast, purely deep learning models exhibit significantly higher inference times with lower accuracy. These results demonstrate that combining pre-trained convolutional features with classical ensemble methods provides a robust, computationally efficient route to real-time melt pool anomaly detection in data-limited additive manufacturing environments.
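
The split sizes implied by this abstract are easy to work out. The sketch below assumes the percentages are applied with integer rounding down and that the 90/10 split is taken from the 80% training portion; the function name is hypothetical, and the throughput line simply converts the reported 1.15 ms latency into images per second.

```python
def split_sizes(total, test_pct=20, val_pct=10):
    """Return (train, val, test) counts for an 80/20 split followed by
    a 90/10 train/validation split of the training portion."""
    n_test = total * test_pct // 100
    train_pool = total - n_test
    n_val = train_pool * val_pct // 100
    return train_pool - n_val, n_val, n_test


n_train, n_val, n_test = split_sizes(1200)
print(n_train, n_val, n_test)  # 864 96 240

# Throughput implied by the reported per-image latency.
latency_ms = 1.15
images_per_second = 1000 / latency_ms
print(round(images_per_second))  # 870
```

With a balanced dataset, stratification would put half of each count in each class, for example 120 normal and 120 abnormal images in the test set.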

118. 【2606.23843】HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

Link: https://arxiv.org/abs/2606.23843

Authors: Hoang-Bao Le, Aiden Durrant, Thai Son Mai, Binh T. Nguyen, Liting Zhou, Cathal Gurrin

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: capture semantic correspondences, natural language, typically pre-trained, datasets to capture, correspondences between visual

Comments:

119. 【2606.23835】ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

Link: https://arxiv.org/abs/2606.23835

Authors: Anindya Mondal, Sauradip Nag, Anjan Dutta

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: benchmark-specific training required, count-faithful image generation, handles object counting, crowd counting, referring-expression counting

Comments: Under review, webpage: [this https URL](https://mondalanindya.github.io/ABACUS/)

Abstract:ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without requiring any benchmark-specific training. Our model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.

120. 【2606.23825】From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection

Link: https://arxiv.org/abs/2606.23825

Authors: Yuhan Rui, Shihan Qiao, Yibin Lou, Mingxi Yu, Yutong Wan, Yanqiao Chen, Dongsheng Hou, Zhen Cao, Athena Zhuoming Zhong, Qi Hao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Efficient small object, small object detection, indiscriminately discard critical, critical high-frequency details, Efficient small

Comments:

Abstract:Efficient small object detection is bottlenecked by the inherent feature scarcity of tiny targets, which is further aggravated by operations of spatial-domain detectors that indiscriminately discard critical high-frequency details. Recovering these fragile cues within the spatial domain is notoriously difficult, as it often requires computationally expensive architectural upscaling that inadvertently amplifies background noise. To bridge this gap, we propose a paradigm shift from spatial to spectral feature processing, introducing a holistic solution with the following contributions: (1) A versatile Frequency-Guided Feature Representation framework that generalizes across diverse detector architectures (both CNN and Transformer-based), offering a robust alternative to spatial-only feature extraction; (2) The unified Decompose-Enhance-Reconstruct (DER) operator, instantiated via three lightweight, plug-and-play modules -- Wavelet-Difference Gate (WDG), Log-Gabor Enhancer (LGE), and Frequency-Driven Head (FDHead) -- to systematically inject frequency-aware modulation into the backbone, neck, and head. This mechanism decouples feature modeling from resolution reduction, capturing discriminative high-frequency components to enable accurate localization with significantly reduced parameter redundancy; (3) Extensive validation on multi-domain benchmarks (VisDrone2019, UAVDT, TinyPerson, DOTAv1) demonstrating consistent gains. Notably, our proposed DERNet series outperforms YOLOv11 models under the same scale while requiring only 1/6 of the parameters, backed by rigorous spectral diagnostics and error decomposition analysis.

121. 【2606.23763】Listening makes Vision Clear for VLMs

Link: https://arxiv.org/abs/2606.23763

Authors: Yiyang Chen, Yixin Tan, Binrui Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent work typically, typically assesses vision, work typically assesses, Recent work, assesses vision

Comments: 18 pages, 3 figures

Abstract:Recent work typically assesses vision-language consistency using attention distributions of answer-side tokens. However, we observe that the highest-attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where language priors from previously generated answer tokens accumulate and mismatch with visual attention. Besides the priors from previous answer tokens, we find that structural tokens, e.g., modality boundary markers, may encompass the entire context and generate high attention to areas unrelated to the target. To avoid these distortions and provide consistency evaluation for large VLMs, we adopt prompt-side semantics and propose Prompt-Vision Token Activation Map (PV-TAM). PV-TAM further incorporates a filter to remove systematic bias induced by modality boundary markers. Unlike traditional methods that evaluate overlap solely through masks while ignoring activation intensity, our metrics leverage the peak distribution of attention to measure the alignment between prompts and visual regions. In experiments, PV-TAM consistently improves both attention-based and IoU-style localization metrics over answer-side baselines on various datasets.

122. 【2606.23743】Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

Link: https://arxiv.org/abs/2606.23743

Authors: Yitong Li, Junsong Chen, Haopeng Li, Haozhe Liu, Jincheng Yu, Ligeng Zhu, Ping Luo, Song Han, Enze Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: increases inference cost, Modern video diffusion, Modern video, Modern, inference cost

Comments:

Abstract:Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective acceleration strategy is highly instance-specific: a recipe that works well for one combination of model, hardware, and inference configuration often does not transfer to another. Different models vary in architecture, numerical sensitivity, and attention concentration patterns. Inference settings differ in spatial and temporal resolution and video duration, while hardware platforms differ in memory hierarchy, supported numerical formats, and kernel throughput. These factors create a large tuning space, making manual performance engineering costly. We present Sol Video Inference Engine, an agentic, native, training-free acceleration framework for video diffusion models. It organizes five broadly applicable techniques, cache, sparse attention, token pruning, quantization, and kernel fusion, into an agentic acceleration stack for instance-specific optimization. For a concrete deployment target defined by a model, hardware platform, and serving configuration, parallel skill agents optimize the implementation of each technique, an agent integrator composes them into a global acceleration stack, and a human validator provides feedback on generation quality. We instantiate this workflow on three video models with different sizes and architectures: 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. With little human effort, the full stack achieves more than 2x end-to-end acceleration while maintaining near-lossless VBench quality, demonstrating the effectiveness of the agent framework for video diffusion acceleration.

123. 【2606.23739】Systematic Exploration of 4-Expert Heterogeneous Mixture-of-Experts via Automated Pipeline Search

Link: https://arxiv.org/abs/2606.23739

Authors: Yashkumar R Lukhi, Harsh Rameshbhai Moradiya, Radu Timofte, Dmitry Ignatov

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Keywords: LEMUR neural network, network dataset ecosystem, neural network dataset, automated large-scale search, LEMUR neural

Comments: 8 pages, 2 figures

Abstract:We present an automated large-scale search pipeline for heterogeneous 4-Expert Mixture-of-Experts (MoE4) architectures within the LEMUR neural network dataset ecosystem. Building on a hand-crafted heterogeneous MoE reference model, we replace manual design with a deterministic code-assembly generator that systematically combines base architecture families drawn from the LEMUR database into MoE4 ensembles, each governed by a convolutional gating network with temperature scaling, mixup augmentation, and cosine-annealed learning rate scheduling. Over a 28-day campaign on an NVIDIA RTX 4090, the pipeline generated 4,463 candidate models across 197 batches, of which 1,021 were evaluated successfully. A critical finding emerged from the campaign: due to alphabetical enumeration via this http URL, the entire explored search space (4.8% of the theoretical 23,751 possible 4-family combinations) is anchored to a single family, AirNet. We characterise this coverage bias precisely, identify the root cause in the generator, and propose a stratified random sampling fix. Within the AirNet anchored scope, ShuffleNet and MobileNetV3 consistently co-produce the highest-accuracy ensembles (mean accuracy up to 0.632), while FractalNet and MNASNet are identified as low-yield families warranting exclusion in future campaigns. The pipeline, analysis artefacts, and corrected generator are released as part of the open-source NNGPT project at this https URL
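
The coverage bias described in this abstract can be reproduced with a small sketch. Several things here are assumptions: the enumeration call (`itertools.combinations` over a sorted list stands in for whatever the elided reference in the abstract points to), every family name except AirNet (placeholders), and the sampling function, which is only one possible reading of "stratified random sampling" and not the authors' corrected generator. The count of 29 families is inferred from the fact that C(29, 4) = 23,751.

```python
import itertools
import math
import random

# Only "AirNet" comes from the abstract; the other 28 names are placeholders.
families = sorted(["AirNet"] + [f"Family{i:02d}" for i in range(1, 29)])
total = math.comb(len(families), 4)
print(total)  # 23751

# Lexicographic enumeration, cut off at the ~4.8% of the space that was explored.
explored = round(0.048 * total)
prefix = list(itertools.islice(itertools.combinations(families, 4), explored))
anchored = all(combo[0] == "AirNet" for combo in prefix)
print(explored, anchored)  # 1140 True


def stratified_sample(names, per_family, seed=0):
    """Draw `per_family` distinct 4-family combinations anchored on each family,
    so every family is guaranteed to appear (hypothetical fix)."""
    rng = random.Random(seed)
    chosen = set()
    for anchor in names:
        others = [n for n in names if n != anchor]
        picked = 0
        while picked < per_family:
            combo = tuple(sorted([anchor] + rng.sample(others, 3)))
            if combo not in chosen:
                chosen.add(combo)
                picked += 1
    return sorted(chosen)


sample = stratified_sample(families, per_family=5)
print(len(sample))  # 145
```

The first C(28, 3) = 3,276 combinations in lexicographic order all start with the alphabetically first family, so any prefix shorter than that never leaves it. Sampling per anchor family spreads the same budget across the whole space.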

124. 【2606.23699】A Geometry-Informed Computer Vision Method for Detecting and Examining Overtaking Vehicles From A Bicycle

Link: https://arxiv.org/abs/2606.23699

Authors: Gandhimathi Padmanaban, Rayane Moustafa, Fred Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: continuous rear-facing video, direct field evidence, produced direct field, dependent on manual, produced direct

Comments: 18 pages, 6 figures, in preparation for journal submission

Abstract:Instrumented bicycle studies have produced direct field evidence on vehicle passing behavior, but extracting overtaking events from continuous rear-facing video has remained dependent on manual, frame-by-frame annotation. This bottleneck constrains sample sizes and limits naturalistic cycling safety research. We present a geometry-informed computer vision pipeline that automates overtaking event detection from a single bicycle-mounted camera without multi-sensor configurations or explicit camera calibration. The system combines RT-DETR object detection with ByteTrack multi-object tracking through a three-stage geometric validation module enforcing bearing angle trend, apparent size growth, and spatial confirmation criteria derived from perspective projection principles. Validated on 315 manually annotated real-world overtaking events from urban roads in Ann Arbor, Michigan, the pipeline achieved 97.8% recall with zero false positives. The system identified overtaking intentions a mean of 2.44 seconds before vehicle passage, with 84.1% of events exceeding the 1.5-second human reaction time threshold, demonstrating feasibility for active cyclist warning. Lateral passing distance measurements from 96 events revealed 33.3% of passes below the 5-foot (152.4 cm) threshold, consistent with non-compliance rates in prior field and self-reported studies. A preliminary calibration-free lateral distance estimation approach using bounding box geometric features achieved mean absolute errors of 13-14 cm under leave-one-out cross-validation, sufficient to distinguish close passes from standard passes for safety categorization. By automating event isolation from consumer-grade footage, the system removes the primary annotation bottleneck of instrumented bicycle research and provides a scalable foundation for vehicle-bicycle interaction analysis across larger datasets and diverse urban environments.

125. 【2511.08065】I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks

Link: https://arxiv.org/abs/2511.08065

Authors: Ruichen Ma, Liwei Meng, Guanchao Qiao, Ning Ning, Yang Liu, Shaogang Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Keywords: Spiking neural networks, highly energy-efficient computing, promise highly energy-efficient, Spiking neural, neural networks

Comments: AAAI-26 Oral

Abstract:Spiking neural networks (SNNs) promise highly energy-efficient computing, but their adoption is hindered by a critical scarcity of event-stream data. This work introduces I2E, an algorithmic framework that resolves this bottleneck by converting static images into high-fidelity event streams. By simulating microsaccadic eye movements with a highly parallelized convolution, I2E achieves a conversion speed over 300x faster than prior methods, uniquely enabling on-the-fly data augmentation for SNN training. The framework's effectiveness is demonstrated on large-scale benchmarks. An SNN trained on the generated I2E-ImageNet dataset achieves a state-of-the-art accuracy of 60.50%. Critically, this work establishes a powerful sim-to-real paradigm where pre-training on synthetic I2E data and fine-tuning on the real-world CIFAR10-DVS dataset yields an unprecedented accuracy of 92.5%. This result validates that synthetic event data can serve as a high-fidelity proxy for real sensor data, bridging a long-standing gap in neuromorphic engineering. By providing a scalable solution to the data problem, I2E offers a foundational toolkit for developing high-performance neuromorphic systems. The open-source algorithm and all generated datasets are provided to accelerate research in the field.

126. 【2606.24390】Female-RHINO: A Real-Time Scanner-Integrated Framework for Automated Quantitative Uterine MRI Analysis and Structured Reporting

Link: https://arxiv.org/abs/2606.24390

Authors: Deepak Bhatia, Saad Ahmad, Smiti Tripathy, Maria Camila Bustos Vivas, Lieselotte Kratzsch, Anika Knupfer, Jordina Aviles Verdera, Susanne Schulz-Heise, Matthias May, Jana Hutter

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: MRI remains challenging, uterine MRI analysis, remains challenging due, uterine MRI remains, automated uterine MRI

Comments:

Abstract:Standardized assessment of uterine MRI remains challenging due to anatomical variability, observer dependence, and the lack of workflow-integrated automated analysis tools. This work presents Female-RHINO: (R)eproductive (H)ealth (I)maging A(N)alysis T(O)ol, a real-time AI-assisted framework for automated quantitative uterine MRI analysis and structured reporting during image acquisition. We present an end-to-end system that integrates inline communication with the MRI scanner and deep learning-based analysis to derive quantitative uterine biomarkers from sagittal T2-weighted pelvic MRI. The framework combines segmentation and anatomical landmark detection models trained and evaluated on more than 500 multi-center datasets spanning diverse protocols, vendors, and patient populations. It performs volumetry, detects and quantifies common incidental findings such as fibroids and Nabothian cysts, and extracts six anatomical landmarks for biometric assessment. Results are compiled into a structured clinician-oriented report with integrated visualizations, without manual interaction. Evaluation on independent retrospective and prospective cohorts demonstrated robust performance across varying acquisition settings. Mean Dice similarity coefficients were 0.82 for the uterus and 0.80 for fibroids, with lower but consistent agreement for Nabothian cysts. Landmark detection achieved a mean radial error of 3.7 mm. End-to-end processing was completed in under 70 seconds, enabling availability of results during the ongoing scan. Prospective deployment yielded immediate, standardized, and reproducible analyses supported by inter-observer agreement. The proposed system enables real-time scanner-integrated AI for automated uterine MRI analysis and reporting, with potential to improve standardization, efficiency, and clinical workflow in pelvic imaging.

127. 【2606.24236】Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web

Link: https://arxiv.org/abs/2606.24236

Authors: Weihao Li, Dianne Cook, Emi Tanaka, Susan VanderPlas, Klaus Ackermann

Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: diagnosing linear models, common approach, approach for diagnosing, diagnosing linear, relies on manual

Comments: Published in the Australian & New Zealand Journal of Statistics

Abstract:Visual assessment of residual plots is a common approach for diagnosing linear models, but it relies on manual evaluation, which does not scale well and can lead to inconsistent decisions across analysts. The lineup protocol, which embeds the observed plot among null plots, can reduce subjectivity but requires even more human effort. In today's data-driven world, such tasks are well suited for automation. We present a new R package that uses a computer vision model to automate the evaluation of residual plots. An accompanying Shiny application is provided for ease of use. Given a sample of residuals, the model predicts a visual signal strength (VSS) and offers supporting information to help analysts assess model fit.

128. 【2606.24168】A Dual Edge Spatial Jacobian Image Graph for Interpretable Diabetic Retinopathy Grading

Link: https://arxiv.org/abs/2606.24168

Authors: Inam Ullah, Imran Razzak, Shoaib Jameel

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Automated diabetic retinopathy, strong predictive performance, Automated diabetic, colour fundus photographs, achieve strong predictive

Comments:

Abstract:Automated diabetic retinopathy (DR) grading from colour fundus photographs can achieve strong predictive performance, but clinical interpretation requires more than an image-level label. It requires understanding how lesion evidence is distributed around retinal vessels and how this evidence relates to quantitative vascular biomarkers. We present a dual-edge spatial-Jacobian image graph for interpretable DR grading. Each fundus image is represented as a graph node with four aligned evidence streams: AutoMorph vessel information ($X_1$), DR-XAI-style lesion evidence maps ($X_2$), a 128-dimensional lesion-based contrastive image embedding ($X_3$), and AutoMorph morphometric biomarkers ($X_4$). The spatial edge branch ($X_{12}$) encodes vessel-lesion geometry, while the Jacobian branch ($X_{34}$) models embedding-biomarker sensitivity. Lightweight two-token attention fuses both edge families into a final image graph. On 2,910 matched non-augmented APTOS images, the full graph achieves 0.8076 accuracy, 0.8312 quadratic weighted kappa, 0.5915 macro-F1, and 0.9330 adjacent-grade accuracy; referable DR reaches 0.9055 accuracy and 0.9711 AUROC. The framework is positioned as an explainable representation-learning tool for lesion-biomarker hypothesis generation, rather than as a deployment-ready clinical classifier. The code is available at this https URL.
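Quadratic weighted kappa and adjacent-grade accuracy are the ordinal metrics quoted above. The sketch below shows how they are commonly computed, assuming integer grades 0..K-1 (five grades for DR). The function names are illustrative and the code is not taken from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """kappa = 1 - sum(w * O) / sum(w * E), with w_ij = (i - j)^2 / (K - 1)^2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    if num_classes < 2:
        raise ValueError("need at least two grades")
    n = len(y_true)
    # Observed confusion matrix: rows are true grades, columns are predicted grades.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independence of the two marginals.
            denominator += weight * true_hist[i] * pred_hist[j] / n
    if denominator == 0:
        # All labels and predictions fall in a single grade; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - numerator / denominator


def adjacent_grade_accuracy(y_true, y_pred):
    """Fraction of predictions within one grade of the reference label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)
```

The quadratic weights penalise a prediction that is two grades off four times as heavily as one that is one grade off, which is why this kappa is preferred over plain accuracy for ordinal grading.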

129. 【2606.23888】E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis

Link: https://arxiv.org/abs/2606.23888

Authors: Sijing Li, Zhongwei Qiu, Zhuoya Wang, Boxiang Yun, Zhenyu Yi, Jianwei Xu, Wenqiao Zhang, Yingda Xia, Ling Zhang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: show great promise, Multimodal Reinforcement Learning, volumetric medical report, Reinforcement Learning, Evidence-driven Multimodal Reinforcement

Comments: 9 pages, 2 figures

Abstract:While Vision-Language Models (VLMs) show great promise in volumetric medical report generation, they frequently suffer from visual hallucinations and a lack of grounding in 3D CT data. Current Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) strategies typically optimize text fidelity alone, essentially rewarding correct diagnoses derived from language priors rather than genuine visual perception. To address this, we propose cross-view aligned Evidence-driven Multimodal Reinforcement Learning (Evidence-MRL, denoted E-MRL), a reliable RL reasoning framework that formulates the generation process as a Markov Decision Process of "diagnosis-localization-verification". Unlike standard approaches, our model is explicitly trained to identify a "key evidence slice" alongside the global diagnostic report, grounding its findings in verifiable visual evidence. Crucially, we introduce a novel cross-view consistency reward, which validates the semantic alignment between the gold-standard report and a local visual re-query of the selected key slice, providing additional rewards for correctly localized reasoning. Experiments on large-scale 3D CT tumor datasets demonstrate that E-MRL significantly reduces hallucinations and improves diagnostic accuracy compared to SFT and RL baselines, offering a clinically interpretable solution for visually grounded tumor analysis.

130. 【2606.23744】Performance and Interpretability of Convolutional, Transformer, and Hybrid Deep Learning Models in Colorectal Histology Classification

Link: https://arxiv.org/abs/2606.23744

Authors: Reza Bozorgpour

Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling automated analysis, computational pathology, enabling automated, important tool, tool in computational

Comments:

Abstract:Deep learning has become an important tool in computational pathology, enabling automated analysis of histopathological images. While convolutional neural networks (CNNs) have traditionally dominated this field, transformer-based and hybrid architectures have recently demonstrated promising performance. However, comprehensive comparisons of these approaches for colorectal histopathology remain limited. This study evaluated twelve ImageNet-pretrained CNN, transformer, and hybrid architectures using the Kather colorectal histopathology dataset containing 5,000 image tiles from eight tissue classes. All models were trained using a standardized transfer-learning and fine-tuning protocol and assessed using multiple performance metrics, including accuracy, precision, sensitivity, specificity, F1-score, ROC-AUC, Cohen's kappa, and Matthews correlation coefficient. All evaluated models achieved high classification performance, with accuracies ranging from 93.2% to 97.1%. EVA-02 achieved the highest overall performance (97.1% accuracy, 97.0% F1-score), closely followed by ViT-B/16. Among CNNs, ResNet34 and ConvNeXt-Tiny demonstrated highly competitive performance, achieving accuracies of 96.4% and 96.3%, respectively. Transformer architectures generally produced the strongest results across evaluation metrics, although the performance gap between the best transformer and CNN models was relatively small. Per-class analysis showed consistently strong classification performance across all tissue categories, with Complex Stroma representing the most challenging class. Overall, transformer-based architectures achieved the highest predictive performance, whereas modern CNNs provided a favorable balance between accuracy and model complexity. These findings provide a comprehensive benchmark of major deep learning paradigms for colorectal histopathology classification.
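Several of the metrics listed in this abstract follow directly from a multi-class confusion matrix. The sketch below shows one common formulation of the multi-class Matthews correlation coefficient and of one-vs-rest sensitivity and specificity. The function names are hypothetical, and the paper's exact averaging scheme is not stated in the abstract.

```python
import math


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def matthews_corrcoef(matrix):
    """Multi-class MCC: (c*s - sum(p_k*t_k)) / sqrt((s^2 - sum p_k^2)(s^2 - sum t_k^2))."""
    k = len(matrix)
    s = sum(sum(row) for row in matrix)                  # total samples
    c = sum(matrix[i][i] for i in range(k))              # correctly classified
    t = [sum(matrix[i]) for i in range(k)]               # true count per class
    p = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # predicted count
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt((s * s - sum(pk * pk for pk in p))
                            * (s * s - sum(tk * tk for tk in t)))
    # By convention, return 0.0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


def sensitivity_specificity(matrix, cls):
    """One-vs-rest sensitivity (recall) and specificity for a single class."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp
    fp = sum(matrix[i][cls] for i in range(k)) - tp
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

For an eight-class problem such as the Kather tiles, per-class values from `sensitivity_specificity` would typically be averaged across classes; whether the study used macro or weighted averaging is not specified here.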