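This post presents the daily list of the latest papers retrieved from arXiv, grouped by category: natural language processing, information retrieval, computer vision, and others.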

Statistics

587 papers were updated today, including:

Natural language processing: 88
Information retrieval: 17
Computer vision: 113

Natural Language Processing

1. 【2606.07515】How reliable are LLMs when it comes to playing dice?

Link: https://arxiv.org/abs/2606.07515

Authors: Luca Avena, Gianmarco Bet, Bernardo Busoni

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Probability (math.PR)

Keywords: controlled benchmarking study, large language models, discrete probability problems, capabilities of large, large language

Abstract:We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.

2. 【2606.07513】Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Link: https://arxiv.org/abs/2606.07513

Authors: Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao

Categories: Computation and Language (cs.CL)

Keywords: social, life, Humans learn, simulated social experience, long-term life simulation

Comments: 79 pages, 19 figures

Abstract:Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long-term growth. In this paper, we study long-term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life-long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long-term life simulation in multi-agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define life reward to mirror human well-being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well-being in simulation, and generalizes to downstream role-playing benchmarks with +15.6% improvement.

3. 【2606.07512】MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Link: https://arxiv.org/abs/2606.07512

Authors: Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Current Vision-Language Models, processing full-length visual, full-length visual sequences, visual sequences induces, sequences induces prohibitive

Abstract:Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

4. 【2606.07502】Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Link: https://arxiv.org/abs/2606.07502

Authors: Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, language models exhibit, models exhibit impressive, exhibit impressive zero-shot

Comments: preprint

Abstract:Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at this https URL.
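
The subspace-filtering idea in this abstract can be illustrated with a small linear-algebra sketch. This is not the authors' EmbedFilter implementation. It assumes, purely for illustration, that the subspace to remove is spanned by the unembedding rows of a hand-picked set of frequent tokens. The helper names (`orthonormal_basis`, `filter_embedding`) and the random stand-in data are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors that are already in the span
            basis.append([wi / norm for wi in w])
    return basis

def filter_embedding(x, basis):
    """Remove the component of x that lies in the span of `basis`."""
    out = list(x)
    for q in basis:
        c = dot(x, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

random.seed(0)
dim = 8
# Assumption: stand-ins for the unembedding rows of three "frequent" tokens.
frequent_rows = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
embedding = [random.gauss(0, 1) for _ in range(dim)]

basis = orthonormal_basis(frequent_rows)
filtered = filter_embedding(embedding, basis)
# Logits of the suppressed tokens after filtering (numerically close to zero).
residual_logits = [dot(filtered, row) for row in frequent_rows]
```

Because the filtered vector lies in the orthogonal complement of the removed subspace, it could in principle be stored in fewer coordinates, which is one way to read the dimensionality-reduction remark in the abstract.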

5. 【2606.07479】Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

Link: https://arxiv.org/abs/2606.07479

Authors: Sercan Karakaş, Yusuf Şimşek

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: light verb constructions, multiword expression processing, partially idiomatic predicate, idiomatic light verb, fully literal verb-object

Comments: Accepted to ACL SRW 2026

Abstract:Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

6. 【2606.07451】TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Link: https://arxiv.org/abs/2606.07451

Authors: Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: diverse tasks due, image-text embedding space, shared image-text embedding, Vision-language models, diverse tasks

Comments: 20 pages, 13 figures, 14 tables

Abstract:Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

7. 【2606.07441】Sycophantic Praise: Evaluating Excessive Praise in Language Models

Link: https://arxiv.org/abs/2606.07441

Authors: Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang

Categories: Computation and Language (cs.CL)

Keywords: Sycophancy in language, comparatively little attention, language models, models is typically, typically studied

Abstract:Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment problem that cannot be reliably measured using current methods. We introduce a parameterized framework that measures whether praise is excessive relative to contribution quality and expected user ability. We show that our framework substantially outperforms generic LLM judges in agreement with human annotations, and that sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings. Together, these findings position praise calibration as a distinct alignment challenge.

8. 【2606.07435】The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Link: https://arxiv.org/abs/2606.07435

Authors: Rishabh Jain, Naomi Harte

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Visual speech recognition, human-like visual speech, speech recognition, surpass human lipreaders, gains establish human-like

Comments: Accepted at INTERSPEECH 2026

Abstract:Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

9. 【2606.07422】The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Link: https://arxiv.org/abs/2606.07422

Authors: Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: answer culturally grounded, Large language models, culturally grounded questions, Large language, cultural knowledge

Abstract:Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
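
For readers unfamiliar with the 1PL (Rasch) item response theory model mentioned in this abstract, its standard textbook form is given below. The abstract does not say how the authors share parameters across question types and query languages, so the symbols here are generic notation rather than the paper's: $\theta_j$ is the ability of model $j$, $b_i$ is the difficulty of item $i$, and $y_{ij} = 1$ means model $j$ answered item $i$ correctly.

```latex
P(y_{ij} = 1 \mid \theta_j, b_i) = \frac{1}{1 + \exp\bigl(-(\theta_j - b_i)\bigr)}
```

Under this form, differences in raw accuracy between two settings can come from ability $\theta_j$ or from item difficulty $b_i$. That distinction is what lets such an analysis separate proficiency from knowledge access.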

10. 【2606.07402】M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Link: https://arxiv.org/abs/2606.07402

Authors: Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou

Categories: Computation and Language (cs.CL)

Keywords: existing benchmarks assume, concealed user information, Language agents, authentic multimodal file, straightforward content

Abstract:Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

11. 【2606.07379】Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Link: https://arxiv.org/abs/2606.07379

Authors: Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)

Keywords: growing failure mode, achieve high evaluation, producing deceptive performance, producing deceptive, high evaluation scores

Abstract:A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
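
The claim that "scores substantially above the cap are implausible" can be made concrete with a simple tail-probability check. The sketch below is an illustration under stated assumptions, not the CapCode procedure: it assumes each randomized test is passed independently with probability at most `cap` by an honest solution, and the function names and the `alpha` threshold are hypothetical.

```python
from math import comb

def tail_probability(passed, total, cap):
    """P(X >= passed) for X ~ Binomial(total, cap)."""
    return sum(
        comb(total, k) * cap**k * (1 - cap) ** (total - k)
        for k in range(passed, total + 1)
    )

def looks_like_cheating(passed, total, cap, alpha=0.01):
    """Flag a score that is implausibly far above the cap (assumed threshold)."""
    return tail_probability(passed, total, cap) < alpha

# A perfect score on 100 tests is very unlikely when the honest cap is 0.8,
# whereas 82 out of 100 is well within ordinary sampling variation.
perfect_flagged = looks_like_cheating(passed=100, total=100, cap=0.8)
near_cap_flagged = looks_like_cheating(passed=82, total=100, cap=0.8)
```

Here a score only slightly above the cap is not flagged, because it is consistent with chance. Only scores whose tail probability falls below `alpha` count as evidence of shortcut exploitation.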

12. 【2606.07356】DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

Link: https://arxiv.org/abs/2606.07356

Authors: Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: edit-irrelevant source components, language-specified acoustic content, preserving edit-irrelevant source, Text-guided audio editing, Text-guided audio

Abstract:Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.

13. 【2606.07342】LLM-Guided Evolution for Medical Decision Pipelines

Link: https://arxiv.org/abs/2606.07342

Authors: Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov

Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: Adapting large language, requires costly fine-tuning, large language models, Adapting large, pipeline engineering

Abstract:Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at this https URL. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from $77.3\%$ to $87.1\%$ and emergency recall from $0.60$ to $0.97$, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy--cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.

14. 【2606.07313】SV-Detect: AI-generated Text Detection with Steering Vectors

Link: https://arxiv.org/abs/2606.07313

Authors: Mikhail Vishnyakov, Tatiana Gaintseva

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Detecting machine-generated text, Detecting machine-generated, editing attacks, source models, frozen language model

Abstract:Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
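
The pipeline described in this abstract (one direction per layer, then layer-wise projections as features) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-layer direction is the normalized difference of class means, which is one common way to build a steering vector, and it uses synthetic random vectors in place of real hidden states. All function names are hypothetical.

```python
import random

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def layer_direction(human_states, machine_states):
    """Unit vector pointing from the human mean to the machine mean."""
    h, m = mean_vector(human_states), mean_vector(machine_states)
    diff = [mi - hi for mi, hi in zip(m, h)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def projection_features(states_per_layer, directions):
    """One scalar per layer: alignment of the input with that layer's direction."""
    return [
        sum(s * d for s, d in zip(state, direction))
        for state, direction in zip(states_per_layer, directions)
    ]

random.seed(0)
num_layers, dim, n = 3, 6, 50

def synthetic_states(shift):
    # Stand-in for frozen-LM hidden states, indexed as [layer][example][dim].
    return [[[random.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]
            for _ in range(num_layers)]

human, machine = synthetic_states(0.0), synthetic_states(1.0)
directions = [layer_direction(human[l], machine[l]) for l in range(num_layers)]

# Feature vectors for one machine-like and one human-like example.
machine_feats = projection_features([machine[l][0] for l in range(num_layers)], directions)
human_feats = projection_features([human[l][0] for l in range(num_layers)], directions)
```

Each input is reduced to `num_layers` scalars, and any lightweight classifier (for example, logistic regression) could then be trained on these features to produce a detection score.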

15. 【2606.07309】Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

Link: https://arxiv.org/abs/2606.07309

Authors: Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Instruction-following audio language, Instruction-following audio, explicit acoustic cues, augmented with explicit, audio language models

Comments: 6 pages, 3 figures, 3 tables

Abstract:Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.

16. 【2606.07300】Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Link: https://arxiv.org/abs/2606.07300

Authors: Xing Yue, Yongliang Shen, Weiming Lu

Categories: Computation and Language (cs.CL)

Keywords: vehicle for thought, intricately tied, large language model, largely overlooking sounds, LLMs' phonological understanding

Comments: Accepted to ACL 2026 Main Conference

Abstract:Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.

17. 【2606.07297】SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Link: https://arxiv.org/abs/2606.07297

Authors: Shaoqiu Zhang, Yuhang Wang, Jialiang Liang, Yuling Shi, Wenhao Zeng, Maoquan Wang, Shilin He, Ningyuan Xu, Siyu Ye, Kai Cai, Xiaodong Gu

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Repository-level coding benchmarks, Repository-level coding, SWE-bench have driven, driven a rapid, rapid surge

Comments: 20 pages, 5 figures

Abstract:Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.

18. 【2606.07240】KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Link: https://arxiv.org/abs/2606.07240

Authors: Seymanur Akti, Alexander Waibel

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Cross-lingual voice cloning, voice cloning aims, preserving speaker identity, Cross-lingual voice, Voice Cloning track

Abstract:Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

19. 【2606.07237】When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

Link: https://arxiv.org/abs/2606.07237

Authors: Mahdi Alkaeed

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, clinical question answering, diagnosis support, Language Models

Comments: 12 pages

Abstract:Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.

20. 【2606.07229】MMAE: A Massive Multitask Audio Editing Benchmark

链接：https://arxiv.org/abs/2606.07229

作者：Ziyang Ma,Ruiqi Yan,Ruiyang Xu,Jie Fang,Zhikang Niu,Yi-Wen Chao,Wenming Tu,Tianrui Wang,Auden,Qi Chen,Wenxi Chen,Jiaying Chi,Yanru Huo,Zixuan Jiang,Xiquan Li,Yalin Li,Junxi Liu,Minghao Liu,Binghao Qiang,Yijia Shan,Zheshu Song,Tian Tan,Zixiang Wang,Zeyu Xie,Zhifei Xie,Xiaoyu Xing,Qixiang Xu,Chen Yang,Guanrou Yang,Shan Yang,Yifan Yang,Steve Yves,Haotian Zhang,Haina Zhu,Kai Yu,Liefeng Bo,Eng-Siong Chng,Xie Chen

类目：ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词：Massive Multitask Audio, Massive Multitask, Multitask Audio Editing, general-purpose instruction-based audio, evaluation testbed designed

备注： Open-Source at [this https URL](https://github.com/ddlBoJack/MMAE)

Abstract:We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

21. 【2606.07226】DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

链接：https://arxiv.org/abs/2606.07226

作者：Tongzhou Yu,Mingjia Li,Hong Qian,Wenkai Wang,Zongbao Zhang,Yaoyu Jiang,Xiangfeng Wang,Aimin Zhou,Jiajun Guo

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：critical competency, debate, creativity, scoring, data

备注： Accepted by KDD 2026

Abstract:Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

22. 【2606.07219】Adversarial Creation and Detection of AI-Generated Social Bot Content

链接：https://arxiv.org/abs/2606.07219

作者：Mykola Trokhymovych,Ricardo Baeza-Yates,Alessandro Flammini,Diego Saez-Trumper,Filippo Menczer

类目：Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词：generating human-like content, large language models, convergence of large, large language, manipulate the information

备注：

Abstract:The convergence of large language models and social bots allows malicious actors to manipulate the information ecosystem by generating human-like content at scale. Existing models for detecting AI-generated content often fail in the wild, primarily due to the lack of ground-truth data. We address this gap through an adversarial methodology that models the impersonation of real social media users by malicious actors. Using this methodology, we curate a multilingual, cross-platform dataset of paired human and AI-generated messages. Training on such adversarial data yields accurate detection of AI-generated text. Our approach significantly outperforms existing models for content-based bot detection in real-world, out-of-distribution data.

23. 【2606.07218】HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

链接：https://arxiv.org/abs/2606.07218

作者：Mingyu Zhang,Ying Ma

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：expose answer chains, Multi-hop RAG poses, organize retrieved text, poses a data-engineering, data-engineering problem

备注： Submitted to ICDE 2027. 13 pages, 3 figures

Abstract:Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.

24. 【2606.07190】From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

链接：https://arxiv.org/abs/2606.07190

作者：Yuhang Zhou,Yixin Cao,Guangnan Ye

类目：Computation and Language (cs.CL)

关键词：LLM problem solving, local step correctness, trajectory of LLM, LLM problem, existing process reward

备注：

Abstract:Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at this https URL.
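
To make the idea of prefix gain concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not spell out: the baseline is taken to be the solve rate of the same student models without the prefix, and training pairs are formed by simply ordering prefixes by gain. The helper names `solve_rate`, `prefix_gain`, and `ranking_pairs` are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List, Sequence, Tuple


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled completions that reached a correct final answer."""
    if not outcomes:
        raise ValueError("need at least one sampled completion")
    return sum(outcomes) / len(outcomes)


def prefix_gain(with_prefix: Sequence[bool], without_prefix: Sequence[bool]) -> float:
    """Solve-rate improvement from conditioning on a prefix.

    Assumption: the baseline is the solve rate without any prefix.
    """
    return solve_rate(with_prefix) - solve_rate(without_prefix)


def ranking_pairs(gains: Dict[str, float], margin: float = 0.0) -> List[Tuple[str, str]]:
    """Build (better, worse) prefix pairs for a pairwise ranking objective."""
    return [
        (a, b)
        for a in gains
        for b in gains
        if gains[a] - gains[b] > margin
    ]


# Toy outcomes: True means a student completion solved the problem.
gain = prefix_gain([True, True, False, True], [True, False, False, False])
pairs = ranking_pairs({"p1": 0.5, "p2": -0.25, "p3": 0.5})
```

In this toy run the prefix raises the solve rate from 1/4 to 3/4, and prefixes with equal gain produce no training pair.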

25. 【2606.07183】Geometry of Semantic Space: Comparative Study of Discrete and Continuous Models

链接：https://arxiv.org/abs/2606.07183

作者：Gabriel Bounias,Sabine Ploux

类目：Computation and Language (cs.CL)

关键词：geometry underlying NLP, underlying NLP models, semantic geometry underlying, underlying NLP, NLP models

备注： 9 pages, 7 figures

Abstract:This work examines the semantic geometry underlying NLP models. We compare supervised vector embeddings, such as CamemBERT, with lexical co-occurrence graphs that encode semantic relations more directly. While transformer-based embeddings achieve strong performance, their induced geometries often display unsatisfactory distributions. In contrast, graph-based models reveal a clearer and more human-readable organization of meaning. We have implemented a methodology that allows us to perform a comparative analysis either based on the structure of the graphs or based on the topology of the embeddings induced by these two approaches. The results of the comparison -- applied to the French "Great National Debate" corpus, a collection of citizen contributions to the public debate -- show a similar local topology but a very different overall structure and topology. These findings suggest complementary perspectives between deep supervised models and graph-based models, considering a new pathway to guide neural architectures toward more stable and interpretable convergence with graph structures.

26. 【2606.07172】Textual Supervision Enhances Geospatial Representations in Vision-Language Models

链接：https://arxiv.org/abs/2606.07172

作者：Marcelo Sartori Locatelli,Fernando Tonucci,Jea Kwon,Luiz Felipe Vecchietti,Bryan Nathanael Wijaya,Cheng Yaw Low,Virgilio Almeida,Meeyoung Cha

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：machine learning systems, critical yet underexplored, underexplored dimension, development of machine, systems for tasks

备注： Accepted at ICML 2026

Abstract:Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

27. 【2606.07167】UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

链接：https://arxiv.org/abs/2606.07167

作者：Ahmer Tabassum,Sarfraz Ahmad,Hasan Iqbal,Owais Aijaz,Momina Ahsan,Preslav Nakov

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Meaningful multilingual evaluation, Meaningful multilingual, native Urdu MCQ, Urdu, educational context

备注： 27 pages, 18 figures, 17 tables, Submitted to ARR May 2026

Abstract:Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

28. 【2606.07130】Explicit Evidence Grounding via Structured Inline Citation Generation

链接：https://arxiv.org/abs/2606.07130

作者：Anar Yeginbergen,Amelie Wührl,Anna Rogers,Rodrigo Agerri

类目：Computation and Language (cs.CL)

关键词：widely adopted, demand for factual, faithful generation grows, generation grows, evidence span identification

备注：

Abstract:As AI systems become more widely adopted, the demand for factual and faithful generation grows. Properly attributing information through citations becomes, therefore, crucial. This work introduces FullCite, a framework that, in contrast to most previous works, generates structured inline citations linking each claim to both its source document and supporting evidence. FullCite proposes three strategies for inline citation generation: prompt-based generation, constrained decoding over a citation grammar, and post hoc span alignment. Using three question answering benchmarks, namely, ASQA, BioASQ, and ExpertQA, we assess citation quality and faithfulness along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. Our evaluation shows that while LLMs are generally effective at identifying relevant documents, they struggle to identify the precise supporting spans within them. This gap suggests that achieving faithful attributed QA will require research to place greater emphasis on precise evidence span identification.

29. 【2606.07123】Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings

链接：https://arxiv.org/abs/2606.07123

作者：Amanda Cercas Curry,Lucio La Cava,Luca Maria Aiello,Gianmarco De Francisci Morales

类目：Computation and Language (cs.CL)

关键词：inherently perspectival, varying across annotator, annotator backgrounds, ideological positions, meaning in language

备注：

Abstract:Social meaning in language is inherently perspectival, varying across annotator backgrounds, demographics, and ideological positions. However, most NLP systems collapse this variation into a single ground-truth label, ignoring the diversity of interpretations. In this work, we model social dimensions along a perspectivist spectrum, capturing how interpretations vary across demographic groups on a dataset consisting of 28k human annotations. We benchmark multiple modeling paradigms, including zero-shot, few-shot, and fine-tuned approaches, and propose fusion embeddings that integrate textual and demographic representations. Our fusion models yield consistent and statistically significant improvements over text-only baselines across all fusion strategies (+5.9-6.5% relative macro PR-AUC), with shuffle ablations confirming that demographic profiles carry genuine predictive signal rather than spurious correlations.

30. 【2606.07116】OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

链接：https://arxiv.org/abs/2606.07116

作者：Haoqi Wang,Lorenz K. Mueller,Jiawei Zhuang,Mathieu Salzmann,Lukas Cavigelli

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：significantly reducing computational, reducing computational cost, large language models, memory usage, widely adopted

备注：

Abstract:Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.

31. 【2606.07103】Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

链接：https://arxiv.org/abs/2606.07103

作者：Zhuo Liu,Haozheng Du,Xiangxiang Xu,Hangfeng He

类目：Computation and Language (cs.CL)

关键词：naturally collected data, collected data, naturally collected, lack a systematic, content

备注： 9 pages

Abstract:Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter $\alpha$ as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content ($\alpha=0$) to fully shared content ($\alpha=1$). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as $\alpha$ increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.
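
As a rough illustration of how such an overlap parameter could be computed, the sketch below assumes one plausible reading of "normalized residual of mutual information": $\alpha = 1 - I(C;S)/H(S)$, where $C$ is content identity and $S$ is the style label. This formula is an assumption chosen to match the stated endpoints ($\alpha=0$ when content identity fully determines style, $\alpha=1$ when every content item appears in every style), not the paper's confirmed definition. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy in bits of an empirical count distribution."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def overlap_alpha(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Assumed definition: alpha = 1 - I(content; style) / H(style)."""
    pairs = list(pairs)
    joint = Counter(pairs)
    content = Counter(c for c, _ in pairs)
    style = Counter(s for _, s in pairs)
    h_style = entropy(style.values())
    if h_style == 0:
        raise ValueError("need at least two style classes")
    # I(C;S) = H(C) + H(S) - H(C,S)
    mutual_info = entropy(content.values()) + h_style - entropy(joint.values())
    return 1.0 - mutual_info / h_style


# Each pair is (content id, style label), e.g. (verse id, translation name).
disjoint = overlap_alpha([("v1", "A"), ("v2", "A"), ("v3", "B"), ("v4", "B")])
shared = overlap_alpha([("v1", "A"), ("v1", "B"), ("v2", "A"), ("v2", "B")])
partial = overlap_alpha([("v1", "A"), ("v1", "B"), ("v2", "A"), ("v3", "B")])
```

Under this assumed definition, the disjoint split gives 0, the fully parallel split gives 1, and the mixed split falls in between.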

32. 【2606.07098】SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

链接：https://arxiv.org/abs/2606.07098

作者：Ernests Lavrinovics,Marco Letizia,Roy Janco,Shai Segal,Johannes Bjerva,Maurizio Pierini

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：based Large Language, Large Language Model, aid truncated Singular, Singular Value Decomposition, Large Language

备注：

Abstract:We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.

33. 【2606.07069】mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

链接：https://arxiv.org/abs/2606.07069

作者：Yerzhan Sapenov,Jaromir Savelka

类目：Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：International Student Assessment, Student Assessment, OECD Programme, Programme for International, International Student

备注：

Abstract:We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages and achieve accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

34. 【2606.07066】Modeling semantic association in self-paced reading with language model embeddings

链接：https://arxiv.org/abs/2606.07066

作者：Sara Møller Østergaard,Kenneth Enevoldsen,Afra Alishahi,Bruno Nicenboim

类目：Computation and Language (cs.CL)

关键词：Semantic association, Semantic, association, important component, self-paced reading

备注：

Abstract:Semantic association between a word and its context has been identified as an important component of reading comprehension, even when word predictability is accounted for. Recent research has highlighted the potential of language model (LM) embeddings to quantify semantic association. Yet, embedding-based semantic association has been operationalized in a myriad of ways. In this study, we use embeddings from LMs to estimate semantic association on a corpus of joint electroencephalography (EEG) and self-paced reading of natural Dutch texts. Semantic association is calculated in ten different implementations that vary the embedding model and context lengths. The effects of semantic association across the different implementations on the N400 and self-paced reading times are examined using Bayesian hierarchical models and Bayes factors. The results show that the choice of embedding model can alter the estimated effect of semantic association on both the N400 and self-paced reading times. Furthermore, the results demonstrate a promising potential of sentence embeddings for capturing semantic association, as only implementations relying on sentence embeddings indicate reliable results of semantic association beyond word predictability on both neural and behavioral measures. Together, these findings highlight the importance of methodological choices in quantifying semantic association.

35. 【2606.07057】Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation

链接：https://arxiv.org/abs/2606.07057

作者：Shamira Venturini,Steffen Kinkel

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：automatically generated keyphrases, generated keyphrases remains, complex challenge, quality of automatically, automatically generated

备注：

Abstract:Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which misalign with how humans judge informativeness and relevance. We introduce Semantic R-Precision (SemR-p), a novel evaluation metric that integrates semantic similarity into the rank-aware R-Precision framework. Designed from a human-centric perspective and inspired by Information Retrieval metrics, SemR-p rewards semantically relevant keyphrases that appear early in the output list. We conducted extensive analyses to assess its semantic sensitivity, ranking awareness, and discriminative power across models and datasets. The results suggest that SemR-p offers a complementary lens for evaluating keyphrase predictions, helping to better reflect user-centred notions of relevance alongside traditional lexical and semantic matching metrics.
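
The general idea of folding semantic similarity into R-Precision can be sketched as follows. This is not the authors' SemR-p formula, which the abstract does not give. It is a simplified illustration that assumes each of the top-R predictions (R being the number of gold keyphrases) earns soft credit equal to its best similarity with any gold keyphrase. The `jaccard` function is a toy stand-in for an embedding-based similarity, and all names are hypothetical.

```python
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for an embedding similarity: token-set Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def semantic_r_precision(
    predicted: Sequence[str],
    gold: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
) -> float:
    """Soft R-Precision: mean best-match similarity over the top-R predictions."""
    r = len(gold)
    if r == 0:
        raise ValueError("need at least one gold keyphrase")
    top_r = predicted[:r]  # only the first R ranked predictions count
    credit = sum(max(sim(p, g) for g in gold) for p in top_r)
    return credit / r  # missing predictions contribute zero credit


gold = ["neural machine translation", "attention"]
good_order = semantic_r_precision(["machine translation", "attention", "dataset"], gold)
bad_order = semantic_r_precision(["dataset", "machine translation", "attention"], gold)
```

The same three predictions score higher when the relevant ones are ranked first, which is the rank-aware behavior the abstract describes. One simplification to note: two predictions may both match the same gold keyphrase and both receive credit.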

36. 【2606.07054】TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

链接：https://arxiv.org/abs/2606.07054

作者：Vijitha Mittapalli,Shreyaa Jayant Dani,Satya Srujana Pilli,Snigdha Ansu,Mohammadreza Teymoorianfard,Franck Dernoncourt,Hongjie Chen,Yu Wang,Ryan A. Rossi,Nesreen K. Ahmed

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词：Autonomous LLM agents, making sabotage difficult, pursue hidden malicious, hidden malicious objectives, individually benign actions

备注：

Abstract:Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.

37. 【2606.07040】Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

链接：https://arxiv.org/abs/2606.07040

作者：Xing Yue,Linjuan Wu,Daoxin Zhang,Yongliang Shen,Weiming Lu

类目：Computation and Language (cs.CL)

关键词：Open-ended reward modeling, Open-ended reward, reward modeling requires, follow subtle, domain-specific preferences

备注： 24 pages, 6 images

Abstract:Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at this https URL.

38. 【2606.07030】Phonetic Error Analysis of Raw Waveform Acoustic Models

链接：https://arxiv.org/abs/2606.07030

作者：Erfan Loweimi,Zhengjun Yue,Andrea Carmantini,Zoran Cvetkovic,Steve Renals,Peter Bell

类目：Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：TIMIT phone recognition, phone error rate, TIMIT phone, phone recognition, waveform acoustic models

备注： INTERSPEECH 2026

Abstract:We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.

39. 【2606.07020】MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

链接：https://arxiv.org/abs/2606.07020

作者：Yilun Liu,Miao Zhang,Shimin Tao,Minggui He,Chunguang Zhao,Chenxin Liu,Li Zhang,Chen Liu,Cheng Qian,Liqun Deng,Xiaojun Meng,Daimeng Wei

类目：Computation and Language (cs.CL)

关键词：landscapes remain metric-rich, necessitating fine-grained multilingual, resulting score landscapes, score landscapes remain, fine-grained multilingual post-evaluation

备注：

Abstract:Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis. However, single LLMs and open-ended agents are easily swamped by the long, noisy diagnostic input, and no reusable taxonomy exists for it. To address this, we propose MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query and 15-language diagnostic set, evaluated on top of a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE further surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.

40. 【2606.07017】The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Link: https://arxiv.org/abs/2606.07017

Authors: Xiaoou Liu,Tiejin Chen,Weibo Li,Xiyang Hu,Hua Wei

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)

Keywords: Foundation model, Foundation model agents, foundation model agent, Markov Decision Process, foundation model community

Comments: 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

Abstract:Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

41. 【2606.07006】RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

Link: https://arxiv.org/abs/2606.07006

Authors: Yongliang Miao,Fengyuan Liu,Wei Shi,Yanguang Liu,Fei Sun,Na Zou,Mengnan Du

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: adapting large language, offline expert demonstrations, Supervised fine-tuning, imitating offline expert, single expert trajectory

Comments:

Abstract:Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at this https URL.

42. 【2606.06994】Principles of Concept Representation in Sentence Encoders

Link: https://arxiv.org/abs/2606.06994

Authors: Isabelle Mohr,John Dujany,Jonathan Souquet,Andre Freitas

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: good concept representations, produce good concept, sentence encoder produce, encoder produce good, makes a sentence

Comments:

Abstract:What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.

43. 【2606.06985】Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

Link: https://arxiv.org/abs/2606.06985

Authors: Tung X. Nguyen,Hieu Minh Truong,Giang-Son Nguyen,Nhu Vo,Wray Buntine,Dung D. Le

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Automatic Speech Recognition, Automatic Speech, challenging for Automatic, Speech Recognition, single utterance

Comments: Accepted at INTERSPEECH 2026

Abstract:Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting a POI detection method from the literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.
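
The combination of a POI-weighted cross-entropy term with a multi-negative contrastive ranking term can be illustrated with a small sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the weight `poi_weight`, the temperature `tau`, the mixing coefficient `lam`, and the softmax-style (InfoNCE-like) form of the ranking loss, which is one common choice and may differ from the authors' formulation.

```python
import math

def poi_weighted_ce(token_log_probs, is_poi, poi_weight=2.0):
    """Weighted mean negative log-likelihood.

    Tokens flagged as points of interest count poi_weight times
    (poi_weight is a hypothetical hyperparameter).
    """
    weights = [poi_weight if flag else 1.0 for flag in is_poi]
    total = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    return total / sum(weights)

def contrastive_ranking_loss(pos_score, neg_scores, tau=1.0):
    """-log softmax of the reference score against all near-miss scores."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

def combined_loss(token_log_probs, is_poi, pos_score, neg_scores, lam=0.5):
    """Anchor objective plus lam times the ranking term (lam is assumed)."""
    anchor = poi_weighted_ce(token_log_probs, is_poi)
    ranking = contrastive_ranking_loss(pos_score, neg_scores)
    return anchor + lam * ranking
```

The ranking term shrinks as the reference transcript's score pulls ahead of the near-miss hypotheses, while the weighted cross-entropy keeps the model anchored to the reference and puts extra weight on the code-switched tokens.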

44. 【2606.06960】Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

Link: https://arxiv.org/abs/2606.06960

Authors: Zihao Deng,Yining Zhu,Leiming Wang,Jingfei Lu,Junbo Wang,Chuncheng Ran,Yu Yang,Dixuan Yang,Jikun Shen

Categories: Computation and Language (cs.CL)

Keywords: assume explicit goals, stable task patterns, Experience-based self-evolution, crucial for LLM, explicit goals

Comments:

Abstract:Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit rewards, where past experience is difficult to reuse and feedback is delayed, noisy, and outcome-level. We introduce FinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news-driven predictions to future excess returns. We further propose Tree-of-Experience (ToE), a structured experience-management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general-purpose experience mechanisms do not consistently outperform no-experience baselines, while ToE achieves stronger overall performance. These results highlight the importance of structured experience management for self-evolving agents in implicit-reward environments.

45. 【2606.06959】OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

Link: https://arxiv.org/abs/2606.06959

Authors: Xinyi Li,Zhen Fang,Yongxin Deng,Jinyuan Luo,Hongnan Ma,Changdae Oh,Zijing Shi,Shanshan Ye,Hanchen Wang,Shu-Lin Chen,Yadan Luo,Mengyue Yang,Sean Du,Sharon Li,Ling Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, reliable deployment, deployment of large, large language, Hallucination detection

Comments: Preprint. Code and data are available at [this https URL](https://github.com/Nellie179/Hallucination-Detection)

Abstract:Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.

46. 【2606.06946】Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

Link: https://arxiv.org/abs/2606.06946

Authors: Gonzalo Mancera,Daniel DeAlcala,Aythami Morales,Julian Fierrez,Ruben Tolosana,Francisco Jurado

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language Processing, recent Large Language, specific Natural Language, Membership Inference Test, Large Language Models

Comments: IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

Abstract:We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples were part of the training data of these adapted models, providing a useful auditing tool for the management of intellectual property and sensitive data. Our analysis explores the relationship between model perplexity and membership status, providing a systematic framework for estimating data exposure in fine-tuned LLMs. We conducted experiments on four models and three benchmark datasets, obtaining precision values in determining if given data were used for training ranging from 0.77 to 0.92, which outperform state-of-the-art baselines and demonstrate the robustness and generality of the proposed method. In general, our findings underscore the potential of LoRA-MINT as an effective and scalable framework for auditing LLMs, improving transparency, and fostering the ethical and responsible deployment of AI and NLP technologies. For the sake of concreteness and current relevance, our discussion and experiments are centered on LoRA-adjusted LLMs, but note that most of the presented methodology is easily applicable for auditing training data given any other technique for adapting LLMs or, more generally, any other domain-adapted AI models.
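
The link between perplexity and membership status mentioned above can be made concrete with a minimal sketch. It assumes per-token natural-log probabilities are already available from the fine-tuned model and uses a plain threshold rule; the function names and the threshold are hypothetical, and LoRA-MINT's actual decision procedure is not described in the abstract and may be quite different.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def predict_member(token_log_probs, threshold):
    """Flag a sample as a likely training member when its perplexity is low.

    The threshold rule is an illustrative assumption, not the paper's method.
    """
    return perplexity(token_log_probs) < threshold

def precision(predictions, labels):
    """Fraction of samples flagged as members that truly are members."""
    flagged = [y for p, y in zip(predictions, labels) if p]
    return sum(flagged) / len(flagged) if flagged else 0.0
```

The intuition is that text seen during fine-tuning tends to receive higher token probabilities, hence lower perplexity, than unseen text; the precision figure reported in the abstract is the kind of quantity `precision` computes over a labeled audit set.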

47. 【2606.06942】Didact: A Cross-Domain Capability Discovery System for Defence

Link: https://arxiv.org/abs/2606.06942

Authors: Aarya Bodhankar,Aditya Joshi,Bao Gia Doan,Thomas Marchant,Oscar Leslie,Flora Salim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: alongside sector priorities, sector priorities relevant, monitor rapidly evolving, research alongside sector, rapidly evolving research

Comments: Under Review at CIKM 2026 (System Demonstration Track)

Abstract:Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:

48. 【2606.06915】ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Link: https://arxiv.org/abs/2606.06915

Authors: Vladislav Smirnov(1),Chieu Nguyen(1),Sergey Senichev(7),Minh Ngoc Ta(1),Ekaterina Fadeeva(2),Artem Vazhentsev(1),Daria Galimzianova(1),Nikolai Rozanov(1 and 3),Viktor Mazanov(6),Jingwei Ni(2),Tianyi Wu(4),Igor Kiselev(5),Mrinmaya Sachan(2),Iryna Gurevych(1),Preslav Nakov(1),Timothy Baldwin(1),Artem Shelmanov(1) ((1) MBZUAI, (2) ETH Zürich, (3) Imperial College London, (4) NUS, (5) Accenture, (6) Innopolis University, (7) Independent Researcher)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language model, improving large language, TTC scaling strategies, allocating additional compute, TTC scaling

Comments:

Abstract:Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
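
Multi-sample generation with verifier-based reranking, the example of test-time compute scaling named above, is often called best-of-N. The sketch below shows the generic pattern only and does not use ThinkBooster's API, which the abstract does not describe; `generate`, `score`, `toy_generate`, and `toy_score` are hypothetical stand-ins for a sampling LLM call and a verifier.

```python
import random

def best_of_n(generate, score, prompt, n=4, seed=0):
    """Sample n candidates, score each one, and return the best.

    Returns (best_candidate, best_score, n); n doubles as a crude cost
    measure, since compute grows linearly with the number of samples.
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score, n

def toy_generate(prompt, rng):
    # Stand-in for an LLM call: a noisy guess at the answer to 17 + 25.
    return str(42 + rng.choice([-1, 0, 0, 1]))

def toy_score(prompt, candidate):
    # Stand-in for a verifier: 1.0 if the candidate is correct, else 0.0.
    return 1.0 if candidate == "42" else 0.0

if __name__ == "__main__":
    print(best_of_n(toy_generate, toy_score, "What is 17 + 25?", n=8))
```

Raising `n` improves the chance that at least one sample passes the verifier but increases cost in proportion, which is the quality-cost trade-off the abstract says is rarely analyzed.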

49. 【2606.06906】EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

Link: https://arxiv.org/abs/2606.06906

Authors: Xiaopeng Yuan,Zebin Wang,Suwen Wang,Zongxin Yang,Haohan Wang,Yushun Dong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Long-context question answering, remains challenging, challenging for smaller, smaller language models, test-time training

Comments: 13 pages, 4 figures, 3 tables

Abstract:Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.
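
Turning selected evidence chunks into a soft attention target over token positions might look like the sketch below. It is a guess at the general shape of the idea, not the paper's construction: the half-open span format, the smoothing constant `eps`, and the use of KL divergence as the alignment loss are all assumptions.

```python
import math

def soft_attention_target(num_tokens, evidence_spans, eps=1e-3):
    """Build a distribution over token positions.

    Positions inside an evidence span get weight 1.0, all others get eps
    (a hypothetical smoothing constant); the result is normalized to sum to 1.
    Spans are half-open intervals [start, end).
    """
    raw = [eps] * num_tokens
    for start, end in evidence_spans:
        for i in range(start, end):
            raw[i] = 1.0
    total = sum(raw)
    return [x / total for x in raw]

def kl_divergence(target, attention):
    """KL(target || attention); terms with zero target mass contribute 0."""
    return sum(t * math.log(t / a) for t, a in zip(target, attention) if t > 0)
```

Minimizing such a divergence with respect to query-side parameters would push attention mass toward the evidence positions while the model still reads the full context, which matches the abstract's point that the context is not replaced by the retrieved chunks.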

50. 【2606.06879】An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

Link: https://arxiv.org/abs/2606.06879

Authors: Carl Lochstampfor,Ayan Roy

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: establishing baseline detection, work introduced COVA, synthetically generated multi-turn, baseline detection benchmarks, introduced COVA

Comments:

Abstract:Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline addressing contamination, label mismatch, stage-direction bleed, and prompt-design failures from the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71% accuracy and 0.7786 macro F1 compared with 78.43% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life-cycle including a 12.7× improvement in label correction rate, from 49.8% to 3.9%, an architectural intervention reducing virtual-kidnapping artifact rates from 67.1% to 46.5%, and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.
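
Because the abstract reports both accuracy and macro F1, a short sketch of how the two are computed may help explain why they differ. This is the standard definition of macro-averaged F1 written out by hand; the `scam` and `benign` labels in the usage note are made up for illustration and are not the dataset's label set.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, with references `["scam", "scam", "benign", "benign"]` and predictions `["scam", "benign", "benign", "benign"]`, accuracy is 0.75 while macro F1 is lower, because the missed `scam` case pulls down that class's F1 and every class counts equally in the average.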

51. 【2606.06865】Are Large Language Models Suitable for Graph Computation? Progress and Prospects

Link: https://arxiv.org/abs/2606.06865

Authors: Yuting Zhang,Yi Han,Kai Wang,Wei Ni,Angela Bonifati,Wenjie Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, algorithmic operations, increasingly explored, structured relationships

Comments:

Abstract:Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.

52. 【2606.06857】Interpreting Brain Responses to Language with Sparse Features from Language Models

Link: https://arxiv.org/abs/2606.06857

Authors: Michael A. Lepori,Kendrick Kay,Greta Tuckute

Categories: Computation and Language (cs.CL)

Keywords: central goal, goal of cognitive, cognitive neuroscience, Sparse Encoding Models, Augmented Sparse Encoding

Comments:

Abstract:A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

53. 【2606.06842】CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

Link: https://arxiv.org/abs/2606.06842

Authors: Chenshuo Pan,Yu Zhao,Jie Zhang,Changzai Pan,Zhenhe Wu,Jiayi Liang,Yujie Mao,Shuangyong Song,Yongxiang Li,Zhongjiang He

Categories: Computation and Language (cs.CL)

Keywords: large language models, reasoning remains challenging, require multi-step inference, language models, remains challenging

Comments: 24 pages, 10 figures

Abstract:Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.

54. 【2606.06840】Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

Link: https://arxiv.org/abs/2606.06840

Authors: Debjyoti Saha Roy,Byron C. Wallace,Javed A. Aslam

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Modern reasoning models, models offer surprisingly, offer surprisingly strong, surprisingly strong zero-shot, strong zero-shot performance

Comments:

Abstract:Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: A broad "shortlisting" of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.

55. 【2606.06835】Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Link: https://arxiv.org/abs/2606.06835

Authors: Pratik Jayarao,Chaitanya Dwivedi,Himanshu Gupta,Neeraj Varshney,Adithya M Devraj,Meet Vadera,Priyanka Nigam,Bing Yin

Categories: Computation and Language (cs.CL)

Keywords: natively requires pretraining, performance gap, requires pretraining, pretraining or fine-tuning, fine-tuning on corpora

Comments: 14 pages main text plus appendix, 7 figures, 11 tables

Abstract:The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model's dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.

56. 【2606.06834】The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

Link: https://arxiv.org/abs/2606.06834

Authors: Chahat Baranwal,Aadtya Baranwal,Lakshya Nitin Tandon

Categories: Computation and Language (cs.CL); Genomics (q-bio.GN)

Keywords: High-grade gliomas integrate, shape synaptogenic gene, synaptogenic gene expression, High-grade gliomas, noncoding elements shape

Comments:

Abstract:High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.

57. 【2606.06825】Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Link: https://arxiv.org/abs/2606.06825

Authors: Shihao Zhang,Xiaoman Wang,Yuan Liu,Yunshi Lan,Weining Qian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: recently shown promise, improving large language, large language models, typically optimize one-shot, single SQL state

Comments:

Abstract:Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from the invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
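
A progressive reward that measures improvement from the initial SQL to the final SQL, together with a latency term that favors earlier correctness, can be sketched as follows. This is only an illustration under strong assumptions: token-level Jaccard overlap stands in for the paper's combined structural and lexical alignment, exact token match stands in for execution-based correctness, the function names are hypothetical, and the execution status reward is omitted.

```python
def lexical_alignment(sql, gold):
    """Token-level Jaccard overlap, a crude stand-in for an alignment score."""
    a, b = set(sql.lower().split()), set(gold.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def progressive_reward(turn_sqls, gold):
    """Alignment gain from the first attempt to the last one."""
    return lexical_alignment(turn_sqls[-1], gold) - lexical_alignment(turn_sqls[0], gold)

def latency_reward(turn_sqls, gold):
    """1/k if turn k (1-based) is the first exact token match, else 0."""
    gold_tokens = gold.lower().split()
    for k, sql in enumerate(turn_sqls, start=1):
        if sql.lower().split() == gold_tokens:
            return 1.0 / k
    return 0.0
```

Unlike a one-shot reward on the final query alone, the progressive term is positive only when refinement actually moved the query closer to the target, and the latency term decays with each extra turn needed to get there.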

58. 【2606.06812】Quantifying Media Representation Dynamics Across 25 Years of News Reporting on Policing-related Deaths

Link: https://arxiv.org/abs/2606.06812

Authors: Farhan Samir, Jappun Dhillon, Meghna Ravikumar, Syed Ishtiaque Ahmed, Vered Shwartz

Categories: Computation and Language (cs.CL)

Keywords: analysis of Canadian, perform the largest, computational analysis, Canadian, Canadian news narratives

Comments: 9 pages, 6 figures. Websci'26

Abstract:We perform the largest known computational analysis of Canadian news narratives about police-involved deaths, spanning 4,000 articles from the last quarter-century. We develop a novel computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing. We find that reporting on police-involved deaths on average features perspectives from state bureaucrats at a rate nearly three times as much as perspectives from other members of the public, including relatives, community members, eyewitnesses, lawyers representing the family, or civil liberties groups. A considerable fraction of articles contain no points of view from civilian actors, though civilian representation has increased in recent years. Qualitatively, we find that state bureaucrats' accounts of these deaths tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The PerspectiveGap framework developed here can be contextualized to other jurisdictions, offering a scalable approach for analyzing how media systems construct narratives around policing and accountability.

59. 【2606.06797】Korean Culture into LLM Alignment: Toward Cultural Coherence

Link: https://arxiv.org/abs/2606.06797

Authors: MinJae Jung, Minwoo Kim

Categories: Computation and Language (cs.CL)

Keywords: Cultural-aspect work, negative target, large language models, Korean, large language

Comments: Accepted to ICML 2026 Workshop on Culture X AI

Abstract:Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is rather than only what it must avoid, and instantiate it for Korean. We design an alignment-data pipeline around a prompt-based LLM seed generator that expands a Korean harm taxonomy, with a Korean-culturally-adapted safe-response policy at its centre: a per-category guideline grounded in Korean legal frameworks, social norms, and interpretive conventions, against which three frontier models each produce a candidate response. DPO fine-tuning on the resulting triplets improves the Korean cultural safe rate across six open-weight LLMs while causing no large degradation on Korean general-capability benchmarks, and qualitative outputs show fine-tuned models naming Korean statutes and institutional procedures and, where appropriate, supplying constructive Korean-context information alongside refusal.

60. 【2606.06794】TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

Link: https://arxiv.org/abs/2606.06794

Authors: Yong-Bin Kang, Anthony McCosker

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: successfully grounds large, Retrieval-augmented generation, grounds large language, large language model, successfully grounds

Comments: 5 pages, 5 figures, CIKM 2026 submission manuscript

Abstract:Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV peer support, responses must also be accessible, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without requiring model fine-tuning. We operationalise tone across four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing. We evaluate TA-RAG through component-level tests using questions derived from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. Results show that the TA-RAG's components improve their targeted communication quality while preserving key content. These findings emphasise that prompt-based tone control is a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

61. 【2606.06788】Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

Link: https://arxiv.org/abs/2606.06788

Authors: Indu Panigrahi, Tal August

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: information seeking tasks, scientific information seeking, Claude Sonnet, increasingly use-centric, real users

Comments: Preprint

Abstract:Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$ participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for $98$ scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction $46\%$ of the time. Our findings hold with increased sample size and alternative complexity levels.

62. 【2606.06781】When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

Link: https://arxiv.org/abs/2606.06781

Authors: Zixian He, Bharath Raahul Murugesan, Patrick Brandt, Yibo Hu

Categories: Computation and Language (cs.CL)

Keywords: faithful coder, High accuracy, High, cs.CL, Abstract

Comments: 14 pages, 3 figures, 11 tables

Abstract:High accuracy does not necessarily make an LLM a faithful coder. This issue matters because many social-science studies rely on expert-written codebooks to turn text into structured data. We study this problem in political event coding, a challenging source-target relation classification task beyond ordinary sentence-level classification, where models must determine what one actor did to another using detailed coding rules. We test whether expert codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. We then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification. However, these predictive gains do not fully translate into behavioral reliability. Models may produce valid labels and recover definitions while still failing behavioral reliability tests under controlled codebook changes. These findings suggest that codebook-guided LLM systems should be evaluated not only by accuracy, but also by whether they preserve the coding logic that makes coded outputs meaningful for social-science research.
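
One of the reliability checks mentioned in this abstract, controlled changes to label names, can be sketched as a rename-consistency test. This is a toy illustration under stated assumptions: the two "coders" below are hypothetical keyword heuristics rather than LLMs, the codebook is invented for the example, and the metric is a simple agreement rate, not the paper's protocol.

```python
from typing import Callable, Dict, List

Codebook = Dict[str, str]  # label name -> definition
Coder = Callable[[str, Codebook], str]


def rename_consistency(code: Coder, texts: List[str],
                       codebook: Codebook, renaming: Dict[str, str]) -> float:
    """Share of texts whose predicted label survives a pure renaming of labels."""
    renamed = {renaming[label]: definition for label, definition in codebook.items()}
    agree = sum(renaming[code(t, codebook)] == code(t, renamed) for t in texts)
    return agree / len(texts)


def definition_coder(text: str, codebook: Codebook) -> str:
    """Toy coder that reads definitions: picks the label with most shared words."""
    words = set(text.lower().split())
    return max(codebook, key=lambda lab: len(words & set(codebook[lab].lower().split())))


def name_coder(text: str, codebook: Codebook) -> str:
    """Toy coder that leans on label names: first label whose name occurs in the text."""
    for label in codebook:
        if label.lower() in text.lower():
            return label
    return next(iter(codebook))  # falls back to the first label


codebook = {
    "PROTEST": "citizens demonstrate or march against the government",
    "AID": "one actor provides money or supplies to another",
}
renaming = {"PROTEST": "L1", "AID": "L2"}
texts = [
    "Thousands march against the government in protest",
    "The agency provides supplies as aid to the region",
]
```

Both toy coders label the original texts identically, yet only the definition-based one keeps its predictions when label names are replaced by neutral identifiers, which mirrors the gap between accuracy and behavioral reliability that the abstract describes.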

63. 【2606.06758】A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

Link: https://arxiv.org/abs/2606.06758

Authors: Haizhou Xia

Categories: Computation and Language (cs.CL)

Keywords: Final-answer accuracy, citation overlap, retrieval-augmented language model, retrieval recall, evidence

Comments: 52 pages, 34 tables, 1 figure

Abstract:Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol--no evidence, full context, retrieved evidence, and oracle-evidence reference--for diagnosing evidence utilization under fixed examples, prompts, score fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families across Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, with 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: controlled synthetic settings primarily expose full-context utilization failures, whereas the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol for separating no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.

64. 【2606.06755】PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

Link: https://arxiv.org/abs/2606.06755

Authors: Shaiv Patel, Kartik Narayan, Vishal Patel

Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)

Keywords: Authorship attribution research, large language models, Authorship attribution, expressive texts, focused on long-form

Comments: 10 pages, 6 figures

Abstract:Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable, and distinctive signal? We introduce PromptPrint, a systematic study of prompt-based identity, the hypothesis that a user's habitual vocabulary, syntax, and discourse patterns form a learnable behavioral biometric. Using 20,680 real prompts from 1,034 users, we establish three key findings. First, lexical representations significantly outperform semantic encoders, supporting the "lexical stability hypothesis": identity is primarily encoded in surface-level word choice rather than abstract intent. Second, stylometric features exhibit a "uniqueness-consistency paradox": users are highly distinctive across the population, yet behaviorally inconsistent across contexts. Third, adversarial analysis reveals a clear vulnerability spectrum: identity signals are robust to minor lexical perturbations but degrade substantially under semantic paraphrasing. Overall, our results demonstrate strong identification performance at scale, establishing prompt-based identity as a viable behavioral biometric. This work introduces a new perspective on user modeling in LLM interactions, with important implications for security and privacy. Data and code will be released upon the acceptance of our work.

65. 【2606.06754】MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

Link: https://arxiv.org/abs/2606.06754

Authors: Ali Keramati, Shiyuan Zhou, Sharad Mehrotra, Mark Warschauer

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: analytic essay scoring, combines multi-agent reasoning, retrieval-augmented grounding, training-free framework, framework for analytic

Comments: 21 pages, 7 figures, 14 tables

Abstract:We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.

66. 【2606.06748】Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

Link: https://arxiv.org/abs/2606.06748

Authors: Jianru Shen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Retrieval-Augmented Generation, large language models, large language, RAG, Generation

Comments: Accepted at the International Conference on Advanced Machine Learning and Data Science; to appear in the IEEE Xplore proceedings

Abstract:Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. We propose Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators. Evaluated on the full question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for hallucinations in Llama-2 models but exhibit systematic reversal in GPT-4, GPT-3.5, and Mistral-7B. This reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.

67. 【2606.06745】When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

Link: https://arxiv.org/abs/2606.06745

Authors: Zhixuan He, Yue Feng

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Reasoning Large Language, Large Language, Language Models, improve problem-solving performance

Comments:

Abstract:Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
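
The response-conditioned gating described in this abstract can be sketched as follows. This is a minimal illustration under explicit assumptions: the paper trains its inhibition controller from paired fast-slow outcomes, whereas the hand-set scoring weights, the `FastEvidence` fields, and the threshold rule below are hypothetical choices made only to show the control flow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FastEvidence:
    confidence: float    # model confidence in the fast answer, in [0, 1]
    logit_margin: float  # gap between top-1 and top-2 logits
    parseable: bool      # whether the fast answer could be parsed
    gen_tokens: int      # generation cost of the fast answer


def inhibition_score(e: FastEvidence) -> float:
    """Hand-set risk score (assumption); higher means the fast answer looks less trustworthy."""
    score = (1.0 - e.confidence) + 0.5 * (1.0 - min(e.logit_margin, 1.0))
    if not e.parseable:
        score += 1.0
    return score + 0.001 * e.gen_tokens


def pick_threshold(val_scores: List[float], budget: float) -> float:
    """Threshold that sends at most `budget` of validation examples to slow
    reasoning, assuming distinct scores (ties could exceed the budget)."""
    ranked = sorted(val_scores, reverse=True)
    allowed = int(budget * len(ranked))  # number of slow calls permitted
    return ranked[allowed - 1] if allowed > 0 else float("inf")


def route(e: FastEvidence, threshold: float) -> str:
    """Release the fast answer, or inhibit it in favor of slow reasoning."""
    return "slow" if inhibition_score(e) >= threshold else "fast"
```

Unlike an input-only router, `route` sees evidence about the specific fast answer, so two identical questions can be handled differently depending on how the fast attempt turned out.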

68. 【2606.06743】HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

Link: https://arxiv.org/abs/2606.06743

Authors: Arjun Gangwar, S Umesh

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, advent of Multimodal

Comments: 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract:The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introduce semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

69. 【2606.06741】OpenSkill: Open-World Self-Evolution for LLM Agents

Link: https://arxiv.org/abs/2606.06741

Authors: Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Self-evolving agents requires, existing approaches assume, usable learning loop, successful trajectories, Self-evolving agents

Comments: 20 pages, 4 figures and 8 tables. Code is available at [this https URL](https://github.com/OpenLAIR/OpenSkill)

Abstract:Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

70. 【2606.06740】Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

Link: https://arxiv.org/abs/2606.06740

Authors: Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: supervised embeddings entangle, Discrete speech units, multi-speaker speech generation, Discrete speech, multilingual multi-speaker speech

Comments: 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract:Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger clusters progressively separating them.

71. 【2606.06738】Modular Monolingual Adaptation using Pretrained Language Models

Link: https://arxiv.org/abs/2606.06738

Authors: Nalin Kumar, Ondřej Dušek

Categories: Computation and Language (cs.CL)

Keywords: Building monolingual language, Building monolingual, languages typically relies, typically relies, Building

Comments: Accepted to ACL 2026 Industry Track

Abstract:Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks -- mask filling, NER, and POS -- shows that our proposed approach improves performance when adapting models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

72. 【2606.06715】Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

Link: https://arxiv.org/abs/2606.06715

Authors: Upasana Chatterjee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: perceived political ideology, Double Machine Learning, perceived political, answer depends, apply Double Machine

Comments: Accepted to ACL SRW 2026

Abstract:We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and finetuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment--ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.

73. 【2606.06712】Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Link: https://arxiv.org/abs/2606.06712

Authors: Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: diffusion language models, Diffusion Language Model, diffusion language, ARLM, OPDLM

Comments:

Abstract:We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

74. 【2606.06708】Signal-Driven Observation for Long-Horizon Web Agents

Link: https://arxiv.org/abs/2606.06708

Authors: Shubham Gaur, Ian Lane

Categories: Computation and Language (cs.CL)

Keywords: causing progressive context, long horizons ingest, horizons ingest raw, progressive context degradation, ingest raw DOM

Comments: 10 pages, 1 figure

Abstract:Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an architectural mistake. Drawing on the insight from Recursive Language Models that querying a document outperforms reading it wholesale, we propose Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and is re-invoked only when a lightweight signal detector fires -- triggered by URL transitions, newly visible interactive elements, action failures, or exogenous browser events. We outline the open problems SDO introduces and call on the community to treat observation compression as a core architectural decision in web agent design.
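
The lightweight signal detector described in this abstract can be sketched as a comparison between consecutive browser snapshots. This is a minimal illustration, not the authors' implementation: the `BrowserSnapshot` type, its fields, and `should_reobserve` are hypothetical names, and the four triggers simply mirror the list given in the abstract.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class BrowserSnapshot:
    url: str
    interactive_ids: FrozenSet[str]  # ids of currently visible interactive elements
    last_action_ok: bool = True      # False if the previous action failed
    exogenous_event: bool = False    # e.g. a dialog or a navigation the agent did not cause


def should_reobserve(prev: Optional[BrowserSnapshot], curr: BrowserSnapshot) -> bool:
    """Return True when the expensive DOM-reading sub-call should be re-invoked."""
    if prev is None:
        return True  # nothing observed yet
    url_changed = curr.url != prev.url
    new_elements = bool(curr.interactive_ids - prev.interactive_ids)
    return url_changed or new_elements or not curr.last_action_ok or curr.exogenous_event
```

With a check like this, the agent would reuse its last compressed observation on steps where nothing relevant changed, which decouples observation frequency from action frequency. Elements that merely disappear do not trigger a new observation in this sketch, which is a design choice made here and not something the abstract specifies.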

75. 【2606.06698】RECAP: Regression Evaluation for Continual Adaptation of Prompts

Link: https://arxiv.org/abs/2606.06698

Authors: Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: agentic systems routinely, systems routinely face, Production agentic systems, routinely face evolving, agentic systems

Comments:

Abstract:Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit this description, leaving almost no room for error in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even after incurring a higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for designing proactive prompt adaptation methods, where the models must remain robust to evolving needs in deployment.

76. 【2606.06679】HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

Link: https://arxiv.org/abs/2606.06679

Authors: Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Hong Kong judgments, Hong Kong Judgment, Hong Kong, received limited attention, Kong Judgment Discourse

Comments:

Abstract:Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at this https URL.

77. 【2606.06674】What Do People Actually Want From AI? Mapping Preference Plurality

Link: https://arxiv.org/abs/2606.06674

Authors: Julia Sepúlveda Coelho, Scott A. Hale

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Human Feedback, Reinforcement Learning, Learning from Human, fine-tuned through Reinforcement, Large Language Models

Comments: Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

Abstract:Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

Cite as: arXiv:2606.06674v1 [cs.CL]

DOI: https://doi.org/10.48550/arXiv.2606.06674 (arXiv-issued DOI via DataCite, pending registration)

Related DOI: https://doi.org/10.1145/3805689.3812398

78. 【2606.06667】The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Link: https://arxiv.org/abs/2606.06667

Authors: Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

Categories: Computation and Language (cs.CL)

Keywords: LLMs' broad over-generalization, remain unclear, mechanisms behind LLMs', LLMs' broad, broad over-generalization

Abstract:The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.

79. 【2606.06646】CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

Link: https://arxiv.org/abs/2606.06646

Authors: Jakub Bąba, Jarosław Chudziak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Formalizing complex reasoning, Formalizing complex, complex reasoning embedded, complex reasoning, computational linguistics

Comments: Accepted for publication in the proceedings of ICCCI 2026

Abstract:Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.

80. 【2606.06635】How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Link: https://arxiv.org/abs/2606.06635

Authors: Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: leave identifiable signatures, emerge through distinct, leave identifiable, language model reasoning, model reasoning emerge

Abstract:Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate that our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.

81. 【2606.06622】UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Link: https://arxiv.org/abs/2606.06622

Authors: Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

Categories: Computation and Language (cs.CL)

Keywords: capture true underlying, true underlying distributions, true underlying, capture true, distributions

Abstract:We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
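
The KS@N idea described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the benchmark's implementation: it uses a two-sample Kolmogorov-Smirnov test with the standard asymptotic critical constant for a 5% level (about 1.358), and the function names `ks_statistic`, `fails_to_reject`, and `ks_at_n` are hypothetical.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (this handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def fails_to_reject(model_samples, truth_samples, c_alpha=1.358):
    """True when the test does not reject at roughly the 5% level.

    c_alpha = 1.358 is the usual asymptotic constant for alpha = 0.05;
    the level actually used by the benchmark is an assumption here.
    """
    n, m = len(model_samples), len(truth_samples)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(model_samples, truth_samples) <= critical

def ks_at_n(problems):
    """Fraction of (model_samples, truth_samples) pairs that are not rejected."""
    return sum(fails_to_reject(m, t) for m, t in problems) / len(problems)
```

A model that collapses to a single plausible answer (for example, always returning 0.5 for a uniform target) produces a large CDF gap and is rejected, which is the failure mode the abstract describes.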

82. 【2606.06614】Re-Centering Humans in LLM Personalization

Link: https://arxiv.org/abs/2606.06614

Authors: Lechen Zhang, Jiarui Liu, Tal August

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: large language models', LLMs personalization abilities, growing interest, language models', large language

Abstract:Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

83. 【2606.06586】Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Link: https://arxiv.org/abs/2606.06586

Authors: Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O'Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp

Categories: Computation and Language (cs.CL)

Keywords: substantial world knowledge, English data encode, encode substantial world, predominantly on English, Large language models

Comments: Under Review at EMNLP 2026

Abstract:Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
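
The "group relative" part of GRPO mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly described formulation in which each sampled answer's reward is standardized against the other samples drawn for the same prompt. The function name and the 0/1 rewards are illustrative assumptions; the abstract does not specify the paper's actual reward design.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    that beat their siblings get a positive learning signal and the rest a
    negative one. eps avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four sampled answers to the same factual question
# asked in a non-English language, rewarded 1.0 when they match the gold fact.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.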

84. 【2606.06533】Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

Link: https://arxiv.org/abs/2606.06533

Authors: Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: scientific understanding, Abstract, training, time-evolving processes shaped, science

Comments: Accepted as an oral at ICML: https://icml.cc/virtual/2026/poster/67142

Abstract:What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

85. 【2606.06464】Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

Link: https://arxiv.org/abs/2606.06464

Authors: Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: causal learning literature, long-standing finding, learning literature, simultaneous presence, presence of multiple

Comments: Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

Abstract:A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

86. 【2606.05510】Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

Link: https://arxiv.org/abs/2606.05510

Authors: Ahmed Alansary, Molham Mohamed, Ali Hamdi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Telehealth systems, timely medical information, increasingly important, important for delivering, delivering accessible

Comments: 6 pages, 3 figures, IMSA2026

Abstract:Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity of medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates a curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach utilizes five large language models, each independently trained under the same curriculum scheme. During inference, all models generate candidate responses, and the most appropriate response is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.

87. 【2601.12359】Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

Link: https://arxiv.org/abs/2601.12359

Authors: Anirudh Sekar, Mrinal Agarwal, Rachel Sharma, Akitsugu Tanaka, Jasmine Zhang, Arjun Damerla, Kevin Zhu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: circumvent alignment safeguards, prompts exploit indirect, adversarial prompts exploit, exploit indirect input, indirect input channels

Comments: Accepted to NeurIPS 2025 Lock-LLM Workshop

Abstract:Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of 3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
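
The drift measurement described in this abstract, cosine similarity between the embeddings of a clean prompt and a suspect prompt, can be sketched as follows. This is a minimal illustration rather than the ZEDD implementation: the embeddings are assumed to come from some external encoder, and the function names and the 0.3 threshold are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

def embedding_drift(clean_vec, suspect_vec):
    """Drift score: 0 for identical directions, 1 for orthogonal embeddings."""
    return 1.0 - cosine_similarity(clean_vec, suspect_vec)

def is_injection(clean_vec, suspect_vec, threshold=0.3):
    """Flag the suspect input when its drift exceeds the (assumed) threshold."""
    return embedding_drift(clean_vec, suspect_vec) > threshold
```

Because the score depends only on embedding vectors, a check of this kind needs no access to the protected model's internals, which matches the zero-shot framing in the abstract.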

88. 【2606.06573】Multiscale POD of Transformer Attention Fields: Scale-Selective Analysis via Morlet Scalogram

Link: https://arxiv.org/abs/2606.06573

Authors: Athanasios Zeris

Categories: Fluid Dynamics (physics.flu-dyn); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)

Keywords: Proper Orthogonal Decomposition, scale-selective Proper Orthogonal, introduce scale-selective Proper, Orthogonal Decomposition, Proper Orthogonal

Comments: 23 pages, 3 figures, 4 tables

Abstract:We introduce scale-selective Proper Orthogonal Decomposition (POD) for transformer attention fields, inspired by the use of POD for extracting energetically dominant modes from turbulent flow ensembles. The Morlet continuous wavelet transform identifies dominant temporal scales in the attention lag structure across a document ensemble; POD then extracts the energetically dominant modes at each scale from the ensemble of attention fields. The resulting modes reveal layer-dependent scale organisation, with early layers emphasising fine scales and later layers shifting toward coarser scales. We define a spectral concentration index from the POD eigenvalue decay rate and show empirically that it differentiates layers by their attention field complexity. By the classical POD optimality theorem, the extracted modes minimise the average L2 reconstruction error over the ensemble (Theorem 1), giving a data-driven effective rank for each layer. The method requires no architectural modification and no linguistic annotations: dominant attention patterns emerge from ensemble statistics alone. The turbulence analogy is structural rather than physical: we borrow ensemble covariance and modal analysis, not fluid dynamics itself.

Information Retrieval

1. 【2606.07502】Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Link: https://arxiv.org/abs/2606.07502

Authors: Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, language models exhibit, models exhibit impressive, exhibit impressive zero-shot

Comments: preprint

2. 【2606.07492】Bradley-Terry Rankings for Recommender Systems Across Dataset Taxonomies

Link: https://arxiv.org/abs/2606.07492

Authors: Ekaterina Grishina, Stepan Kuznetsov, Askar Tsyganov, Ilya Ivanov, Daria Korovaitceva, Margarita Rusanova, Uliana Parkina, Alexander Derevyagin, Evgeny Frolov, Sergey Samsonov, Anton Lysenko

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: sequential structure, recommendation algorithms, ranking, challenging problem, algorithms

Comments: KDD'26

Abstract:The ranking of recommendation algorithms is a challenging problem since model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This drives a demand for a proper methodology for fair comparison between algorithms. Naive aggregation of performance metrics (e.g., averaging NDCG over benchmarks) can yield misleading rankings, undermining practical selection. To address this problem, we introduce a novel, data-driven ranking methodology based on the Bradley-Terry (BT) model. We demonstrate that the obtained ranking depends on key dataset statistics. Additionally, we propose a novel metric for evaluating ranking consistency and demonstrate the robustness of our ranking to incomplete data. Finally, we introduce a dataset-specific methodology for ranking algorithms on unseen datasets without running the models, relying on extensions of the Bradley-Terry framework, including BT trees and BT models with covariates.
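
A plain Bradley-Terry fit, the starting point for the methodology in this abstract, can be sketched with the classic minorization-maximization update. This is a generic illustration, not the paper's procedure (BT trees and covariates are not covered). The function name is hypothetical, and treating `wins[i][j]` as the number of datasets on which algorithm `i` beats algorithm `j` is an assumption made for the example.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is how often item i beat item j. Under the model,
    P(i beats j) = p[i] / (p[i] + p[j]). Each pass applies the standard
    MM update p[i] <- W_i / sum_j n_ij / (p[i] + p[j]) and renormalizes.
    The sketch assumes every item has at least one win, so no strength
    collapses to zero.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(updated)
        p = [x / scale for x in updated]
    return p

# Hypothetical counts for three algorithms compared on six datasets each.
strengths = fit_bradley_terry([[0, 4, 5], [2, 0, 4], [1, 2, 0]])
```

Unlike averaging a metric across benchmarks, the fitted strengths depend only on who beat whom, so a single dataset with an unusually large metric gap cannot dominate the ranking.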

3. 【2606.07454】PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

Link: https://arxiv.org/abs/2606.07454

Authors: Fuqiang Wang, Song Tan, Zheng Guo, Jiaohao Fu, Xinglong Xu, Bihui Yu, Jie Dong, Zheng Sun, Siyuan Li, Jingxuan Wei, Cheng Tan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: fixed candidate set, typically evaluated, evaluated as static, real scientific reading, scientific reading unfolds

Comments: 48 pages, 13 figures, 22 tables

Abstract:Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as a daily, longitudinal process in which interests shift and feedback accumulates. We introduce PaperFlow, a framework that organizes it into three coupled stages: Profiling, which constructs and maintains a structured, inspectable scholarly profile from heterogeneous cold-start evidence; Recommending, which ranks each date-specific paper stream through multi-signal aggregation under a fixed display budget; and Adapting, which updates user state from semantically distinct feedback signals and models interest drift across days. We further define a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden simulated relevance labels under a shared temporal information boundary. The benchmark contains 24 simulated research users, 50 daily paper streams, 1,200 user-day episodes, 20,727 unique papers, and 497,448 episode-paper records. We additionally specify a blind human-evaluation protocol to validate alignment between automatic metrics and expert judgments. Experiments against five scientific recommendation baselines show that PaperFlow achieves the strongest oracle-based ranking, the highest behavioral alignment with simulated reading selections, and the best blind human-evaluation score.

4. 【2606.07317】Gated Bidirectional Linear Attention for Generative Retrieval

Link: https://arxiv.org/abs/2606.07317

Authors: Artem Matveev, Vladislav Tytskiy, Sergei Makeev, Sergei Liamaev

Categories: Information Retrieval (cs.IR)

Keywords: generates recommended items, generative retrieval typically, recommender systems, encoder-decoder setup, recommended items

Comments: 5 pages, 2 figures, 7 tables. Accepted at SIGIR 2026

Abstract:In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches bidirectional self-attention quality. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup at a history length of 32768, compared to FlashAttention-v3. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.
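
The linear-time property that GBLA builds on can be seen in a bare kernelized bidirectional attention sketch. This is not GBLA itself: the Conv1D mixing, key gating, and gated RMSNorm described in the abstract are omitted, and the feature map `phi(x) = elu(x) + 1` is an assumed, commonly used choice. All names are hypothetical.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise (an assumption)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Bidirectional kernelized attention in time linear in sequence length.

    The key/value summaries S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)
    are built once over the whole sequence and then reused for every query.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d_k))
        out.append(
            [sum(fq[a] * S[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return out

def quadratic_reference(Q, K, V):
    """Same result computed pairwise, with cost quadratic in sequence length."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(weights)
        out.append(
            [sum(w * v[b] for w, v in zip(weights, V)) / total
             for b in range(len(V[0]))]
        )
    return out
```

Both functions return the same values; the difference is that the first never forms the full query-by-key weight matrix, which is why encoder cost stops growing quadratically with history length.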

Cite as: arXiv:2606.07317v1 [cs.IR]

DOI: https://doi.org/10.48550/arXiv.2606.07317 (arXiv-issued DOI via DataCite, pending registration)

Related DOI: https://doi.org/10.1145/3805712.3808495

5. 【2606.07252】Constrained Dominant Sets for Multimodal Document Question Answering

Link: https://arxiv.org/abs/2606.07252

Authors: Ambuj Mehrish, Sebatiano Vascon

Categories: Information Retrieval (cs.IR)

Keywords: Long multimodal document, document question answering, Long multimodal, multimodal document question, Constrained Dominant Set

Abstract:Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity-based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the need for the manually tuned trade-offs required by diversity-aware selectors. Third, the selection process achieves a global equilibrium via replicator dynamics, thereby avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and does not require training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ on MMLongBench-Doc.

6. 【2606.07235】FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document QA

Link: https://arxiv.org/abs/2606.07235

Authors: Ambuj Mehrish, Sebatiano Vascon

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: documents force retrieval-augmented, force retrieval-augmented systems, long table, multimodal documents force, slides broken

Abstract:Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides: broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-$k$ chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FLOWREADER, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector $h$ controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths, a compact non-redundant subset is selected by entropy-regularized replicator dynamics, and parallel VLM workers under a dual-process gate produce the answer with a single System-2 refinement pass triggered when answer consistency is low or the routed flow is strained. On VisDoMBench, FLOWREADER is best on the two subsets dominated by fragmented evidence, PaperTab ($58.40$, $+1.30$ over $G^{2}$-Reader) and SlideVQA ($72.93$, $+0.62$), and competitive on SPIQA, FetaTab, and SciGraphQA. Macro-averaged across all five subsets, FLOWREADER ($65.47$) is within $0.74$ of the strongest baseline ($G^{2}$-Reader, $66.21$). Overall, these results show that min-cost flow performs well on fragmented multimodal evidence, where top-$k$ retrieval fails. It also provides a unified way to control scoring, routing, selection, and adaptive compute together.

7. 【2606.07218】HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

Link: https://arxiv.org/abs/2606.07218

Authors: Mingyu Zhang, Ying Ma

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: expose answer chains, Multi-hop RAG poses, organize retrieved text, poses a data-engineering, data-engineering problem

Comments: Submitted to ICDE 2027. 13 pages, 3 figures

8. 【2606.07187】RISE: A Rust Library for Inverted Index Search Engines

Link: https://arxiv.org/abs/2606.07187

Authors: Angelo Savino, Rossano Venturini

Categories: Information Retrieval (cs.IR)

Keywords: large text corpora, crucial data structure, text corpora, crucial data, data structure

Abstract:Inverted indexes are a crucial data structure for efficient information retrieval in large text corpora. They enable fast full-text search by mapping each term to the documents in which it appears, on top of which efficient algorithms quickly retrieve the documents relevant to a user query. We present RISE, a novel inverted index library implemented in Rust, designed to deliver high performance and efficiency for information retrieval tasks. RISE leverages Rust's safety and performance to provide a robust solution for building and querying inverted indexes, while offering accessible extensibility through its expressive trait system. While developing RISE, we revisited the inverted-index literature, thereby reproducing numerous prior works using this new test bench. We evaluated RISE against existing libraries, demonstrating competitive query performance across various datasets and workloads, with speedups of up to 2x over the current state of the art. Our results indicate that RISE is a promising tool for researchers and practitioners in the field of information retrieval.
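
The term-to-documents mapping described in this abstract can be illustrated with a toy inverted index. This sketch is written in Python for brevity and does not reflect the RISE API or its Rust traits; whitespace tokenization, the function names, and the sample documents are all assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to a sorted list of ids of docs containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # ids are appended in increasing order
    return index

def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start from the shortest posting list so intermediate results stay small.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result

docs = ["rust inverted index", "inverted index search", "rust search engine"]
index = build_index(docs)
```

Production libraries add posting-list compression and dynamic pruning on top of this structure, which is where most of the performance differences between implementations come from.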

9. 【2606.07075】Beyond Matching: Category-Guided Latent Intent Reasoning for Generative Retrieval in E-Commerce

链接：https://arxiv.org/abs/2606.07075

作者：Fuwei Zhang,Xiaoyu Liu,Jiajie Jin,Jiale Mao,Wei Chen,Dongbo Xi,Yifan Yang,Peng Yan,Zichao Hao,Zhao Zhang,Fuzhen Zhuang

类目：Information Retrieval (cs.IR)

关键词：product Semantic Identifiers, mapping user queries, user queries directly, Semantic Identifiers, mapping user

备注：

Abstract:Generative retrieval offers a new paradigm for e-commerce search by mapping user queries directly to product Semantic Identifiers (SIDs). However, e-commerce queries are often short, noisy, attribute-heavy, and associated with multiple category-consistent products, creating a substantial representation gap between natural-language shopping intent and artificially constructed item SIDs. Explicit Chain-of-Thought (CoT) reasoning can help bridge this gap, but its extra generation cost is difficult to reconcile with the low-latency requirements of online e-commerce systems. To address this challenge, we propose CaLIR (Category-guided Latent Intent Reasoning), a category-guided latent intent reasoning framework for e-commerce generative retrieval. Rather than generating explicit textual rationales, CaLIR learns continuous latent intent states before SID decoding and uses product category hierarchies as a natural scaffold for coarse-to-fine intent reasoning. Specifically, we introduce hierarchical semantic reasoning to align latent states with category-level shopping intent, and query-wise reasoning enhancement to model diverse intent paths under multi-positive queries. CaLIR further combines a query-specific dynamic prefix trie, assembled from pre-indexed category-level tries, with reasoning-aware constrained decoding. Experiments on multilingual e-commerce search datasets show that CaLIR achieves a better balance between retrieval effectiveness and inference efficiency than existing methods, while also demonstrating transferability and robustness across induced hierarchies and different generative backbones.

10. 【2606.07071】Decision-Theoretic Stopping Rules for Document Screening

链接：https://arxiv.org/abs/2606.07071

作者：Aaron H.A. Fletcher,Mark Stevenson

类目：Information Retrieval (cs.IR)

关键词：multiple applications, common problem, Deciding, stop reviewing, Perfect Information

备注：

Abstract:Deciding when to stop reviewing the results of a search is a common problem with multiple applications. Existing stopping rules developed within Technology-Assisted Review (TAR) aim to achieve a pre-specified recall target and do not take into account the reason for examining the results, potentially leading to sub-optimal recommendations. This paper applies decision theory to the problem and uses it to derive three practical stopping policies based on the Expected Value of Perfect Information. The approach is applied to two professional search tasks: patent examining and systematic reviewing. Experiments on CLEF-IP and medical systematic review datasets show that the proposed approach generally produces more appropriate stopping decisions than existing methods, as demonstrated by higher net utility under the evaluated cost and payoff settings.

11. 【2606.07057】Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation

链接：https://arxiv.org/abs/2606.07057

作者：Shamira Venturini,Steffen Kinkel

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：automatically generated keyphrases, generated keyphrases remains, complex challenge, quality of automatically, automatically generated

备注：

12. 【2606.06970】SSRLive: Live Streaming Recommendation with Dynamic Semantic ID

链接：https://arxiv.org/abs/2606.06970

作者：Teng Shi,Zhaoheng Li,Yuanhang Qu,Yi Liu,Lixiang Lai,Yuning Jiang

类目：Information Retrieval (cs.IR)

关键词：instant content broadcasting, Live streaming, fastest-growing forms, broadcasting and real-time, real-time engagement

备注：

Abstract:Live streaming has emerged as one of the fastest-growing forms of online media, enabling instant content broadcasting and real-time engagement between users and streamers. Despite the effectiveness of existing recommendation algorithms in this domain, they often suffer from limited utilization of computational resources, with low FLOPs that hinder further performance enhancement. Generative recommendation techniques, which have gained traction in various industrial tasks, offer a promising avenue for improving live streaming recommendations. However, directly applying generative methods to live streaming is non-trivial due to two major challenges: (1) static semantic IDs (SIDs) cannot reflect the rapidly changing nature of live room content; and (2) generative pipelines generally do not incorporate user-streamer interaction signals (e.g., likes, orders), which are critical for modeling user intent toward both the streamer and showcased products. To address these challenges, we introduce SSRLive: Dynamic Semantic ID-guided Streaming Recommendation for Live platforms. The proposed framework integrates a generative module and a discriminative module in a unified architecture. The generative component employs an encoder-decoder design to produce both static and dynamic SIDs, enabling timely representation of live room content while leveraging multimodal information. The discriminative component refines task-specific representations by combining SIDs with user features, augments them with user-streamer interaction data, and performs multi-task predictions. Online A/B tests in real-world deployment demonstrate tangible benefits: watch time (+3.38%), GMV (+0.72%), follower growth (+3.12%), and interaction volume (+2.92%). These improvements highlight the effectiveness and business value of SSRLive, which is now fully deployed, serving hundreds of millions of active users.

13. 【2606.06947】DREAM: Dynamic Refinement of Early Assignment Mappings

链接：https://arxiv.org/abs/2606.06947

作者：Liwei Guan,Huanjie Wang,Hongwei Zhang,Linxun Chen,Zhaojie Liu

类目：Information Retrieval (cs.IR)

关键词：compact token sequences, advances item retrieval, encode item semantics, Semantic IDs, retrieval by reformulating

备注： 12 pages, 4 figures, 5 tables

Abstract:Generative recommendation advances item retrieval by reformulating it as autoregressive generation of Semantic IDs (SIDs), compact token sequences that encode item semantics. While SIDs offer a strong semantic prior, current SID-based methods assign each item a single static identifier through offline tokenization before sufficient user feedback is observed. For cold-start items, this one-shot commitment produces poorly discriminative codes, generating misaligned paths that remain unrefined because the associated tokens are rarely sampled during training. We identify this early static commitment, not model capacity, as the fundamental cold-start bottleneck in SID-based generative recommendation. To overcome this bottleneck and bridge the disjoint objectives of tokenization and generation, we propose DREAM (Dynamic Refinement of Early Assignment Mappings), a three-stage framework that resolves this flaw through progressive refinement. First, an intent-aware tokenizer rebuilds the SID space through counterfactual contrastive learning, generating a diverse pool of behavior-aligned candidates per cold-start item. Second, the frozen recommendation backbone serves as an evaluator, selecting the most reliable candidate based on multi-context user support without retraining. Third, a dynamic beam mechanism maintains multiple weighted SID hypotheses throughout training and inference, preventing premature collapse to a single assignment. Extensive experiments on three Amazon benchmarks show that DREAM substantially outperforms state-of-the-art generative and sequential baselines on cold-start metrics.

14. 【2606.06880】Towards Retrieving Interaction Spaces for Agentic Search

链接：https://arxiv.org/abs/2606.06880

作者：Shengyao Zhuang,Yuansheng Ni,Hengxin Fun,Jimmy Lin,Xueguang Ma

类目：Information Retrieval (cs.IR)

关键词：non-agentic information retrieval, inherited from non-agentic, non-agentic information, retriever ranks, small set

备注：

Abstract:Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the corpus and the agent reads a small set of returned documents. Recent direct corpus interaction (DCI) work shows that agents can instead interact with the raw corpus through shell tools such as grep and file reads. But unbounded interaction does not scale: every broad shell command is a scan over the whole corpus, and latency degrades sharply as the corpus grows. We argue that the role of retrieval for agentic search is not just to select documents that fit in the LLM context window, but to construct an interaction space: a bounded subset of the corpus the agent can explore with associated tools. Two design consequences follow. The space needs a boundary supplied by retrieval, and the objects within it should be processed for interaction. As a proof of concept, we propose RISE (Retrieving Interaction SpacE): we use BM25 to construct the interaction space; meanwhile, its documents are processed during indexing for shell-style navigation. On BrowseComp-Plus, RISE matches the pure-shell DCI baseline at 78% accuracy with gpt-5.4-mini at roughly one quarter of the per-query cost. At 1M documents, RISE-BM25 reaches 81% on gpt-5.4-mini, whereas DCI on gpt-5.4-nano degrades to 60% with 33 of 100 wall-clock failures.

15. 【2606.06794】TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

链接：https://arxiv.org/abs/2606.06794

作者：Yong-Bin Kang,Anthony McCosker

类目：Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：successfully grounds large, Retrieval-augmented generation, grounds large language, large language model, successfully grounds

备注： 5 pages, 5 figures, CIKM 2026 submission manuscript

16. 【2606.06779】Mind the Gap: Bridging Behavioral Silos with LLMs in Multi-Vertical Recommendations

链接：https://arxiv.org/abs/2606.06779

作者：Nimesh Sinha,Raghav Saboo,Martin Wang,Sudeep Das

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词：multi-vertical e-commerce platforms, newer product verticals, multi-vertical e-commerce, e-commerce platforms, newer product

备注：

Abstract:In multi-vertical e-commerce platforms like DoorDash, relatively newer product verticals such as grocery and retail present a significant opportunity for personalization innovation. A key challenge lies in solving the "cold start" problem for users. This paper introduces a novel framework for enhancing recommendation quality by transferring knowledge from data-rich verticals (e.g., restaurants at DoorDash) to data-sparse ones. We leverage Large Language Models (LLMs) to perform generative inference, synthesizing sparse, high-dimensional features that encapsulate latent user affinities. Specifically, we employ a hierarchical Retrieval-Augmented Generation (RAG) pipeline to derive multi-level taxonomic features from user restaurant order histories and search queries. These generated features, encoding both long-term cross-vertical preferences and short-term intent, are integrated into a production Multi-Task Learning (MTL) ranking model. We demonstrate through extensive offline and online evaluation that this approach significantly improves personalization and engagement in emerging business verticals, effectively bridging the behavioral data gap.

17. 【2606.04550】Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations

链接：https://arxiv.org/abs/2606.04550

作者：Noah Lund Syrdal,Anders Vestrum,Jorgen Bergh

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

关键词：recommender systems strongly, systems strongly influence, E-commerce recommender systems, Product Carbon Footprint, recommender systems

备注： 23 pages, 30 figures. Code available at [this https URL](https://github.com/andersvestrum/carbon-aware-recsys)

Abstract:E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.
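
The post-hoc trade-off between relevance and estimated carbon footprint can be sketched as below. The abstract does not give the scoring function, so the linear form `relevance - lam * normalized_carbon` and the min-max normalization are assumptions, and the function name `rerank` and the example items are hypothetical.

```python
def rerank(items, lam):
    """Re-rank (item_id, relevance, carbon) triples.

    Assumed scoring (not taken from the paper):
    relevance - lam * min-max-normalized carbon footprint.
    """
    carbons = [carbon for _, _, carbon in items]
    lo, hi = min(carbons), max(carbons)
    span = (hi - lo) or 1.0  # avoid division by zero when all footprints are equal

    def score(item):
        _, relevance, carbon = item
        return relevance - lam * (carbon - lo) / span

    ranked = sorted(items, key=score, reverse=True)
    return [item_id for item_id, _, _ in ranked]


# Hypothetical candidates: (item id, model relevance score, estimated PCF in kg CO2e)
candidates = [("kettle", 0.90, 40.0), ("mug", 0.85, 5.0), ("blender", 0.60, 25.0)]
baseline = rerank(candidates, lam=0.0)  # pure relevance order
greener = rerank(candidates, lam=0.5)   # carbon-penalized order
```

With `lam=0.0` the order follows relevance alone; raising `lam` moves the low-footprint item ahead of the slightly more relevant high-footprint one. Sweeping `lam` over a grid and recording engagement and carbon at each setting is one way to trace the kind of Pareto frontier the abstract mentions.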

计算机视觉

1. 【2606.07514】UniSHARP: Universal Sharp Monocular View Synthesis

链接：https://arxiv.org/abs/2606.07514

作者：Meixi Song,Dizhe Zhang,Hao Ren,Ruiyang Zhang,Bo Du,Ming-Hsuan Yang,Lu Qi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：conventional perspective cameras, omnidirectional panoramic settings, popular photorealistic view, photorealistic view synthesis, extending SHARP

备注： Project page: [this https URL](https://insta360-research-team.github.io/Unisharp-website/)

Abstract:In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal monocular rendering across a continuum of camera systems, from conventional perspective cameras to wide-field-of-view, fisheye and omnidirectional panoramic settings. To overcome the pinhole-specific assumptions of SHARP, our key idea is to align various images in a unified omnidirectional latent space. Thus, we propose UniSHARP, which performs implicit alignment in both feature and Gaussian spaces. Specifically, Gaussian primitives are arranged along rays and radial distances in a ray-based universal representation, while 2D semantic and 3D spatial features extracted from UniK3D-inspired encoders are jointly decoded to generate the complete Gaussian cloud. To comprehensively evaluate our method, we construct a benchmark covering diverse imaging systems across various scenes. The benchmark is further stratified by field of view (FoV) to enable fine-grained assessment of the universal monocular rendering task. Extensive experiments on the proposed benchmark demonstrate the effectiveness of UniSHARP, outperforming alternative methods by a large margin. The project page can be found at: https://insta360-research-team.github.io/Unisharp-website/

2. 【2606.07512】MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

链接：https://arxiv.org/abs/2606.07512

作者：Cong Chen,Guo Gan,Kaixiang Ji,ChaoYang Zhang,Zhen Yang,Guangming Yao,Hao Chen,Jingdong Chen,Yi Yuan,Chunhua Shen

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Current Vision-Language Models, processing full-length visual, full-length visual sequences, visual sequences induces, sequences induces prohibitive

备注：

3. 【2606.07508】Streaming Video Generation with Streaming Force Control

链接：https://arxiv.org/abs/2606.07508

作者：Hanhui Wang,Yiming Xie,Haiwen Feng,Zhaoyang Lv,Shenlong Wang,Huaizu Jiang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：enables physically grounded, physically grounded control, continuous force inputs, streaming video generation, video generation framework

备注：

Abstract:We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, or rely on non-causal processing, StreamForce is a causal and unified model that responds instantly and coherently to both local and global, time-varying forces. To achieve this, we design a unified force representation as a control signal and develop a distillation pipeline for force-controllable video generation. Our model combines autoregressive efficiency with force responsiveness, sustaining stable photometric and dynamic realism. StreamForce runs at up to 16.6 FPS on a single GPU, achieving state-of-the-art performance in both force adherence and motion realism. Project website: this https URL

4. 【2606.07503】Differences in Detection: Explainability Where it Matters

链接：https://arxiv.org/abs/2606.07503

作者：Johannes Theodoridis,Johannes Maucher,Andreas Schilling

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：object detection models, compare two object, Average Precision, ground truth labels, propose Differences

备注： Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2026 - How Do Vision Models Work? (HOW)

Abstract:We propose Differences in Detection (DnD), an intuitive method to compare two object detection models. Based on the same matching algorithm, it complements the standard metrics of mean Average Precision ($mAP$) and TIDE error analysis with the ability to compare two models directly. More specifically, we calculate the intersection of ground truth labels that are recognized by both models, followed by the corresponding difference sets and the complement set of ground truth labels that are missed by both models. The resulting comparison is more direct and intuitive than a comparison of independent summary statistics. It reveals individual and shared mistakes and becomes particularly interesting when combined with error types. In this case, the differences in detection errors can be analyzed naturally in a standard confusion matrix. While valuable in itself, we believe that one of the best applications of DnD is to guide explainability methods such as ODAM towards metric-relevant examples, grounded in structured subsets. The code for our method is available here: this https URL
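
The set construction described in this abstract (labels found by both models, by only one, or by neither) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the matching step has already produced, for each model, the ids of the ground-truth labels it detected, and the function name `differences_in_detection` is hypothetical.

```python
def differences_in_detection(ground_truth, found_a, found_b):
    """Partition ground-truth label ids by which model detected them."""
    gt = set(ground_truth)
    a = set(found_a) & gt  # ignore ids that are not ground-truth labels
    b = set(found_b) & gt
    return {
        "both": a & b,            # recognized by both models
        "only_a": a - b,          # recognized by model A only
        "only_b": b - a,          # recognized by model B only
        "neither": gt - (a | b),  # missed by both models
    }


# Hypothetical ground-truth ids 0..5 and per-model matched ids
parts = differences_in_detection(range(6), found_a={0, 1, 2, 3}, found_b={2, 3, 4})
```

The four sets are disjoint and together cover every ground-truth label, so each subset can be inspected separately, for example by error type.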

5. 【2606.07498】Implicit Data Synthesis for Contrastive Unsupervised Data Augmentation

链接：https://arxiv.org/abs/2606.07498

作者：Patrick Kage,Trevor Hedges,N. Siddharth,Pavlos Andreadis

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：making unsupervised learning, generate large quantities, laborious to hand-label, making unsupervised, learning techniques valuable

备注： 11 pages, 3 figures, 2 tables

Abstract:Scientific observations generate large quantities of unlabeled data which is laborious to hand-label, making unsupervised learning techniques valuable for processing datasets. Among these approaches, contrastive learning provides a convenient mechanism for extracting structural representations from unannotated datasets. For natural imagery, the general approach is to use a variety of data-space augmentation methods in order to generate synthetic samples; however, for scientific observations data-space perturbations can fundamentally alter the underlying data. Our proposed method is to generate contrastive samples by perturbing the network weights rather than the underlying data, thus more closely preserving the structure of the data. We demonstrate this technique using a SimCLR-based pipeline applied over radar observations of meteors, and show performance gains under matched protocols.

6. 【2606.07464】Planning-aligned Token Compression for Long-Context Autonomous Driving

链接：https://arxiv.org/abs/2606.07464

作者：Zhixuan Liang,Yuxiao Chen,Yurong You,Peter Karkus,Wenhao Ding,Boyi Li,Alexander Popov,Yan Wang,Maximilian Igl,Yiming Li,Danfei Xu,Nikolai Smolyanskiy,Boris Ivanovic,Ping Luo,Marco Pavone

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Monolithic vision-action models, vision-action models represent, Monolithic vision-action, vision-action models, models represent

备注： 9 pages

Abstract:Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve a 6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with 3.3× speedup and 2.7× memory reduction over uncompressed processing.

7. 【2606.07451】TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

链接：https://arxiv.org/abs/2606.07451

作者：Sweta Mahajan,Sukrut Rao,Jiahao Xie,Alexander Koller,Bernt Schiele

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：diverse tasks due, image-text embedding space, shared image-text embedding, Vision-language models, diverse tasks

备注： 20 pages, 13 figures, 14 tables

8. 【2606.07436】Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

链接：https://arxiv.org/abs/2606.07436

作者：Haoyuan Li,Zhengdong Hu,Jun Wang,Hehe Fan,Yi Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：MLLM agents performing, paper explores agentic, MLLM agents, paper explores, spatial understanding

备注：

Abstract:This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.

9. 【2606.07435】The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

链接：https://arxiv.org/abs/2606.07435

作者：Rishabh Jain,Naomi Harte

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Visual speech recognition, human-like visual speech, speech recognition, surpass human lipreaders, gains establish human-like

备注： Accepted at INTERSPEECH 2026

10. 【2606.07433】Watch, Remember, Reason: Human-View Video Understanding with MLLMs

链接：https://arxiv.org/abs/2606.07433

作者：Jiahao Meng,Yue Tan,Qi Xu,Kuan Gao,Weisong Liu,Yanwei Li,Jason Li,Lingdong Kong,Haochen Wang,Qianyu Zhou,Jiangning Zhang,Guangliang Cheng,Yunhai Tong,Lu Qi,Minghsuan Yang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词：multimodal large language, large language models, knowledge-intensive video scenarios, clips to long, rapidly transformed

备注：

Abstract:Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at this https URL.

11. 【2606.07431】OpenGlass: Open-Source Smart Glasses for On-Device Event-Based Gesture Recognition

链接：https://arxiv.org/abs/2606.07431

作者：Pietro Bonazzi,Julian Moosmann,Ahmet Celik,Philipp Mayer,Michele Magno

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：eyewear enables unobtrusive, compact form factor, Smart eyewear enables, enables unobtrusive, context-aware interaction

备注：

Abstract:Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.8 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 33.9 ms end-to-end latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.

12. 【2606.07419】DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

链接：https://arxiv.org/abs/2606.07419

作者：Tony Danjun Wang,Tolga Birdal,Nassir Navab

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：analyzing interacting behaviors, interacting behaviors, fundamental bottleneck, bottleneck for analyzing, analyzing interacting

备注：

Abstract:Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.

13. 【2606.07401】RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

链接：https://arxiv.org/abs/2606.07401

作者：Ameya Joshi,Joon Kim,Gus Eggert,Joseph Bajor,Cindy Hao,Jing Reyhan,Kushal Byatnal,Eli Badgio

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：financial reporting, supply-chain logistics, deployed in high-stakes, mortgage underwriting, clinical records

备注：

Abstract:Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.

14. 【2606.07394】Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Link: https://arxiv.org/abs/2606.07394

Authors: Danial Hamdi, Fardin Ayar, Mahdi Javanmardi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: loss remain opaque, performance loss remain, Video Instance Segmentation, Integer Linear Program, Instance Segmentation

Comments:

Abstract:In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.

15. 【2606.07368】Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

Link: https://arxiv.org/abs/2606.07368

Authors: Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Automated mitosis detection, Automated mitosis, computational pathology, well-established task, task in computational

Comments:

Abstract:Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge also required detection in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging ROIs, where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, TTA showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.
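
For readers unfamiliar with the two metrics quoted above, here is a short sketch of their standard definitions. The counts in the example are invented purely for illustration and are not challenge results. How the challenge matches detections to ground truth before counting is not described in the abstract, so that step is left out.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (TP rate) and specificity (TN rate) for a binary task."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical counts, chosen only to exercise the formulas.
detection_f1 = f1_score(tp=70, fp=20, fn=30)
amf_bacc = balanced_accuracy(tp=45, fn=5, tn=180, fp=20)
```

Balanced accuracy weights both classes equally, which matters when one class is much rarer than the other. A plain accuracy score would be dominated by the majority class.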

16. 【2606.07366】Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Link: https://arxiv.org/abs/2606.07366

Authors: Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Self-driving simulations typically, hand-authored synthetic scenarios, simulations typically rely, Self-driving simulations, typically rely

Comments:

Abstract:Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

17. 【2606.07355】Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

Link: https://arxiv.org/abs/2606.07355

Authors: Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: online recognition aims, classify subtle gestures, Micro-gesture online recognition, online recognition, recognition aims

Comments: Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026

Abstract:Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

18. 【2606.07338】VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

Link: https://arxiv.org/abs/2606.07338

Authors: Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: driving models increasingly, existing driving rationales, Vision-language driving models, models increasingly, frontier models

Comments:

Abstract:Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.

19. 【2606.07333】Varifold Moment Invariants for Sustainable and Explainable Contour Feature Extraction

Link: https://arxiv.org/abs/2606.07333

Authors: G. Longari, J.-C. Alvarez Paiva, A.B. Tumpach

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: previously introduced Moment, Extended Gaussian Image, Elliptic Fourier Descriptors, introduced Moment Invariants, introduce Varifold Moments

Comments: 29 pages, 12 figures

Abstract:We introduce Varifold Moment Invariants (VMI) as a unifying framework for many previously introduced Moment Invariants. These invariants are deeply related to other contour features that are invariant under translations and rotations, such as the Extended Gaussian Image, Elliptic Fourier Descriptors or Shape Distributions. The advantage of the varifold approach to moments is that it combines the geometry of the region, its boundary, and the family of lines tangent to it, in order to create a substantial number of invariant features with high discriminating power and clear geometric meaning. By coupling our VMI feature extraction with the light feature classifiers Random Forest or Multi-Layer-Perceptron, we outperform state-of-the-art approaches based on contours, while drastically decreasing the computational cost to the point of allowing our algorithm to run on light devices. We tested our approach on classification tasks on a large number of widely-used datasets of various types (leaves, objects, cells) and achieved high accuracy with a low number of geometrically interpretable features.

20. 【2606.07326】AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Link: https://arxiv.org/abs/2606.07326

Authors: Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modeling remains underexplored, versatile controllability required, interactive world modeling, world modeling remains, pivotal frontier

Comments:

Abstract:Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. It allows the model to observe the agent's full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

21. 【2606.07311】CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

Link: https://arxiv.org/abs/2606.07311

Authors: Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: accurately represent diverse, represent diverse global, diverse global cultures, global cultures remains, understudied frontier

Comments:

Abstract:As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

22. 【2606.07289】Closed-Form Spectral Regularization for Multi-Task Model Merging

Link: https://arxiv.org/abs/2606.07289

Authors: Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen, Chun Yuan, Peng Cui, Dacheng Tao

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: independently fine-tuned experts, large foundation models, reducing the storage, independently fine-tuned, fine-tuned experts

Comments:

Abstract:Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
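
The idea of replacing the pseudoinverse with a spectral filter can be written out as a generic sketch. The notation is hypothetical and not taken from the paper: $A = U \Lambda U^\top$ stands for a symmetric positive semidefinite per-layer operator, $b$ for the right-hand side of the normal equation $Ax = b$, $t > 0$ for a flow time, and $K$ for a truncation rank. The exact filter used by SWUDI may differ.

```latex
% Pseudoinverse solution: small eigenvalues \lambda_i amplify noise in b through 1/\lambda_i.
\hat{x}_{\mathrm{pinv}} = \sum_{i:\,\lambda_i > 0} \frac{1}{\lambda_i}\,(u_i^\top b)\,u_i

% General spectral filtering estimator with a per-direction filter g.
\hat{x}_{g} = \sum_{i} g(\lambda_i)\,(u_i^\top b)\,u_i

% Gradient flow \dot{x}(t) = b - A x(t) from x(0) = 0 yields the soft exponential filter
% (1 - e^{-t\lambda})/\lambda; adding a hard top-K truncation
% (eigenvalues sorted \lambda_1 \ge \lambda_2 \ge \dots) gives
g(\lambda_i) =
\begin{cases}
  \dfrac{1 - e^{-t\lambda_i}}{\lambda_i}, & i \le K, \\[6pt]
  0, & i > K.
\end{cases}
```

As $\lambda_i \to 0$ the soft filter tends to $t$ rather than diverging like $1/\lambda_i$. This is the sense in which a finite number of descent steps acts as an implicit regularizer.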

23. 【2606.07288】ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

Link: https://arxiv.org/abs/2606.07288

Authors: Chuanjin Fan, Lifan Wu, Wenjie Chang, Hanzhi Chang, Wenfei Yang, Tianzhu Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Reconstructing surface meshes, Reconstructing surface, recent years, multi-view images, images has remained

Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

Abstract:Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.

24. 【2606.07280】Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation

Link: https://arxiv.org/abs/2606.07280

Authors: Zihao Zhang, Aming Wu, Yang Li, Yahong Han, Jialie Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cloud segmentation aims, point cloud segmentation, aims to transfer, transfer knowledge, automatically identify

Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract:Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph-based framework that models high-order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric-Aware Prototypes to enhance the representation of class-level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.

25. 【2606.07249】Reconstructing Multi-Decadal Forest Disturbances: A Spatio-Temporal Transformer Approach

Link: https://arxiv.org/abs/2606.07249

Authors: Linus Scheibenreif, Anton Raichuk, Maxim Neumann

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding carbon dynamics, traditional approaches typically, approaches typically rely, ignoring spatial context, contiguous United States

Comments:

Abstract:Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time-series, ignoring spatial context. We present a deep learning framework that maps 38 years (1984-2022) of forest disturbance across the contiguous United States by modeling temporal trajectories and spatial neighborhoods simultaneously. By leveraging a vision transformer architecture, our approach effectively filters noise from weak supervision signals to produce spatially coherent disturbance maps. We perform exhaustive evaluations across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the more recent 6 years), validating performance against a novel, manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The results highlight the complexity of the task: while our spatio-temporal model demonstrates high precision (up to 98.2% for ±1 year detection on MTBS and up to 71.3% on the CONUS validation datasets, with F1-scores up to 75.8% and 47.3%, respectively) and effectively reduces spatial artifacts, it exhibits performance trade-offs across different disturbance regimes compared to pixel-wise baselines. Our method offers a promising foundation for consistent forest monitoring.

26. 【2606.07244】Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

Link: https://arxiv.org/abs/2606.07244

Authors: Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Navigation in Continuous, Continuous Environments, follow natural-language instructions, Vision-Language Navigation, requires agents

Comments:

Abstract:Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best waypoint, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

27. 【2606.07233】Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

Link: https://arxiv.org/abs/2606.07233

Authors: Eduardo Borges, Luís Garrote, Urbano J. Nunes

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: typically relies solely, crowded human-populated environments, typically relies, human-populated environments, relies solely

Comments: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract:LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
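
The contrast between fusing costs and cascading them can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: the stage order (motion first, appearance second), the greedy matcher used in place of an optimal assignment solver, the gate values, and all names are hypothetical.

```python
from typing import Dict, List

Cost = List[List[float]]  # cost[t][d] for track index t and detection index d


def greedy_match(cost: Cost, tracks: List[int], dets: List[int],
                 max_cost: float) -> Dict[int, int]:
    """Pair tracks and detections in ascending cost order, ignoring pairs above max_cost."""
    candidates = sorted(
        (cost[t][d], t, d) for t in tracks for d in dets if cost[t][d] <= max_cost
    )
    used_tracks, used_dets = set(), set()
    matches: Dict[int, int] = {}
    for _, t, d in candidates:
        if t not in used_tracks and d not in used_dets:
            matches[t] = d
            used_tracks.add(t)
            used_dets.add(d)
    return matches


def cascaded_match(motion: Cost, appearance: Cost,
                   motion_gate: float, appearance_gate: float) -> Dict[int, int]:
    """Stage 1 matches on motion cost; stage 2 retries the leftovers on appearance cost."""
    tracks = list(range(len(motion)))
    dets = list(range(len(motion[0]))) if motion else []
    stage1 = greedy_match(motion, tracks, dets, motion_gate)
    matched_dets = set(stage1.values())
    left_tracks = [t for t in tracks if t not in stage1]
    left_dets = [d for d in dets if d not in matched_dets]
    stage2 = greedy_match(appearance, left_tracks, left_dets, appearance_gate)
    return {**stage1, **stage2}


# Toy costs (made up): track 2 was occluded, so its motion prediction has drifted,
# but its appearance embedding is still close to detection 2.
motion = [[0.2, 5.0, 6.0], [5.0, 0.3, 7.0], [9.0, 9.0, 8.0]]
appearance = [[0.1, 0.9, 0.8], [0.9, 0.2, 0.9], [0.8, 0.9, 0.15]]
assignment = cascaded_match(motion, appearance, motion_gate=1.0, appearance_gate=0.3)
```

In this sketch appearance never overrides a confident geometric match, so noisy visual features can only affect tracks that geometry failed to associate.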

28. 【2606.07222】DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection

Link: https://arxiv.org/abs/2606.07222

Authors: Bahman Jafari Tabaghsar, Son Tran, K. Devaraja, Atul Sajjanhar

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: images strongly depends, surrounding tissue context, visually similar cells, histopathology images strongly, images strongly

Comments: 15 pages, 4 figures

Abstract:Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to different classes under different microenvironments. Recent tissue-aware methods incorporate contextual priors, but often rely on static fusion strategies that may propagate noisy information. In this work, we propose DualGate-Net, a prior-aware dual-encoder framework that combines a ConvNeXtV2-based local encoder and a SegFormer-based global encoder through a learnable prior-gated fusion mechanism. The proposed module adaptively regulates the influence of tissue priors across spatial locations, while an auxiliary foreground reconstruction branch preserves high-frequency cellular structures during training. In addition, auxiliary cellness-guided cues are incorporated to further improve localization robustness. Experiments on the OCELOT benchmark demonstrate consistent improvements, achieving macro F1-scores of 0.7722 on the validation set and 0.7345 on the test set, highlighting the effectiveness of adaptive prior integration for robust histopathology cell detection.

29. 【2606.07217】Robotic Policy Adaptation via Weight-Space Meta-Learning

Link: https://arxiv.org/abs/2606.07217

Authors: Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: enabling general-purpose policies, general-purpose policies trained, enabling general-purpose, promising paradigm, general-purpose policies

Comments:

Abstract:Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.

30. 【2606.07185】AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

Link: https://arxiv.org/abs/2606.07185

Authors: Xiaocheng Lu, Yuxi Chen, Jie Zhang, Jian Liu, Jingcai Guo, Fangqi Zhu, Tao Han, Song Guo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: grids to recent, typically encode, Prioritized Representation Learning, Image, fixed number

Comments: Preprint; 11 pages, 4 figures

Abstract:Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer itself. In this work, we ask whether a discrete visual tokenizer can budget itself in one pass. Our central finding is that actionable elasticity requires a representation-allocation co-design: prefixes must remain decodable across budgets, and the tokenizer must learn which prefix each image needs. We propose AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking and resolves budget-dependent semantic shift through Multi-Head LoRA decoder heads, with Adaptive Token Allocation, which trains a lightweight deterministic-group GRPO policy over candidate budgets. Dynamic Pareto Weighting balances fidelity and efficiency during policy training without manual trade-off sweeps. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using only ~118 tokens on average, outperforming discrete 1D baselines at comparable budgets. In autoregressive image generation, the shorter adaptive representation yields ~2.1x throughput over a fixed 256-token decode, suggesting that visual token count can be learned as a content-conditioned output rather than set as a fixed hyperparameter.

31. 【2606.07180】OPTIMUS-Prime: Minimal and Sufficient Concept Explanations for Deep Vision Models

Link: https://arxiv.org/abs/2606.07180

Authors: Arthur Hoarau, Chenrui Zhu, Vu Linh Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: eXplainable Artificial Intelligence, propelled eXplainable Artificial, Artificial Intelligence, machine learning research, eXplainable Artificial

Abstract:The growing demand for transparency in automated decision-making has propelled eXplainable Artificial Intelligence (XAI) to the forefront of machine learning research. In computer vision, however, existing explanation methods often prioritize end-user accessibility at the expense of formal guarantees, leaving a critical gap between practical utility and theoretical rigor. In this paper, we address this gap by introducing OPTIMUS, a novel framework for generating concept-based visual explanations for deep classification models. OPTIMUS explanations take the form of visual heatmaps that not only remain interpretable to end users, but are grounded in the well-established theory of prime implicants, providing formal guarantees that have been largely absent from existing saliency-based methods. Specifically, OPTIMUS explanations satisfy two desirable properties: sufficiency, ensuring that the highlighted concepts provably guarantee the classifier's prediction, and minimality, ensuring that no strict subset of those concepts retains this guarantee. Together, these properties yield explanations that are both logically tight and visually coherent. We validate our approach on a visual classification benchmark, demonstrating that OPTIMUS heatmaps naturally and faithfully surface the decision-relevant concepts underlying model predictions.
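
The two properties named in the abstract, sufficiency and minimality, can be illustrated on a toy problem. The sketch below is not the OPTIMUS algorithm: it assumes a hypothetical classifier over four binary concepts, checks sufficiency by brute-force enumeration, and finds a subset-minimal explanation by greedy deletion. The names `is_sufficient`, `minimal_sufficient_subset`, and `toy_classifier` are illustrative only.

```python
from itertools import product


def is_sufficient(classify, x, subset):
    """True if fixing x's values on `subset` forces the same prediction
    for every assignment of the remaining binary concepts."""
    target = classify(x)
    free = [i for i in range(len(x)) if i not in subset]
    for values in product([0, 1], repeat=len(free)):
        candidate = list(x)
        for i, v in zip(free, values):
            candidate[i] = v
        if classify(tuple(candidate)) != target:
            return False
    return True


def minimal_sufficient_subset(classify, x):
    """Greedy deletion: start from all concepts and drop each one whose
    removal keeps the remaining set sufficient."""
    subset = set(range(len(x)))
    for i in range(len(x)):
        trial = subset - {i}
        if is_sufficient(classify, x, trial):
            subset = trial
    return subset


def toy_classifier(c):
    # Hypothetical rule: predict 1 when (concept 0 AND concept 1) OR concept 2.
    return int((c[0] and c[1]) or c[2])


x = (1, 1, 0, 1)
explanation = minimal_sufficient_subset(toy_classifier, x)
print(sorted(explanation))
```

Because any superset of a sufficient set is also sufficient, a single deletion pass already yields a set from which no concept can be removed. Brute-force enumeration is exponential in the number of concepts, so this only works as a toy.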

32. 【2606.07179】EvoGS: Constructing Continuous-Layered Gaussian Splatting with Evolution Tree for Scalable 3D Streaming

Link: https://arxiv.org/abs/2606.07179

Authors: Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Keywords: Gaussian Splatting requires, Gaussian Splatting, Splatting requires highly, Splatting requires, requires highly scalable

Comments: Project page: https://yuang-ian.github.io/evogs/

Abstract:Streaming 3D Gaussian Splatting requires highly scalable, progressive representations. Existing progressive methods rely on *discrete layering*, accumulating separate splat sets for each level of detail. This structural independence between layers inherently leads to error accumulation, severe splat redundancy, and uncontrolled quality transitions. We propose EvoGS, the first *continuous-layering* representation. Organized as an Evolution Tree, EvoGS generates finer details via an explicit, wavelet-inspired parent-child refinement. This empowers child nodes to structurally correct ancestral errors, yielding inherently sparse and highly compressible inter-layer signals. Extensive experiments show that EvoGS reduces splat redundancy from over 65% to under 25%. Compared to state-of-the-art baselines, it reduces transmission payload and GPU VRAM footprint by up to 2.4× and 5.5×, respectively, and achieves smooth quality transitions optimal for real-time adaptive streaming.

33. 【2606.07175】Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs

Link: https://arxiv.org/abs/2606.07175

Authors: Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, large language models, large language, language models

Abstract:Multimodal large language models (MLLMs) have raised new privacy challenges. On the data side, user-provided inputs often include unpredictable sensitive information; while on the downstream task side, model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods, however, rely on predefined sensitive categories and fixed obfuscation strategies, struggling to tackle such challenges in MLLMs. To address this dilemma, we propose Anchored Privacy Drifting (APD), a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image. To systematically evaluate this dual objective of privacy protection and contextual preservation, we introduce AdaptShield, a comprehensive benchmark covering 22 privacy categories, which combines conventional privacy metrics with MLLM-based assessments of contextual utility. Extensive experiments show that our method achieves balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation across four MLLM series, i.e., Qwen2.5, Qwen3, InternVL3, and InternVL3.5.

34. 【2606.07172】Textual Supervision Enhances Geospatial Representations in Vision-Language Models

Link: https://arxiv.org/abs/2606.07172

Authors: Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: machine learning systems, critical yet underexplored, underexplored dimension, development of machine, systems for tasks

Comments: Accepted at ICML 2026

35. 【2606.07171】When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

Link: https://arxiv.org/abs/2606.07171

Authors: Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, enable flexible instruction-driven

Abstract:Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet, the resulting output is often an edited surrogate rather than the desired edited source image, neglecting the local recovery in both design and evaluation scope. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.

36. 【2606.07161】TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Link: https://arxiv.org/abs/2606.07161

Authors: Duc Tri Tran, Trung Thanh Nguyen, Vijay John, Phi Le Nguyen, Yasutomo Kawanishi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Text Spotting, intelligent transportation systems, enabling automated reading, Text Spotting, vehicle markings

Comments: 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems

Abstract:Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) the Temporal Clustering and (2) the Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation-enhanced Vision-Language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at this https URL.

37. 【2606.07145】Consistent-Inversion: Reverse Consistency Guidance for Structure-Preserving Visual Editing

Link: https://arxiv.org/abs/2606.07145

Authors: Xiaocheng Lu, Jingcai Guo, Song Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving editing-irrelevant structure, Text-guided diffusion models, Text-guided diffusion, real-image visual editing, editing-irrelevant structure

Comments: Submitted to IEEE Transactions on Multimedia; 10 pages, 4 figures

Abstract:Text-guided diffusion models have become effective tools for real-image visual editing, where the edited image must follow a target instruction while preserving editing-irrelevant structure. Most training-free editors rely on inversion: a source image is mapped to a noisy latent trajectory and the terminal latent is reused for target-prompt denoising. This reuse is useful for preservation, but it also couples source reconstruction and target editing. The resulting trajectory mismatch may either damage background/layout details or over-constrain the intended edit. This paper presents Consistent-Inversion, a training-free reverse consistency guidance framework for structure-preserving visual editing. Instead of treating the inverted source latent as a fixed initialization, Consistent-Inversion checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt. To make this check well-defined, we construct an auxiliary target-side noise representation, perform source-guided reverse denoising, and use the resulting reverse consistency discrepancy as a correction signal for selected early target denoising steps. The method does not update model parameters, is compatible with inversion-based editors, and introduces only a small inference overhead when applied sparsely. Experiments on PIE-Bench show that Consistent-Inversion improves background and structural fidelity under a unified SD3.5 protocol while maintaining target-prompt alignment, and compatibility experiments further verify the same correction principle on classical Stable-Diffusion inversion pipelines.

38. 【2606.07117】Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

Link: https://arxiv.org/abs/2606.07117

Authors: Yibo Liu, Ziwei Zhang, Haozhou Pang, Menghao Li, Lanshan He, Gan Qi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: paper presents, completely bypasses, framework that completely, Representation Alignment Loss, intermediate representations

Abstract:This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations. Traditional approaches typically require adapting 3D representations to the 2D domain to leverage pre-trained diffusion models, which inevitably introduces domain adaptation issues including geometric structural distortion and texture detail degradation. To address these limitations, we design a unified mesh-texture joint representation that simultaneously models both geometric structures and texture features through a Transformer-based scene encoder, effectively maintaining spatial relationships and visual consistency among objects within scenes. We further propose the 3D Representation Alignment Loss (3D REPA Loss), which employs an improved contrastive learning mechanism to align multi-level semantic representations in the latent space, significantly enhancing geometric and textural fidelity. Experimental results demonstrate that Native3D outperforms existing methods in both generation quality and editing flexibility, providing a novel solution for 3D scene editing.

39. 【2606.07115】3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing

Link: https://arxiv.org/abs/2606.07115

Authors: Tobias Preintner, Yunfei Deng, Phillip Müller, Sebastian Illing, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: shapes remains limited, remains limited, recent progress, existing shapes remains, editing

Comments: Accepted to IJCNN 2026

Abstract:Despite recent progress in 3D generation, intuitive editing of existing shapes remains limited. Unlike images, which benefit from well-established inpainting tools, general 3D objects such as meshes still lack simple and effective methods for local shape editing. Existing approaches are often global, domain-specific, require complex user interaction, or focus on appearance (color and texture) rather than geometry. We introduce 3DMorph, a training-free framework for single-image-guided local 3D shape editing and morphing. Given an edited image showing a desired shape modification, our method automatically localizes the relevant 3D region and transfers 2D modifications to 3D while preserving unmodified areas. 3DMorph also enables intermediate shape generation between the original and edited objects, facilitating design exploration. To benchmark editing quality, we introduce Delta3D, an image-guided local 3D editing benchmark with paired ground-truth edits. Experimental results show that 3DMorph translates intuitive 2D edits into 3D, outperforming state-of-the-art generative and editing methods.

40. 【2606.07102】GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection

Link: https://arxiv.org/abs/2606.07102

Authors: Taisei Saito, Koretaka Ogata, Takafumi Hiroi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Contrastive Language-Image Pre-training, Gaussian Process, Contrastive Language-Image, Language-Image Pre-training, training-free framework

Comments: 8 pages, 6 figures, Accepted at IJCNN 2026

Abstract:We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at this https URL
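
To make the variance-based scoring and the $O(CK^2)$ memory cost concrete, here is a minimal sketch of a class-wise GP predictive variance with an RBF kernel. It is not the GP-Adapter implementation: the 2D points stand in for frozen CLIP image embeddings, the text branch with its linear kernel and the fusion step are omitted, and the names `gp_variance`, `ood_score`, the noise level, and the "minimum variance over classes" score are all assumptions made for illustration.

```python
import math


def rbf(a, b, length_scale=1.0):
    """RBF kernel between two feature vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_variance(support, query, noise=1e-3):
    """GP predictive variance at `query` given one class's K support points.
    The K x K Gram matrix built here is what gives C * K^2 memory overall."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(support)]
            for i, a in enumerate(support)]
    k_star = [rbf(s, query) for s in support]
    alpha = solve(gram, k_star)
    return rbf(query, query) - sum(k * a for k, a in zip(k_star, alpha))


# Hypothetical 3-shot cache for two classes (toy 2D "embeddings").
support_by_class = {
    "cat": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "dog": [(3.0, 3.0), (3.1, 3.0), (3.0, 3.1)],
}


def ood_score(query):
    """Assumed score: the smallest class-wise variance; high means 'far from every class'."""
    return min(gp_variance(s, query) for s in support_by_class.values())


near = ood_score((0.05, 0.05))
far = ood_score((10.0, 10.0))
print(near < far)
```

A query close to a class's support set gets a low variance, while a query far from every class approaches the prior variance of 1, which is the behavior a variance-aware OOD score relies on.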

41. 【2606.07100】LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.07100

Authors: Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: robot action datasets, predict actions directly, Visual-language action, VLA, Latent Action

Abstract:Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAM) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAM and VLA are typically trained separately, leaving LAM ungrounded during VLA training and VLA models constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits where LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of ~10%, ~5%, and ~15%, respectively, across three simulation benchmarks and one meticulously designed real-world robotic manipulation benchmark.

42. 【2606.07090】Detecting Temporally Localized Manipulations in Authentic Video Streams

Link: https://arxiv.org/abs/2606.07090

Authors: Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generative artificial intelligence, artificial intelligence technologies, manipulation increasingly accessible, increasingly accessible, rapid advancement

Abstract:The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at this https URL.
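
The second approach in the abstract, detecting boundaries from consecutive-frame similarity, can be sketched as follows. This is an illustration rather than the authors' code: the 2D vectors stand in for per-frame DINOv3 features, feature vectors are assumed to be non-zero, and the fixed `threshold=0.8` is an arbitrary assumption (the abstract itself argues for content-adaptive thresholds).

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def find_boundaries(features, threshold=0.8):
    """Return each index i where the similarity between frame i-1 and
    frame i drops below `threshold`, i.e. a candidate cut point."""
    return [
        i for i in range(1, len(features))
        if cosine(features[i - 1], features[i]) < threshold
    ]


# Toy per-frame features: frames 2 and 3 look unlike their neighbours.
frames = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.05, 0.99), (1.0, 0.02)]
boundaries = find_boundaries(frames)
print(boundaries)
```

A pair of boundaries brackets a candidate inserted segment: here the segment starts at frame 2 and the original content resumes at frame 4.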

43. 【2606.07086】An Adaptive Data cleaning Framework for Noisy Label Detection

Link: https://arxiv.org/abs/2606.07086

Authors: Chen-Hsuan Fang, Wei-Hsinag Chen, Pin-Hsuan Yu, Jung-Hua Wang, Tsung-Wei Pan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep neural networks, large annotated datasets, computer vision tasks, Deep neural, neural networks

Abstract:Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.

44. 【2606.07079】AsyncPatch Diffusion: spatially-flexible image generation

Link: https://arxiv.org/abs/2606.07079

Authors: Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Standard diffusion models, Standard diffusion, single shared noise, shared noise level, diffusion models corrupt

Comments: 36 pages, 14 figures

Abstract:Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.

45. 【2606.07058】Constructing VAE Latent Spaces with Prescribed Topology

Link: https://arxiv.org/abs/2606.07058

Authors: Jilles S. van Hulst, Jakub M. Tomczak, W.P.M.H. Heemels, Duarte J. Antunes

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT); Machine Learning (stat.ML)

Keywords: Variational autoencoders, learn low-dimensional latent, low-dimensional latent representations, learn low-dimensional, Variational

Comments: 16 pages, 7 figures

Abstract:Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When the data lies on a manifold with non-Euclidean topology, the standard Gaussian prior introduces a topological mismatch that degrades reconstruction quality and prevents faithful representation. We present a constructive mathematical framework that resolves this mismatch for all manifolds that admit a product covering space. These are manifolds expressible as products of elementary factors (circles, intervals, or lines) or as quotients of such products by a finite symmetry group. The class includes cylinders, tori, Möbius strips, Klein bottles, and real projective spaces. Factorized distributions over the elementary factors yield product topologies with closed-form, decoupled KL divergences, so that each latent factor can be shaped independently while keeping training tractable. We catalogue reparametrizable encoder-prior pairs for periodic, bounded, and unbounded supports, and provide coordinate transformations that allow standard neural networks to output non-Euclidean parameters with smooth gradients. For quotient manifolds, the decoder receives group-invariant features of the covering-space coordinates, so that identified points produce identical outputs. Anchor constraints fix the coordinate system relative to the data or create soft topological holes. Experiments on synthetic manifolds and real-image datasets (rotated and cyclically shifted MNIST) confirm that a topology-matched prior aligns KL regularization with the data manifold. The resulting topology-aware models outperform the Gaussian baseline at all practically relevant regularization strengths. The code is available at this https URL.
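
Two ideas from the abstract lend themselves to a short illustration: mapping unconstrained network outputs to a periodic coordinate, and feeding the decoder group-invariant features so that identified points decode identically. The sketch below is an assumption-laden example, not the paper's construction: it treats the Möbius strip as the quotient of the cylinder $S^1 \times [-1, 1]$ by the identification $(\theta, t) \sim (\theta + \pi, -t)$ and uses one standard choice of invariant features. The function names are hypothetical.

```python
import math


def angle_from_outputs(a, b):
    """Map two unconstrained real outputs to an angle in (-pi, pi].
    Smooth everywhere except at a = b = 0."""
    return math.atan2(b, a)


def mobius_invariant_features(theta, t):
    """Features that are unchanged under (theta, t) -> (theta + pi, -t),
    so both representatives of a Moebius-strip point give the same input
    to a decoder."""
    return (
        math.cos(2.0 * theta),
        math.sin(2.0 * theta),
        t * math.cos(theta),
        t * math.sin(theta),
    )


# Two covering-space coordinates that name the same point of the strip.
p = mobius_invariant_features(0.3, 0.5)
q = mobius_invariant_features(0.3 + math.pi, -0.5)
same_point = all(math.isclose(u, v, abs_tol=1e-9) for u, v in zip(p, q))
print(same_point)
```

Since the decoder only ever sees the invariant features, it cannot produce different outputs for two coordinates that the quotient identifies.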

46. 【2606.07053】TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

Link: https://arxiv.org/abs/2606.07053

Authors: Dian Gu, Zhengyi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: complex multi-person scenarios, Multimodal Diffusion Transformers, multi-person scenarios, suffers from limb, limb distortions

Comments: 15 pages (9 pages main body, 6 pages references and appendix), 3 figures, 5 tables

Abstract:Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of 64.33 on Human-Art, representing a 30% improvement over prior art, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.

47. 【2606.07036】STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

Link: https://arxiv.org/abs/2606.07036

Authors: Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Keywords: Synthetic histopathology image, including patient privacy, addresses critical challenges, image generation addresses, generation addresses critical

Comments: 27 pages, 7 figures

Abstract:Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to "conditioning collapse," where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
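
For features that lie on the unit hypersphere $\mathcal{S}^{d-1}$, the natural interpolation path is the great-circle geodesic rather than a straight line, which is the basic ingredient of a Riemannian formulation on the sphere. The sketch below shows only that deterministic geodesic (spherical linear interpolation). It does not implement STREAM's bridge-type stochastic perturbation or its decoder, and it assumes the two endpoints are unit-norm and not antipodal. The helper names are illustrative.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(x0, x1, t):
    """Point at fraction t along the great-circle geodesic from x0 to x1.
    Assumes both inputs are unit vectors and are not antipodal."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)  # geodesic distance between the endpoints
    if omega < 1e-8:        # endpoints coincide: nothing to interpolate
        return list(x0)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


x0 = normalize([1.0, 0.0, 0.0])
x1 = normalize([0.0, 2.0, 0.0])
mid = slerp(x0, x1, 0.5)
mid_norm = math.sqrt(sum(x * x for x in mid))
print(mid, mid_norm)
```

Unlike linear interpolation, every intermediate point stays on the sphere, so a model trained along such paths never sees features with a norm the encoder would not produce.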

48. 【2606.07034】ForensicConcept: Transferable Forensic Concepts for AIGI Detection

Link: https://arxiv.org/abs/2606.07034

Authors: Menyanshu Zhou, Ziyin Zhou, Ke Sun, Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve high accuracy, AI-generated image detectors, image detectors achieve, detectors achieve high, AI-generated image

Comments: Accepted by ICML 2026

Abstract:AI-generated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose ForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.

49. 【2606.07033】Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Link: https://arxiv.org/abs/2606.07033

Authors: Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-vocabulary audio-visual event, temporally localize events, Open-vocabulary audio-visual, including categories unseen, cues to recognize

Comments:

Abstract:Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Building on this graph, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.

50. 【2606.07032】Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Link: https://arxiv.org/abs/2606.07032

Authors: Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Zero-Shot Composed Image, Composed Image Retrieval, query composed, target image based, Composed Image

Comments:

Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at this https URL.

51. 【2606.07024】GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding

Link: https://arxiv.org/abs/2606.07024

Authors: Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: CAD generation require, substantial computational resources, generation require substantial, require substantial computational, necessitating efficient training

Comments:

Abstract:Multi-modal approaches used for 3D CAD generation require substantial computational resources, necessitating efficient training. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations while training only a small number of parameters to generate 3D CAD models. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. A transformer-based decoder then predicts the construction sequence from the visual-textual embeddings to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, also referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates 3D CAD models of comparable quality while using approximately four times fewer parameters and achieving twice the training efficiency of fine-tuning approaches. We have released the source code and dataset for our method at: this https URL

52. 【2606.06991】Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

Link: https://arxiv.org/abs/2606.06991

Authors: Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, seamless human-AI interaction, Online Video Large, Video Large Language, Video Large

Comments:

Abstract:Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving a fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering a 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
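
The per-frame, sub-budget decoding idea can be illustrated with a small sketch: compute how many tokens fit into one frame interval, then interleave frames with token chunks of at most that size. This is only an illustration of chunked interleaving under a time budget, not LyraV's FDTC or SToP modules; the function names and the latency parameters `perception_s` and `per_token_s` are assumptions introduced here.

```python
import math

def tokens_per_frame(fps, perception_s, per_token_s):
    """Number of tokens that fit in one frame interval after perception.
    perception_s and per_token_s are hypothetical per-step latencies in seconds."""
    slack = 1.0 / fps - perception_s
    if slack <= 0:
        return 0  # no time left for decoding in this interval
    return math.floor(slack / per_token_s)

def interleave(frames, tokens, budget):
    """Emit each frame followed by at most `budget` pending tokens,
    so perception is never blocked for a full sentence."""
    out, i = [], 0
    for frame in frames:
        out.append(("frame", frame))
        chunk = tokens[i:i + budget]
        out.extend(("token", tok) for tok in chunk)
        i += len(chunk)
    return out
```

At the reported 3.89 FPS, one frame interval is roughly 1 / 3.89 ≈ 0.257 s, which bounds the combined time available for perceiving a frame and emitting its token chunk.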

53. 【2606.06978】CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Link: https://arxiv.org/abs/2606.06978

Authors: Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Detection, categories, preserving previously learned, detection ability, Continual

Comments:

Abstract:Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.

54. 【2606.06966】From Vision to Text: A Compact Multimodal Approach for Robust, Cross-Domain Presentation Attack Detection on ID Cards

Link: https://arxiv.org/abs/2606.06966

Authors: Qingwen Zeng, Juan E. Tapia, Sneha Das, Christoph Busch

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Presentation Attack Detection, challenge Presentation Attack, Attack Detection, Presentation Attack, Cross-domain shifts challenge

Comments: Under revision at IEEE

Abstract:Cross-domain shifts challenge Presentation Attack Detection (PAD) on ID Cards, given the restricted data available due to privacy concerns. This work proposes a compact multimodal model, based on new generative and discriminative blocks, which combines visual and textual data for PAD on genuine and synthetic ID images. While multimodal models exhibit strong generalisation after supervised fine-tuning, they fail in zero-shot settings. Our findings underscore that model capacity and real-world data are essential for reliable PAD, while existing synthetic datasets may not reflect real-world challenges. We argue for a re-evaluation of synthetic data as a benchmark and emphasise the need for more realistic, diverse datasets to advance PAD research.

55. 【2606.06958】MVSegNet: A Lightweight Boundary-Aware Network for Fetal Lateral Ventricle Segmentation and Atrial Width Estimation in Prenatal Ultrasound

Link: https://arxiv.org/abs/2606.06958

Authors: Arafat Hossain Sayem

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ventriculomegaly is assessed, assessed by measuring, lateral ventricle, ventricle in prenatal, Fetal ventriculomegaly

Comments: 11 pages, 3 figures, 4 tables. Code and trained models will be released upon acceptance. Supplementary material available upon request

Abstract:Fetal ventriculomegaly is assessed by measuring the atrial width of the lateral ventricle in prenatal ultrasound. Accurate segmentation is essential for this measurement, but acoustic shadowing, speckle noise, and poor contrast make it difficult. We developed MVSegNet, a lightweight encoder-decoder network combining multi-scale feature extraction and boundary-aware refinement. The model was trained and evaluated on 584 expert-annotated transventricular ultrasound frames using a 70/15/15 split. Performance was compared against six segmentation baselines using overlap, boundary, and measurement metrics. MVSegNet achieved a Dice score of 80.79%, IoU of 68.47%, Hausdorff distance of 4.07 mm, and atrial width mean absolute error of 3.40 mm. The model contains 2.31 million parameters and runs at 165.6 frames per second on an NVIDIA T4 GPU. MVSegNet outperformed all evaluated baselines on boundary and measurement metrics while maintaining low computational cost, supporting its use in automated fetal ultrasound analysis.
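
For readers unfamiliar with the overlap metrics quoted above, here is a minimal sketch of Dice and IoU on flattened binary masks. It uses the standard definitions; the function name `dice_and_iou` and the convention of returning 1.0 for two empty masks are choices made for this example, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flattened binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match here
    dice = 2.0 * inter / (p_sum + t_sum)
    iou = inter / union
    return dice, iou
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice). Applying that identity to the reported mean Dice gives 0.8079 / 1.1921 ≈ 0.678, slightly below the reported IoU of 68.47%, which is what one would expect if both metrics were averaged per image, since the mapping is convex.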

56. 【2606.06950】When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

Link: https://arxiv.org/abs/2606.06950

Authors: Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox, Jonathan Bona, Donald Johann Jr., Fred Prior

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: volumetric medical imaging, widely assumed preferable, gains justify added, performance gains justify, justify added computational

Comments: 8 pages, 6 figures

Abstract:Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.
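
Because the abstract leans on a ROC-AUC with a wide 95% confidence interval, a minimal sketch of one common way to obtain such an interval (a percentile bootstrap) may help. The paper does not state which interval method it used, so the bootstrap here is an assumption, as are the function names and defaults.

```python
import random

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC (assumed method, not the paper's)."""
    roc_auc(labels, scores)  # raises early if a class is missing
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one class
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A model that outputs the same score for every case, as in the degenerate all-positive behavior described above, gets an AUC of exactly 0.5 under this definition.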

57. 【2606.06943】SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

Link: https://arxiv.org/abs/2606.06943

Authors: Sunoh Kim, Daeho Um

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain highly fragile, CLIP achieve strong, Vision-language models, CLIP achieve, achieve strong zero-shot

Comments: Accepted at ICML 2026

Abstract:Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at this https URL.

58. 【2606.06938】When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

Link: https://arxiv.org/abs/2606.06938

Authors: Sunoh Kim, Daeho Um

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: zero-shot recognition capabilities, achieved remarkable zero-shot, remarkable zero-shot recognition, Vision-language models, perturbations remains limited

Comments: Accepted at CVPR 2026

Abstract:Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks across multiple views with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine the corrupted embeddings of these views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at this https URL.

59. 【2606.06926】SVHighlights: Towards Extremely Long Sport Video Highlight Detection

Link: https://arxiv.org/abs/2606.06926

Authors: Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: great practical importance, methods remain limited, practical importance, largely due, highlight detection

Comments: Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: [this https URL](https://leedongkyu2019.github.io/SVHighlights/)

Abstract:While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.
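
The segmentation step described for TF-SELECTOR, merging adjacent shots that share the same semantic content, can be sketched as follows. Here a plain string label stands in for whatever semantic comparison the authors actually use, which the abstract does not specify; the function name and the tuple layout are assumptions made for this example.

```python
def merge_adjacent_shots(shots):
    """Merge time-ordered shots (start_s, end_s, label) into segments,
    joining neighbors whose labels match. Label equality is a stand-in
    for the paper's unspecified semantic-content check."""
    segments = []
    for start, end, label in shots:
        if segments and segments[-1][2] == label:
            prev_start = segments[-1][0]
            segments[-1] = (prev_start, end, label)  # extend the open segment
        else:
            segments.append((start, end, label))
    return segments
```

Only neighboring shots are merged, so a label that reappears later in the video starts a new segment rather than being folded into the earlier one.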

60. 【2606.06918】DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

Link: https://arxiv.org/abs/2606.06918

Authors: Abhishek Ameta, Sayan Banerjee, Shreyas Pandith, Harshit, Ankita Chatterjee, Akshay Janardan Bankar, Amit Satish Unde

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models challenges existing, challenges existing AI-generated, generative image models, image models challenges, AI-generated image detectors

Comments: Submitted to ECCV 2026

Abstract:The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task. We instead formulate AI-generated image detection as learning a structured invariance manifold of real images under one-class supervision. Building upon a frozen VFM, we introduce lightweight projection heads that decompose representation space into complementary robust and fragile subspaces. The robust subspace is explicitly trained to suppress variations induced by physically plausible imaging transformations, approximating tangent directions of a real-image manifold, while the fragile subspace retains sensitivity to edit-like perturbations. A structured ordering margin enforces hierarchical separation between physical invariance and edit-induced variability, enabling detection as a margin-violation test relative to the learned manifold. At inference, multi-scale patch-wise drift under both transformation families yields a dual-channel invariance signature and interpretable localization. Extensive experiments demonstrate strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.

61. 【2606.06908】polyDAG: Polynomial Acyclicity Constraints for Efficient Continuous Causal Discovery in Visual Semantic Graphs

Link: https://arxiv.org/abs/2606.06908

Authors: Wenhao Zhang, Ramin Ramezani, Tao Han, Kai Hwang, Minyi Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern image-analysis pipelines, Modern image-analysis, structured semantic variables, object concepts, scene descriptors

Comments:

Abstract:Modern image-analysis pipelines often convert images into structured semantic variables, such as facial attributes, object concepts, and scene descriptors. Learning directed dependencies among these variables can produce interpretable visual semantic graphs, but continuous directed acyclic graph learning is limited by the cost of enforcing acyclicity. We present polyDAG, a polynomial acyclicity framework for efficient continuous causal discovery in visual semantic graphs. polyDAG replaces the matrix-exponential acyclicity constraint with a finite polynomial trace constraint and proves that the new constraint is zero exactly for acyclic graphs. We further derive a geometric-series implementation that avoids the explicit summation loop while preserving the same acyclicity condition. Experiments on synthetic Erdős-Rényi graphs and CelebA facial visual attributes show that polyDAG improves efficiency and structure recovery. Averaged over the revised synthetic protocol with $d \in \{100, 200, 500\}$, polyDAG reduces mean structural Hamming distance from 318.4 to 285.4 and improves mean F1 score from 0.725 to 0.756. At 100 nodes, the geometric variant runs in 3.44 seconds compared with 5.16 seconds for the exponential baseline, corresponding to a 33.4% speedup. Code and data are publicly available at this https URL.
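
To make the idea of a finite polynomial trace constraint concrete, here is a minimal sketch of one well-known form, $h(W) = \sum_{k=1}^{d} \operatorname{tr}\big((W \circ W)^k\big)$, which is zero exactly when the weighted graph has no directed cycle, because each trace sums the weights of closed walks of length $k$ and any cycle contains a simple cycle of length at most $d$. The abstract does not give polyDAG's exact polynomial or its geometric-series variant, so this specific form and the helper names are assumptions.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))

def poly_acyclicity(w):
    """h(W) = sum_{k=1..d} tr((W*W)^k), with * the elementwise square.
    An assumed illustrative form, not necessarily the paper's polynomial."""
    d = len(w)
    a = [[x * x for x in row] for row in w]  # nonnegative edge strengths
    power = [row[:] for row in a]            # holds A^k, starting at k = 1
    h = 0.0
    for k in range(1, d + 1):
        h += trace(power)
        if k < d:
            power = matmul(power, a)
    return h
```

Compared with the matrix exponential, the sum stops at degree $d$, so no infinite series has to be approximated; the naive loop above still costs $d$ matrix products, which is the overhead a closed-form evaluation would aim to remove.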

62. 【2606.06904】ActionMap: Robot Policy Learning via Voxel Action Heatmap

Link: https://arxiv.org/abs/2606.06904

Authors: Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: continuous control signal, backbone hidden state, models have advanced, control signal, advanced rapidly

Comments:

Abstract:Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To address this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head outperforms the native action heads of two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient in low-data regimes. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: this https URL.
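
A voxel heatmap over an action space can be sketched in a few lines: discretize a 3D action range into voxels, build a soft target that spreads probability to voxels near the ground-truth action, and decode an action as the center of the most probable voxel. The Gaussian target, the argmax decoder, the cubic range, and all function names below are assumptions for illustration; the abstract does not describe ActionMap's actual target or decoding rule.

```python
import math
from itertools import product

def voxel_centers(low, high, bins):
    """Centers of a bins x bins x bins grid over the cube [low, high]^3."""
    step = (high - low) / bins
    axis = [low + (i + 0.5) * step for i in range(bins)]
    return list(product(axis, repeat=3))

def gaussian_target(action, centers, sigma):
    """Soft label over voxels: nearby voxels receive more mass, so the
    geometric proximity of actions is reflected (assumed Gaussian form)."""
    weights = [
        math.exp(-sum((a - c) ** 2 for a, c in zip(action, center)) / (2 * sigma ** 2))
        for center in centers
    ]
    total = sum(weights)
    return [w / total for w in weights]

def decode(probs, centers):
    """Return the center of the most probable voxel."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return centers[best]
```

Unlike a single-point regression target, the soft label gives partial credit to predictions that land in a neighboring voxel, which is one way the structure of the action space could enter training.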

63. 【2606.06903】Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

Link: https://arxiv.org/abs/2606.06903

Authors: Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Human image animation, image animation aims, Human image, static reference image, pose information extracted

Comments: Accepted to ICLR 2026

Abstract:Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at this https URL.

64. 【2606.06901】LUCID: Learning Unified Control for Image Deflaring and Exposure Mastery in Nighttime Photography

Link: https://arxiv.org/abs/2606.06901

Authors: Tingyu Yang, Yuan Cheng, Xiaoyun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: obscure scene structure, photon-limited regions collapse, intense flares obscure, flares obscure scene, scene structure

Comments: Accepted by SIGGRAPH 2026

Abstract:Photography is the art of painting with light, yet nighttime scenes are shaped by competing degradations: intense flares obscure scene structure, while photon-limited regions collapse into noise. Conventional approaches address these factors in isolation, overlooking the fact that these degradations are fundamentally entangled. To bridge this gap, we introduce LUCID, a unified framework that reframes nighttime restoration as a continuous and controllable process rather than a fixed correction. We decompose nighttime restoration into two cooperative components: a flare disentanglement module that lifts the 'curtain' of optical artifacts to provide reliable structural guidance, and a diffusion-driven module that leverages generative priors to reconstruct clean and well-exposed imagery. Crucially, LUCID introduces explicit controllability through a novel four-mode training strategy, enabling users to steer the restoration process via classifier-free guidance (CFG) and allowing selective control over light sources and their associated flare and ghosting artifacts, while also supporting high dynamic range (HDR) reconstruction through continuous exposure control. Extensive experiments demonstrate that LUCID consistently outperforms state-of-the-art methods across diverse real-world nighttime scenarios.

65. 【2606.06899】Lighting-Aware Representation Learning under Controllable Lighting Variation

Link: https://arxiv.org/abs/2606.06899

Authors: Lizhen Zhu, Charantej Reddy Pochimireddy, James Z Wang, Brad Wyble

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: induce substantial appearance, remain a major, major challenge, induce substantial, substantial appearance

Comments:

Abstract:Variations in illumination remain a major challenge for visual representation learning, as they induce substantial appearance changes both across and within environments. While existing approaches typically address this issue through data augmentations that encourage models to become invariant to lighting changes, such strategies do not explicitly model lighting information during learning. Inspired by theories of human vision, we propose a lighting-aware representation learning framework that incorporates illumination variation as an explicit training signal rather than a nuisance factor to be suppressed. Our method extends contrastive learning by introducing an auxiliary objective that captures illumination-dependent variation in rendered scenes, enabling the model to jointly learn representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. We evaluate the proposed model on image classification and object detection tasks across the ImageNet, ExDark, and PASCAL VOC benchmarks. Results demonstrate that the proposed lighting-aware training consistently improves downstream performance over standard contrastive learning baselines, while maintaining the same architecture and training budget. Furthermore, our approach shows promising performance in supervised learning frameworks and under settings involving simpler lighting variation, suggesting broad applicability beyond complex illumination scenarios. These results indicate its potential to enhance model robustness and adaptability in complex visual environments as well as in more conventional image processing tasks.

66. 【2606.06891】Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Link: https://arxiv.org/abs/2606.06891

Authors: Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang, Jianke Zhu, Dong Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multimodal Models, Large Multimodal, Multimodal Models operate, requiring complete scene, complete scene observations

Comments: Project Page: [this https URL](https://stream3d-vlm.github.io/)

Abstract:Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts autoregressive streaming-control modeling based on the LLM's next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at this https URL.

67. 【2606.06890】Diagnosing Visual Ignorance in Vision-Language Models

Link: https://arxiv.org/abs/2606.06890

Authors: Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: producing confident answers, frequently rely, producing confident, weakly grounded, visual

Comments:

Abstract:Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
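
The behavioral test described in this abstract (answers that stay invariant as the image is progressively blurred) can be illustrated with a minimal sketch. The abstract does not give the exact metric, so the functions `invariance_depth` and `is_visually_ignorant`, and the idea of counting consecutive unchanged answers, are assumptions made here for illustration. The answers at each blur level are assumed to have been collected beforehand from some VLM.

```python
from typing import Sequence


def invariance_depth(answers: Sequence[str]) -> int:
    """Count how many consecutive blur levels keep the unblurred answer.

    answers[0] is the answer on the original image; answers[i] is the answer
    after the i-th (stronger) blur step. Hypothetical helper, not the paper's metric.
    """
    if not answers:
        raise ValueError("need at least the unblurred answer")
    depth = 0
    for blurred_answer in answers[1:]:
        if blurred_answer != answers[0]:
            break
        depth += 1
    return depth


def is_visually_ignorant(answers: Sequence[str]) -> bool:
    """Flag an example whose answer survives every blur level (assumed criterion)."""
    return len(answers) > 1 and invariance_depth(answers) == len(answers) - 1


grounded = ["red", "red", "blue", "unknown"]  # answer changes at the second blur step
prior_driven = ["yes", "yes", "yes", "yes"]   # answer never changes

print(invariance_depth(grounded), is_visually_ignorant(grounded))
print(invariance_depth(prior_driven), is_visually_ignorant(prior_driven))
```

An example flagged by `is_visually_ignorant` is one that a model could answer without looking at the image, which is the kind of benchmark item the abstract argues can reward language-prior reliance.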

68. 【2606.06887】ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning

Link: https://arxiv.org/abs/2606.06887

Authors: Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: underlying continuous shape, continuous shape space, latent diffusion model, deformation shape collection, paper introduces ARAPDiffusion

Comments:

Abstract:This paper introduces ARAPDiffusion, a latent diffusion model to learn the underlying continuous shape space of a deformable shape collection. The key innovation is injecting the as-rigid-as-possible (ARAP) deformation model as regularization losses into latent diffusion (LD), relaxing the requirement for abundant 3D training data when learning generative models. In contrast to the standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder and using the shape decoder to develop a regularization loss to improve the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process and an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.

69. 【2606.06885】FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

Link: https://arxiv.org/abs/2606.06885

Authors: Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Human Image Animation, Human Image, Image Animation, significant advancements, primarily driven

Comments: Accepted to IEEE ICASSP 2026

Abstract:Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at this https URL.

70. 【2606.06878】A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Link: https://arxiv.org/abs/2606.06878

Authors: Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: cross-view fusion, cross-view fusion framework, grasp pose estimation, corner views, cross-view

Comments: Corresponding author: Jin Xie

Abstract:In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at this https URL.
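
To make the cylindrical registration step in this abstract more concrete, the sketch below converts a 3D point to cylindrical coordinates and shows why that frame suits rotation-symmetric geometry: rotating a point about the axis changes only its angle. Using the local z-axis as the cylinder axis, and the helper names `to_cylindrical` and `rotate_about_z`, are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def to_cylindrical(point: Point) -> Tuple[float, float, float]:
    """Map (x, y, z) to (radius, angle, height) about the z-axis (assumed axis)."""
    x, y, z = point
    return math.hypot(x, y), math.atan2(y, x), z


def rotate_about_z(point: Point, phi: float) -> Point:
    """Rotate a point by phi radians about the z-axis."""
    x, y, z = point
    return (
        x * math.cos(phi) - y * math.sin(phi),
        x * math.sin(phi) + y * math.cos(phi),
        z,
    )


p = (3.0, 4.0, 1.5)
radius, angle, height = to_cylindrical(p)
r_rot, a_rot, h_rot = to_cylindrical(rotate_about_z(p, math.pi / 6))

# Radius and height are unchanged by the rotation; only the angle shifts.
print(radius, height)
print(math.isclose(radius, r_rot), math.isclose(height, h_rot))
print(math.isclose(a_rot - angle, math.pi / 6))
```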

71. 【2606.06875】Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

Link: https://arxiv.org/abs/2606.06875

Authors: Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Diffusion transformers, equipped with multimodal, dominant paradigm, Diffusion, UVR

Comments: ICML26

Abstract:Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively, while preserving visual quality and fidelity with minimal degradation. Code is available at this https URL.

72. 【2606.06872】EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

Link: https://arxiv.org/abs/2606.06872

Authors: Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Estimating hand-surface contact, Estimating hand-surface, hand-surface contact pressure, robotic imitation, ergonomic analysis

Comments: Accepted to IEEE ICASSP 2026

Abstract:Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at this https URL.

73. 【2606.06867】Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis

Link: https://arxiv.org/abs/2606.06867

Authors: Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You, Carla Pitarch-Abaigar, Spyridon Bakas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern medicine relies, sources spanning radiology, structured clinical information, heterogeneous data sources, data sources spanning

Comments:

Abstract:Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.

74. 【2606.06864】LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification

Link: https://arxiv.org/abs/2606.06864

Authors: Yonghan Shin, Won-Ki Jeong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: dense annotations, Multiple instance learning, enables slide-level prediction, standard paradigm, prediction without dense

Comments:

Abstract:Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.

75. 【2606.06856】FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness

Link: https://arxiv.org/abs/2606.06856

Authors: Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offer exceptional temporal, exceptional temporal resolution, asynchronously reporting pixel-level, reporting pixel-level intensity, Dynamic vision sensors

Comments:

Abstract:Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.

76. 【2606.06853】MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

Link: https://arxiv.org/abs/2606.06853

Authors: Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: extend Vision-Language Models, era has witnessed, witnessed a remarkable, extend Vision-Language, tackling tasks

Comments: Accepted by CVPR 2026

Abstract:Recent years have witnessed remarkable progress in extending Vision-Language Models (VLMs) to video understanding tasks. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

77. 【2606.06850】CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs

Link: https://arxiv.org/abs/2606.06850

Authors: Fuchen Li, Xinyang Wang, Yahui Zhang, Yuhan Chen, Jiahong Guo, Zhuohan Qin, Wenbo Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Blind face restoration, balance image quality, Blind face, speed and memory, balance image

Comments: 12 pages. Code and project page will be released

Abstract:Blind face restoration on consumer devices has to balance image quality against speed and memory. Strong methods such as GFPGAN and CodeFormer give good perceptual quality, but they rely on large pretrained generative priors and on operators such as attention, codebook lookup, and style modulation that are hard to compile and quantize on the small neural processing units (NPUs) used in consumer hardware. Small convolutional restorers run fast enough, but they tend to over-smooth and to leave artifacts around the eyes, nose, and mouth. We present CFRNet, a 2.0M-parameter ResNet-style restorer for on-device use at $256\times256$, the common face-crop size on consumer NPUs. The main idea is Cycle-Consistent Fixed-Point Training (CCFP). Instead of training the network for one pass and then running it several times by hand, we train it to act as a fixed-point operator, so that applying it again to a restored face does not change the face. CCFP uses three training losses, namely progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss, and it adds no cost at inference. To compare fairly under our deployment limits, we retrain all baselines from scratch at the same $256\times256$ resolution. On a 300-image test set, CFRNet reaches the best perceptual score (LPIPS 0.250 at three cycles, which is 31% lower than one cycle) and also the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the same baselines cannot be compiled to that chip. The cycle count $k$ acts as a simple quality knob that needs no retraining: PSNR is best at $k=2$ and LPIPS keeps improving up to $k=3$. We further show that the same idea works with a plain CNN that is even easier to deploy, and we run the model in real time on an in-car driver-monitoring board.
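
The fixed-point idea in this abstract (applying the restorer again should not change a restored face) can be sketched with toy operators. The mean-squared form of `idempotence_loss`, the list-of-floats image type, and the toy operators `clip01` and `halve` are assumptions for illustration; the abstract does not specify the paper's exact loss.

```python
from typing import Callable, List

Image = List[float]  # toy stand-in for a flattened face crop


def apply_cycles(restore: Callable[[Image], Image], x: Image, k: int) -> Image:
    """Apply the restorer k times; k plays the role of the cycle count."""
    for _ in range(k):
        x = restore(x)
    return x


def idempotence_loss(restore: Callable[[Image], Image], x: Image) -> float:
    """Mean squared difference between restore(restore(x)) and restore(x)."""
    once = restore(x)
    twice = restore(once)
    return sum((a - b) ** 2 for a, b in zip(twice, once)) / len(once)


def clip01(x: Image) -> Image:
    """Idempotent toy operator: clipping twice equals clipping once."""
    return [min(1.0, max(0.0, v)) for v in x]


def halve(x: Image) -> Image:
    """Non-idempotent toy operator: every extra pass changes the output again."""
    return [0.5 * v for v in x]


degraded = [-0.5, 0.25, 2.0]
print(idempotence_loss(clip01, degraded))  # zero: clip01 is already a fixed-point operator
print(idempotence_loss(halve, degraded))   # positive: halve keeps changing its own output
print(apply_cycles(halve, degraded, 3))
```

Driving a loss of this kind toward zero during training is one way to make repeated application at inference stable, which is what lets the cycle count be chosen after training.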

78. 【2606.06836】Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

Link: https://arxiv.org/abs/2606.06836

Authors: Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Language-guided UAV agents, execute long-horizon semantic, long-horizon semantic instructions, Language-guided UAV

Comments:

Abstract:Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

79. 【2606.06828】AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

Link: https://arxiv.org/abs/2606.06828

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Group Relative Policy, Relative Policy Optimization, Group Relative, demonstrated remarkable success, Policy Optimization

Comments: Project Website: [this https URL](https://bujiazi.github.io/adagrpo.github.io/)

Abstract:Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots in both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.
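
The contrast this abstract draws between intra-group and global advantages can be illustrated with a small sketch. Standardizing rewards within a prompt group is the usual GRPO advantage; the global term, the blending `weight`, and the function `fused_advantages` are assumptions made here for illustration, since the abstract does not state how AdaGRPO fuses the two levels.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize rewards within one prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def fused_advantages(
    rewards: List[float],
    global_mean: float,
    global_std: float,
    weight: float = 0.5,  # hypothetical blending weight
    eps: float = 1e-8,
) -> List[float]:
    """Blend intra-group advantages with advantages against global reward statistics."""
    local = group_advantages(rewards, eps)
    global_adv = [(r - global_mean) / (global_std + eps) for r in rewards]
    return [(1 - weight) * l + weight * g for l, g in zip(local, global_adv)]


# A strong group: both samples beat the assumed global mean of 0.5.
rewards = [0.9, 0.8]
print(group_advantages(rewards))            # the 0.8 sample is penalized within its group
print(fused_advantages(rewards, 0.5, 0.2))  # with the global term it stays positive
```

The example shows the blind spot the abstract describes: within a uniformly good group, intra-group statistics alone assign a negative advantage to a sample that is still better than the policy's typical output.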

80. 【2606.06819】VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

Link: https://arxiv.org/abs/2606.06819

Authors: Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reasoning Video Object, precise pixel-level localization, Video Object Segmentation, achieve precise pixel-level, Video Object

Comments: ICML2026

Abstract:Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released at this https URL.

81. 【2606.06813】Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

Link: https://arxiv.org/abs/2606.06813

Authors: Dahee Kwon, Haeun Lee, Jaesik Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: flow-based objectives deliver, objectives deliver strong, deliver strong text-image, strong text-image alignment, produce overly similar

Comments: Accepted to ICML 2026. Code is available at: [this https URL](https://github.com/daheekwon/DAVE)

Abstract:Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
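
The intervention described in this abstract, attenuating the zero-frequency spatial average (DC) of intermediate features, can be sketched on a toy feature map. The function `attenuate_dc`, the `keep` factor, and the token-by-channel layout are assumptions for illustration; which layers and timesteps DAVE modifies, and by how much, are not given in the abstract.

```python
from typing import List

FeatureMap = List[List[float]]  # [num_tokens][num_channels], toy layout


def attenuate_dc(features: FeatureMap, keep: float) -> FeatureMap:
    """Scale the per-channel spatial mean (DC) by `keep`; leave the remainder untouched.

    keep=1.0 returns the input values unchanged; keep=0.0 removes the DC part entirely.
    """
    num_tokens = len(features)
    num_channels = len(features[0])
    dc = [sum(tok[c] for tok in features) / num_tokens for c in range(num_channels)]
    return [
        [tok[c] - (1.0 - keep) * dc[c] for c in range(num_channels)]
        for tok in features
    ]


features = [[1.0, 4.0], [3.0, 8.0]]  # per-channel means are 2.0 and 6.0
attenuated = attenuate_dc(features, keep=0.5)
print(attenuated)

# Differences between tokens (the non-DC part) are preserved.
print(attenuated[1][0] - attenuated[0][0], attenuated[1][1] - attenuated[0][1])
```

Because only the spatial mean is scaled, the spatial structure carried by token-to-token differences is left intact, which is consistent with the abstract's claim of a lightweight, training-free change.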

82. 【2606.06760】MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

Link: https://arxiv.org/abs/2606.06760

Authors: Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently achieved remarkable, achieved remarkable progress, large vision-language models, Medical large vision-language, large vision-language

Comments: Accepted at ICML 2026

Abstract:Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further introduce a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, the codebook, and the LLM using our proposed progressive training strategy, which gradually and stably aligns these modules. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.

83. 【2606.06714】Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

Link: https://arxiv.org/abs/2606.06714

Authors: Qian Zhang,Michal Golovanevsky,Fulvio Domini,James Tompkin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Human perception, texture exhibits systematic, psychophysical experiments, emerge reliably, reliably in psychophysical

Comments:

Click to view abstract

Abstract:Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.

84. 【2606.06709】USU-Corn-WeedDB: A UAV RGB Image Dataset for Multi-Species Weed Detection in Forage Corn

Link: https://arxiv.org/abs/2606.06709

Authors: Utsav Bhandari,Saroj Burlakoti,Rhonda Miller,Sierra Young,Eric Westra,Aaron Etienne

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: site-specific weed management, deep learning remain, learning remain constrained, forage corn production, field-representative training datasets

Comments: 8 pages, 4 figures, 1 table

Click to view abstract

Abstract:Weed pressure in forage corn production causes yield losses of up to 31.5%, yet site-specific weed management (SSWM) systems built on UAV imagery and deep learning remain constrained by the scarcity of field-representative training datasets. We present USU-Corn-WeedDB, a publicly available UAV RGB image dataset collected from a commercial forage corn field in Cache Valley, Utah, designed to support multi-class weed detection under both supervised and semi-supervised learning frameworks. RGB imagery was acquired on 27 June 2025 using an Autel EVO II Dual 640T V2 drone at approximately 10 m above ground level, yielding a ground sampling distance of approximately 0.48 cm/pixel. A total of 366 full-resolution images were tiled into 8,800 patches at 640 x 640-pixel resolution. Of these, 800 tiles were manually annotated for three weed species, common lambsquarters (Chenopodium album), redroot pigweed (Amaranthus retroflexus), and green foxtail (Setaria viridis), comprising 10,539 bounding-box instances, with the remaining 8,000 tiles retained as an unlabeled pool for semi-supervised experiments. This dataset reflects a natural class imbalance in which redroot pigweed constitutes 53.86% of annotated instances, which was preserved intentionally to mirror real field conditions. To validate dataset utility, we trained 28 object detection models spanning five architecture families including YOLOv8, YOLOv9, YOLOv10, YOLO11, YOLO26, and RT-DETR under identical conditions without hyperparameter tuning. Test set mAP@0.5 ranged from 0.773 to 0.840, with lightweight models achieving competitive performance relevant to edge-deployed UAV systems. USU-Corn-WeedDB is publicly available at this https URL.
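
The dataset figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The derived quantities below (tile footprint, unlabeled tile count, approximate pigweed instance count, mean instances per annotated tile) are computed here purely for illustration and are not reported in the abstract itself.

```python
# Figures quoted in the abstract.
gsd_cm_per_px = 0.48
tile_px = 640
total_tiles = 8_800
annotated_tiles = 800
instances = 10_539
pigweed_share = 0.5386

# Derived quantities (not stated in the abstract; simple arithmetic only).
tile_side_m = gsd_cm_per_px * tile_px / 100            # ground footprint of one tile side
unlabeled_tiles = total_tiles - annotated_tiles         # pool left for semi-supervised use
pigweed_instances = round(instances * pigweed_share)    # approximate count of the majority class
instances_per_annotated_tile = instances / annotated_tiles
```

Under these numbers each 640-pixel tile covers roughly 3 m on a side, and the unlabeled pool matches the 8,000 tiles stated in the abstract.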

85. 【2606.06696】MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

Link: https://arxiv.org/abs/2606.06696

Authors: Ryan D'Cunha,Alejandro Lozano,Xiaoxiao Sun,Daniel Vela Jarquin,Min Woo Sun,Josiah Aklilu,James Burgess,Yuhui Zhang,Ryan Nayebi,Paola Avila,Robayo,Jin Ye,Ming Hu,Zhongying Deng,Junjun He,Xin Chen,Yue Yao,Robert Tibshirani,Jeffrey J. Nirschl,Serena Yeung-Levy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: hold immense promise, profiling cellular features, chest X-rays, X-rays to profiling, hold immense

Comments:

Click to view abstract

Abstract:Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

86. 【2606.06695】S23DR 2026 Winning Solution

Link: https://arxiv.org/abs/2606.06695

Authors: Jan Skvrna,Miroslav Purkrabek,Lukas Neumann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenge for structured, fitted depth, wireframe reconstruction, sparse SfM, semantic segmentations

Comments:

Click to view abstract

Abstract:This text presents the winning solution to the S23DR 2026 challenge for structured 3D wireframe reconstruction from sparse SfM, fitted depth, and semantic segmentations. The method treats vertices as a conditional set and denoises 64 vertex tokens with a flow-matching DiT conditioned on Perceiver-style scene tokens. A global pass predicts the coarse structure, a hull-cropped second pass refines it, and a small multi-sample consensus step keeps the stochastic sampler well behaved. The final system ranked first on the private leaderboard, achieving HSS = 0.654.

87. 【2606.06690】RPC-GS: Gaussian Splatting with native RPC Rendering for Satellite Imagery

Link: https://arxiv.org/abs/2606.06690

Authors: Valentin Wagner,Sebastian Bullinger,Christoph Bodensteiner,Michael Arens

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Rational Polynomial Camera, Rational Polynomial, natively with Rational, RPC model, Gaussian Splatting

Comments:

Click to view abstract

Abstract:We present RPC-GS, the first Gaussian Splatting framework for satellite imagery that operates natively with Rational Polynomial Camera (RPC) models. The RPC model is the de facto standard for representing the complex imaging geometry of modern pushbroom satellite sensors. To simplify rendering, prior satellite Gaussian Splatting methods replace the RPC model with perspective or affine camera approximations, leading to geometric errors during reconstruction. RPC-GS avoids these approximations by projecting Gaussian means and covariances directly through the RPC model during the splatting process. We embed the RPC model in a chain of carefully selected geo-coordinate transformations representing a mapping from splatting-suitable scene coordinates to image coordinates. To map the Gaussian covariance matrices, we derive a numerically robust Jacobian-based covariance projection for the (partially nonlinear) coordinate transformations. Since RPCs lack an explicit notion of camera depth, we integrate a metric ray-based depth formulation. We benchmark RPC, perspective, and affine camera models in a unified framework, with our native RPC renderer consistently achieving the lowest reconstruction error on leading satellite benchmark datasets, improving mean altitude error over perspective and affine approximations by 29.6% and 63.8% on DFC2019, and by 9.9% and 37.9% on IARPA2016. We release our code to support future research of Gaussian Splatting in the satellite imaging domain.
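
For readers unfamiliar with the notation, a generic form of the RPC projection and of first-order (Jacobian-based) covariance propagation is sketched below. The symbols are chosen here for illustration. The paper's exact chain of geo-coordinate transformations and its numerically robust formulation are not given in the abstract.

```latex
% Generic RPC projection: normalized image coordinates (r, c) as ratios of
% cubic polynomials P_1..P_4 in normalized ground coordinates (\phi, \lambda, h).
r = \frac{P_1(\phi, \lambda, h)}{P_2(\phi, \lambda, h)}, \qquad
c = \frac{P_3(\phi, \lambda, h)}{P_4(\phi, \lambda, h)}

% First-order propagation of a 3D Gaussian (\mu, \Sigma) through a differentiable
% map f from scene to image coordinates, with J the Jacobian of f evaluated at \mu.
\mu' = f(\mu), \qquad \Sigma' = J \, \Sigma \, J^{\top}, \qquad
J = \left. \frac{\partial f}{\partial x} \right|_{x = \mu}
```

Because the RPC map is a ratio of polynomials, $J$ depends on where it is evaluated, which is presumably why a robust derivation is needed for the nonlinear parts of the chain.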

88. 【2606.06685】RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

Link: https://arxiv.org/abs/2606.06685

Authors: Shichong Peng,Yanshu Zhang,Ke Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: neural point reconstructions, point reconstructions capture, posed images, high fidelity, fidelity from posed

Comments: An overview video is available at [this https URL](https://youtu.be/up3BwRHYWG8)

Click to view abstract

Abstract:Static neural point reconstructions capture a subject at high fidelity from posed images. Given such a reconstruction, we aim to animate it to follow a monocular fixed-viewpoint driving video of the subject, whether captured or produced by image-to-video (I2V) generation, and to recover a rigged, re-posable 3D asset. Existing methods deform Gaussian splats through direct linear blend skinning (LBS) or mesh proxies, both of which are prone to joint-boundary artifacts under articulation, even with per-primitive corrections. We trace the artifact to the representation: each splat carries an individual shape calibrated in the canonical pose to tile with its neighbours. Under rigid LBS, each splat moves with its bone but cannot bend, so the canonical tiling breaks at joint boundaries into gaps and spikes. Proximity attention point rendering (PAPR) instead carries no per-primitive shape; each pixel is recomposed at render time from the deformed primitives' positions, so the surface re-forms naturally with the articulation. We present RigPAPR, which auto-rigs a static PAPR cloud and drives it under direct LBS from a single fixed-viewpoint video, without mesh proxy, pose-dependent correction, or category template. On synthetic subjects, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+dB PSNR, with cleaner joint-boundary renderings of both synthetic and real subjects.

89. 【2606.06684】Adaptive Band Selection for Hyperspectral Classification with Spatially Disjoint Evaluation

Link: https://arxiv.org/abs/2606.06684

Authors: Ikram El-Hajri(1),Ouassim Karrakchou(1),Alejandro Mousist(2) ((1) International University of Rabat, Rabat, Morocco, (2) Thales Alenia Space, Spain)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: final discrete subset, counts limit flexibility, Hyperspectral band selection, prescribed band counts, band counts limit

Comments: 6 pages, 2 figures, 3 tables

Click to view abstract

Abstract:Hyperspectral band selection methods based on differentiable selectors can be sensitive to initialization and to extracting a final discrete subset, while prescribed band counts limit flexibility. We propose SGBR-HC (Spectral-Group Band Ranking with Hard-Concrete initialization), a two-stage method that uses a supervised spectral ranking to initialize trainable sparse gates rather than treating ranking as a fixed selection rule, letting the number of selected bands be determined by training. Stage-1 scores candidate bands from training pixels by class discriminability and spectral diversity; this ranking seeds the gate logits for Stage-2, which trains the sparse gates jointly with a spatial classifier. Under spatially disjoint evaluation on Pavia University and Houston 2013, verified by retraining a fresh classifier on the selected bands, SGBR-HC achieves the highest mean overall accuracy and Cohen's kappa with approximately twenty bands. Bypassing Stage-1 degrades OA by 8.84 pp on Pavia University and 22.15 pp on Houston 2013, confirming the ranking prior's role. Random pixel splits inflate OA on Pavia University by 30.56 pp, underscoring spatial leakage as a critical evaluation confound.

90. 【2606.06671】JA-SIREN: Deterministic Initialization for Sinusoidal Networks via Spectral Matching

Link: https://arxiv.org/abs/2606.06671

Authors: Mohammed Alsakabi,Kejia Hu,John M. Dolan,Ozan K. Tonguz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing implicit neural, implicit neural representation, Existing implicit, approaches suffer, performance across runs

Comments:

Click to view abstract

Abstract:Existing implicit neural representation (INR) approaches suffer from stochastic initialization that does not guarantee consistent or high-quality performance across runs, with variations reaching more than 2.5 dB (78%) in image regression. This variation is problematic for scientific computing and simulation, where result reproducibility is crucial. To address this problem, we present Jacobi-Anger Sinusoidal Representation Network (JA-SIREN), a deterministic initialization scheme for sinusoidal networks grounded in classical spectral analysis. By computing the Discrete Sine Transform (DST) of the target signal and leveraging the Jacobi-Anger expansion, we derive closed-form weights for a two-layer sinusoidal MLP that analytically match the network's initial spectral response to the target signal, requiring no random seed or additional hyperparameter tuning. On the Kodak dataset, JA-SIREN achieves a mean PSNR of 67.18 dB, a 21.30 dB improvement over the best baseline. This is achieved with zero run-to-run variance, confirming that spectrally-informed initialization is a more effective and reproducible alternative to stochastic initialization for sinusoidal INRs.
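
For context, the classical Jacobi-Anger expansion that the method's name refers to is stated below, together with its real sine form. It shows why a sinusoid applied to a sinusoid decomposes into harmonics weighted by Bessel functions $J_n$. How JA-SIREN maps DST coefficients onto these terms to obtain closed-form weights is not described in the abstract, so nothing here should be read as the paper's derivation.

```latex
% Jacobi-Anger expansion (complex form), with J_n the Bessel function of the first kind.
e^{\, i z \sin\theta} = \sum_{n=-\infty}^{\infty} J_n(z) \, e^{\, i n \theta}

% Imaginary part: a sine of a sine is a sum of odd harmonics.
\sin(z \sin\theta) = 2 \sum_{n=1}^{\infty} J_{2n-1}(z) \, \sin\big((2n-1)\theta\big)
```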

91. 【2606.06666】Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

Link: https://arxiv.org/abs/2606.06666

Authors: Ritesh Sharma,Mohammad Ghasemigol,Yuichi Motai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deepfake detection systems, Deepfake detection, demands reliable prediction, detection systems achieve, systems achieve near-perfect

Comments:

Click to view abstract

Abstract:Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources -- epistemic, aleatoric, calibration, conformal, and distributional -- by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20--45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5--6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41--0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.
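
The core idea of fusing uncertainty sources with simplex-constrained weights chosen to maximize Pearson correlation with prediction errors can be sketched as follows. This is a toy illustration under stated assumptions: a coarse grid search stands in for the paper's constrained optimizer (which the abstract does not detail), only three made-up sources are used instead of five, and all names and numbers are hypothetical.

```python
from itertools import product
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def fuse(weights, sources):
    """Weighted sum of per-sample uncertainty scores from each source."""
    return [sum(w * s[i] for w, s in zip(weights, sources))
            for i in range(len(sources[0]))]

def best_simplex_weights(sources, errors, steps=10):
    """Grid search over weights with w >= 0 and sum(w) == 1 (stand-in optimizer)."""
    k = len(sources)
    best_w, best_r = None, float("-inf")
    for combo in product(range(steps + 1), repeat=k - 1):
        if sum(combo) > steps:
            continue  # would force the last weight negative
        w = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        r = pearson(fuse(w, sources), errors)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r

# Made-up per-sample uncertainty scores and 0/1 prediction errors.
errors = [0, 1, 0, 1, 1, 0]
sources = [
    [0.1, 0.9, 0.2, 0.8, 0.7, 0.3],
    [0.5, 0.4, 0.6, 0.5, 0.3, 0.7],
    [0.2, 0.6, 0.1, 0.9, 0.8, 0.2],
]
weights, r = best_simplex_weights(sources, errors)
```

Because the simplex vertices are part of the grid, the fused correlation `r` is never worse than that of the best single source on the data used for fitting.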

92. 【2606.06664】Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Link: https://arxiv.org/abs/2606.06664

Authors: Tang Li,Yanlin Chen,Mengmeng Ma,Xi Peng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Vision Transformer, spurious cues, safe deployment, driven by spurious, concept

Comments: In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

Click to view abstract

Abstract:Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: this https URL.

93. 【2606.06631】From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

Link: https://arxiv.org/abs/2606.06631

Authors: Jessy Lauer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: govern implant longevity, forces govern implant, contact forces govern, cartilage health, shaping who develops

Comments:

Click to view abstract

Abstract:Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.

94. 【2606.06627】What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

Link: https://arxiv.org/abs/2606.06627

Authors: Richard Li,Aditya Prakash,Andrew Wen,Saurabh Gupta,Yilun Du,Pulkit Agrawal

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: policies largely consist, resemble robot behavior, manipulation policies largely, specialized hardware, Human video datasets

Comments: The project website is here: [this https URL](https://richardrl.github.io/what-matters-cotraining-human-videos/index.html)

Click to view abstract

Abstract:Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.

95. 【2606.06601】Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Link: https://arxiv.org/abs/2606.06601

Authors: Jingbo Gong,Yikai Wang,Yushi Lan,Yuhao Wan,Ziheng Ouyang,Rui Zhao,Ming-Ming Cheng,Qibin Hou,Chen Change Loy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Object insertion aims, aims to seamlessly, seamlessly composite, Object, reference object

Comments: ICML 2026; Project Page: [this https URL](https://gong1130.github.io/DIRECT/)

Click to view abstract

Abstract:Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

96. 【2606.06539】Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

Link: https://arxiv.org/abs/2606.06539

Authors: Yucheng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: layer-local goodness updates, strictly layer-local goodness, replaces backpropagation, backpropagation with strictly, Recent FF-CNN work

Comments: 23 pages, 6 figures

Click to view abstract

Abstract:Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

Submission history: Yucheng Chen, [v1] Thu, 4 Jun 2026 04:01:01 UTC (284 KB)

97. 【2606.06538】WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Link: https://arxiv.org/abs/2606.06538

Authors: Yida Yin,Harish Krishnakumar,Chung Peng Lee,Boya Zeng,Wenhao Chai,Shengbang Tong,Wenhu Chen,Hu Xu,Xingyu Fu,Gabriel Sarch,Aleksandra Korolova,Zhuang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world applications, Multimodal Large Language, Large Language Models, visual, Large Language

Comments: Project page: [this https URL](https://worldbench-vl.github.io/)

Click to view abstract

Abstract:In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

98. 【2606.06536】Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

Link: https://arxiv.org/abs/2606.06536

Authors: Malak Allam,Khaled Shaban,Ali Hamdi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, large scale variation, Unmanned Aerial, remains challenging due

Comments:

Click to view abstract

Abstract:Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10 percent mAP at 0.5, 96.40 percent precision, and 93.80 percent recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP at 0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.

99. 【2606.06532】GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

Link: https://arxiv.org/abs/2606.06532

Authors: Haozhe Chi,Yang Jin,Yadong Mu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: agentic long video, motion comprehension coupled, long video understanding, video understanding, Groups of Pictures

Comments:

Click to view abstract

Abstract:Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates the video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from the video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with the video codec and enhances the model's ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and EgoSchema, thereby demonstrating the superiority of our proposed framework.

100. 【2606.06520】Applying Deep Learning for cockpit segmentation in the context of mixed reality

Link: https://arxiv.org/abs/2606.06520

Authors: Alexandre Leles Sousa,Pedro de Oliveira Nielson,Erick Oliveira Rodrigues,Rafael Francisco dos Santos,Giovani Bernardes Vitor

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Computer vision, growing continuously, Computer, perform image segmentation, real images

Comments: XXV Congresso Brasileiro de Automática - CBA 2024

Click to view abstract

Abstract:Computer vision is an area that has been growing continuously. With the advance of first-person-view technologies, new development opportunities have emerged in the area. Mixed reality provides virtual environments in which objects from the physical world are shown in real time. This makes the user's immersion in the simulated environment a central concern, with the aim of bringing it ever closer to the desired reality. This paper proposes an image processing approach that segments images into foreground and background in order to facilitate the union of virtual and real images. To this end, the present work captures real images of the user operating the CAT793F off-highway truck simulator with a camera and segments them with artificial intelligence. The convolutional neural network architectures "U-net" and "DeepLabV3+" are applied to perform image segmentation. As a result, metrics of around 90% accuracy were obtained and the best model was determined.

101. 【2606.06505】A Geometric Gaussian Mixture Representation of Plane Curves

Link: https://arxiv.org/abs/2606.06505

Authors: Ali Darijani,Benedikt Stratmann,Jürgen Beyerer

Categories: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)

Keywords: defined probabilistic polygonal, user defined probabilistic, normal direction, probabilistic polygonal representation, user defined

Comments:

Abstract:We introduce a user-defined probabilistic polygonal representation for plane curves. Given a curve, we select vertices on the curve and connect consecutive vertices by line segments to obtain a polygonal approximation. Each segment is equipped with a user-defined uncertainty parameter in the normal direction. This yields a collection of thin probabilistic geometric primitives that retain the geometry of the underlying curve while extending it beyond the idealized deterministic one-dimensional formulation. For each segment, we define a random variable that is uniformly distributed in the tangent direction of the segment and Gaussian distributed in the normal direction of the segment. By matching the first and the second central moments, this construction induces a Gaussian component whose mean lies at the segment midpoint and whose covariance encodes both tangential and normal uncertainty. Combining the segment-wise components with appropriate weights yields a Gaussian Mixture Model (GMM) representation of the user-defined probabilistic polygonal representation of the plane curve. The proposed framework provides an analytically tractable probabilistic model that preserves local geometry and uncertainty in the normal direction. It applies to smooth, closed, open, non-regular, and self-intersecting plane curves, allows adaptive discretization and varying uncertainty in the normal direction, and as a result supports uncertainty-aware geometric modeling. Experiments on a collection of canonical plane curves show that the resulting GMMs capture local tangent, local normal, and local arc length, so that the global shape of the underlying curves is faithfully captured as well. The representation is particularly relevant for applications in uncertainty-aware CAD and digital twins, probabilistic obstacle modeling in robotics, and probabilistic trajectory planning.
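
The moment-matching step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a uniform distribution over a segment of length L has variance L²/12 along the tangent, and the normal direction uses the user-chosen standard deviation. The function names and the choice of weights proportional to segment length are assumptions, since the abstract only says "appropriate weights".

```python
import numpy as np

def segment_gaussian(p0, p1, sigma_n):
    """Moment-matched Gaussian for one segment (hypothetical helper).

    Tangent direction: uniform over the segment, variance L**2 / 12.
    Normal direction: Gaussian with standard deviation sigma_n.
    """
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    d = p1 - p0
    length = np.linalg.norm(d)
    t = d / length                      # unit tangent
    n = np.array([-t[1], t[0]])         # unit normal (tangent rotated by 90 degrees)
    mean = 0.5 * (p0 + p1)              # segment midpoint
    cov = (length ** 2 / 12.0) * np.outer(t, t) + sigma_n ** 2 * np.outer(n, n)
    return mean, cov, length

def polyline_to_gmm(vertices, sigma_n):
    """Build GMM parameters from consecutive vertices of a polygonal approximation."""
    vertices = np.asarray(vertices, dtype=float)
    parts = [segment_gaussian(a, b, sigma_n)
             for a, b in zip(vertices[:-1], vertices[1:])]
    means = np.array([m for m, _, _ in parts])
    covs = np.array([c for _, c, _ in parts])
    lengths = np.array([l for _, _, l in parts])
    weights = lengths / lengths.sum()   # assumption: weight proportional to length
    return weights, means, covs
```

For a closed curve, repeat the first vertex at the end of `vertices`. A per-segment `sigma_n` could be passed instead of a single value to model the varying normal uncertainty the abstract mentions.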

102. 【2606.06498】Semantic-Structural Alignment for Generative Pictorial Charts

Link: https://arxiv.org/abs/2606.06498

Authors: Zhida Sun,Yulin Zhang,Zheng Gu,Min Lu,Bongshin Lee,Daniel Cohen-Or,Hui Huang

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional statistical graphics, pictorial charts, graphics are precise, abstract statistical chart, statistical graphics

Comments: 11 pages, 17 figures, Accepted to ACM TOG

Abstract:Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart's global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling. Project page: this https URL.

103. 【2606.06497】Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

Link: https://arxiv.org/abs/2606.06497

Authors: Adam Cole,Mick Grierson

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: remarkable visual fidelity, achieved remarkable visual, prompt-only interface offers, interface offers thin, offers thin creative

Comments: In review. 5 pages, 4 figures

Abstract:Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.

104. 【2606.07381】Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Link: https://arxiv.org/abs/2606.07381

Authors: Prabhjot Kaur,Hakim Ouaalam,Sedat Kandemirli,Sanjay P. Prabhu,Simon K. Warfield

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Background and Purpose, focal cortical dysplasia, requires large volumes, automated FCD detection, voxelwise lesion-delineated MRI

Comments:

Abstract:Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from three sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.

Related DOI: https://doi.org/10.1111/jon.70137

105. 【2606.07374】Beyond Backscatter: InSAR coherence from detected SAR images

Link: https://arxiv.org/abs/2606.07374

Authors: Francescopaolo Sica,Andrea Pulella,Michael Schmitt

Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

Keywords: detected SAR images, deep learning framework, SAR images, accurate coregistration, coherence regression directly

Comments: 27 pages, 20 figures

Abstract:In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across different temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as ground range detected data, e.g., distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.
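
For context, the coherence maps used as training targets are conventionally computed from a pair of coregistered complex SLC images with the sample coherence estimator, the magnitude of the windowed cross-correlation normalized by the windowed powers. The sketch below illustrates that standard estimator only. It is not the paper's network, and the 5x5 boxcar window and the function name are assumptions, since the abstract does not state how the reference maps were estimated.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def coherence_map(s1, s2, win=5, eps=1e-12):
    """Sample coherence magnitude of two coregistered complex images.

    gamma = |sum(s1 * conj(s2))| / sqrt(sum(|s1|^2) * sum(|s2|^2)),
    with sums taken over a sliding win x win window (window size is an assumption).
    Output shape is (H - win + 1, W - win + 1) because no padding is applied.
    """
    def box_sum(x):
        # Sum every win x win neighbourhood of x.
        return sliding_window_view(x, (win, win)).sum(axis=(-2, -1))

    cross = box_sum(s1 * np.conj(s2))
    power1 = box_sum(np.abs(s1) ** 2)
    power2 = box_sum(np.abs(s2) ** 2)
    return np.abs(cross) / (np.sqrt(power1 * power2) + eps)
```

This estimator needs the complex phase of both acquisitions, which is why it requires accurate coregistration of SLC data. The method in the abstract instead regresses coherence from detected (amplitude-only) images.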

106. 【2606.07063】Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

Link: https://arxiv.org/abs/2606.07063

Authors: Sonalika Singh,Jyotirindra Dandapat,Avishi Razdan,Kshipra V. Moghe,Puneet Gupta,Lalan Kumar

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dynamic Facial Expression, key enabling technology, Facial Expression Recognition, Dynamic Facial, human-computer interaction

Comments:

Abstract:Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group and a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.

107. 【2606.07016】An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

Link: https://arxiv.org/abs/2606.07016

Authors: Parvez Anowar

Categories: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vulnerable road users, traffic deaths globally, urban traffic deaths, Vulnerable road, road users

Comments: 17 pages, 5 figures, 2 tables. Preprint

Abstract:Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.

108. 【2606.06983】DaX: Learning General Pathology Representations Across Scales

Link: https://arxiv.org/abs/2606.06983

Authors: Bokai Zhao,Yiyang Zhang,Long Bai,Tai Ma,Hanqing Chao,Minfeng Xu

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse clinical endpoints, scanner type, Computational pathology requires, transfer across diverse, endpoints and remain

Comments:

Abstract:Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: this https URL.

109. 【2606.06847】Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

Link: https://arxiv.org/abs/2606.06847

Authors: Yifei Yin,Xiaogang Yu,Hao Shi,Liang Chen,Wei Li

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic aperture radar, all-weather observation capability, Synthetic aperture, SAR target interpretation, target interpretation owing

Comments:

Abstract:Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically existing components with weak scattering responses are often missed, resulting in an incompletely reconstructed topology. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically existing components. The keypoints are further organized into a stable semantic scattering structure. Building upon this, we propose S3U-SAR, a physics-driven framework to localize semantic scattering keypoints and construct the complete representation constrained by multi-dimensional physical priors, namely scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.

110. 【2606.06725】Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

Link: https://arxiv.org/abs/2606.06725

Authors: Clara Rodrigo González,Matthieu Toulemonde,Lasha Gvinianidze,Cameron A. B. Smith,Oscar Bates,Roxy Senior,Fu Siong Ng,Meng-Xing Tang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: bedside non-ionizing alternative, nuclear imaging modalities, offers a bedside, bedside non-ionizing, non-ionizing alternative

Comments: 15 pages, 4 figures, 5 tables, journal

Abstract:Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.
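
The extrapolation idea in this abstract can be illustrated with a simple power-law fit: train each candidate network on growing subsets, fit loss against subset size, and predict the loss at the full dataset size. This is a sketch under assumptions, not the authors' procedure. It assumes a pure power law, loss ≈ a · n^(−b), with no irreducible-loss term, and all function names are hypothetical.

```python
import numpy as np

def fit_power_law(n_samples, losses):
    """Fit loss ~ a * n**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n_samples), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, b, n):
    """Extrapolate the fitted law to a dataset of size n."""
    return a * float(n) ** (-b)

def pick_network(curves, n_full):
    """Choose the candidate with the lowest extrapolated loss.

    curves maps a network name to (subset_sizes, measured_losses).
    """
    preds = {}
    for name, (sizes, losses) in curves.items():
        a, b = fit_power_law(sizes, losses)
        preds[name] = predict_loss(a, b, n_full)
    best = min(preds, key=preds.get)
    return best, preds
```

In log-log space the law is a straight line, so the fitted slope corresponds to the "gradient of the scaling law" that the abstract reports transferring from CAMUS to the CEUS dataset.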

111. 【2606.06540】ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

Link: https://arxiv.org/abs/2606.06540

Authors: Tu Vo,Chan Y. Park

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep Unrolling Network, single-image defocus deblurring, Error-Aware Deep Unrolling, Augmented Lagrangian unrolling, Unrolling Network

Comments:

Abstract:We introduce ErA (Error-Aware Deep Unrolling Network), an end-to-end framework for single-image defocus deblurring. ErA jointly learns a compact kernel basis and per-pixel weights, while an error-aware term in Augmented Lagrangian unrolling corrects kernel estimation errors via alternating updates and ResUNet denoisers. It achieves state-of-the-art PSNR/SSIM on DPDD, RealDOF, and RTF, and shows strong generalization on CUHK without ground truth.

112. 【2606.06537】DSU-Net: An Attention-Enhanced Dense Skip U-Net for Breast Lesion Segmentation in Mammographic Images

Link: https://arxiv.org/abs/2606.06537

Authors: Reza Bozorgpour,Mohammadreza Soltany Sadrabadi

Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: making early detection, early detection essential, breast lesion segmentation, women worldwide, making early

Comments:

Abstract:Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, making early detection essential for effective treatment. Mammography is the primary screening modality; however, accurate delineation of suspicious lesions remains challenging and subject to inter-observer variability. Automated segmentation methods can assist radiologists by providing consistent and efficient lesion localization. This study presents DSU-Net, an attention-enhanced Dense Skip U-Net architecture for automated breast lesion segmentation in mammographic images. The proposed framework integrates dense skip connections and attention mechanisms to improve feature propagation, preserve spatial information, and enhance lesion boundary delineation. Experiments were conducted using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). To address severe foreground-background imbalance, a composite loss function combining Dice loss, focal loss, and binary cross-entropy loss was employed during training. The proposed model achieved a Dice Similarity Coefficient of 0.9421, an Intersection over Union of 0.8905, an accuracy of 0.9711, and an AUC-ROC of 0.9878 on the validation dataset. Qualitative evaluation demonstrated accurate delineation of lesions with varying sizes and morphologies, while quantitative results confirmed robust discrimination between lesion and background regions. These findings demonstrate that DSU-Net provides accurate and reliable breast lesion segmentation in mammographic images and highlight the potential of attention-guided deep learning for computer-aided breast cancer screening and diagnosis.
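
A composite loss of the kind this abstract describes can be sketched in NumPy as a weighted sum of Dice, focal, and binary cross-entropy terms. This is an illustration only: the equal weights, the focal exponent gamma = 2, the Dice smoothing constant, and the function names are all assumptions, because the abstract does not give the paper's settings.

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and binary labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss: cross-entropy down-weighted on well-classified pixels."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)   # probability assigned to the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def dice_loss(p, y, smooth=1.0):
    """Soft Dice loss: 1 minus the smoothed Dice overlap."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(p) + np.sum(y) + smooth))

def composite_loss(p, y, w_dice=1.0, w_focal=1.0, w_bce=1.0):
    """Weighted sum of the three terms (weights are illustrative assumptions)."""
    return (w_dice * dice_loss(p, y)
            + w_focal * focal_loss(p, y)
            + w_bce * bce_loss(p, y))
```

The Dice term depends on region overlap and not on the pixel count, which is one common motivation for combining it with per-pixel losses when lesions occupy a small fraction of the image.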

113. 【2606.06524】Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

Link: https://arxiv.org/abs/2606.06524

Authors: Tewodros Syum Gebre,Jagrati Talreja,Leila Hashemi-Beni

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: limited ground observations, mapping remains challenging, remains challenging due, heterogeneous terrain conditions, scalable flood mapping

Comments: This paper has been accepted for publication in the Proceedings of the IEEE Radar Conference (RadarConf 2026). The final authenticated version will be available through IEEE Xplore

Abstract:Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.