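This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.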

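Statistics

622 papers were updated today, including:

Natural language processing: 85 papers
Information retrieval: 10 papers
Computer vision: 112 papers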

Natural Language Processing

1. 【2606.18246】Variable-Width Transformers

Link: https://arxiv.org/abs/2606.18246

Authors: Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

Categories: Computation and Language (cs.CL)

Keywords: driven significant progress, driven significant, significant progress, progress in transformer-based, Scaling model size

Abstract:Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.

2. 【2606.18237】ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Link: https://arxiv.org/abs/2606.18237

Authors: Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Reproducing research results, Reproducing research, scientific progress, central to scientific, Reproducing

Abstract:Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at this https URL.

3. 【2606.18222】Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

Link: https://arxiv.org/abs/2606.18222

Authors: Joy Bose

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: core Jain texts, spanning classical Hindu, Bhagavad Gita, openly licensed translations, text records spanning

Comments: 12 pages, 1 figure. Open Source Code available at [this https URL](https://github.com/joyboseroy/darshana-graph) and dataset at [this https URL](https://huggingface.co/datasets/joyboseroy/darshana-graph)

Abstract:We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.

4. 【2606.18216】Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Link: https://arxiv.org/abs/2606.18216

Authors: Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

Categories: Computation and Language (cs.CL)

Keywords: Knowledge distillation transfers, larger teacher concentrates, teacher sharpest modes, Knowledge distillation, small-student regime

Comments: Project page: [this https URL](https://byungkwanlee.github.io/ZPPO-page/)

Abstract:Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
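
The replay-buffer behaviour described in this abstract (recirculate a hard question until it graduates or is FIFO-evicted) can be pictured with a small sketch. This is not the authors' implementation: the class name `PromptReplayBuffer`, its methods, and the choice to treat "reaches half" as `mean_accuracy >= 0.5` are all assumptions made for illustration.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Hypothetical finite-capacity FIFO buffer of hard question ids."""

    def __init__(self, capacity: int, graduate_at: float = 0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at  # assumed reading of "reaches half"
        self._items = OrderedDict()  # question id -> latest mean rollout accuracy

    def add(self, qid: str, mean_accuracy: float = 0.0) -> None:
        """Insert a hard question, evicting the oldest one when full."""
        if qid in self._items:
            self._items[qid] = mean_accuracy  # keeps its original queue position
            return
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # FIFO eviction
        self._items[qid] = mean_accuracy

    def update(self, qid: str, mean_accuracy: float) -> bool:
        """Record a new accuracy; return True if the question graduated."""
        if qid not in self._items:
            return False
        if mean_accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        self._items[qid] = mean_accuracy
        return False

    def pending(self) -> list:
        """Question ids still being recirculated, oldest first."""
        return list(self._items)


buf = PromptReplayBuffer(capacity=2)
for qid in ("q1", "q2", "q3"):
    buf.add(qid)  # adding q3 evicts q1, the oldest entry
graduated = buf.update("q2", 0.5)  # q2 reaches the threshold and leaves
```

An `OrderedDict` is used here only because it gives insertion-ordered eviction and keyed removal in one structure; the paper may organise its buffer differently.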

5. 【2606.18208】Looped World Models

Link: https://arxiv.org/abs/2606.18208

Authors: Hongyuan Adam Lu, Z.L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current world models, faithful long-horizon simulation, Current world, demands deep computation, world models face

Comments: Technical Report

Abstract:Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.

6. 【2606.18205】Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

Link: https://arxiv.org/abs/2606.18205

Authors: Diaa Fayed, Laurent Romary

Categories: Computation and Language (cs.CL)

Keywords: Al-Mawrid Arabic-English dictionary, Text Encoding Initiative, Encoding Initiative TEI, Lexical Markup Framework, ISO Lexical Markup

Comments: 44 pages, 58 figures, 12 tables. Submitted to Language Resources and Evaluation, under review since Aug 2025, round 3

Abstract:This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.

7. 【2606.18203】RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Link: https://arxiv.org/abs/2606.18203

Authors: Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: alleviate global disparities, LLM-empowered personal health, personal health agents, metrics have offered, health agents

Abstract:The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

8. 【2606.18195】Learning from the Self-future: On-policy Self-distillation for dLLMs

Link: https://arxiv.org/abs/2606.18195

Authors: Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

Categories: Computation and Language (cs.CL)

Keywords: post-training large language, On-policy self-distillation, diffusion LLMs, large language models, remains unexplored

Comments: Preprint

Abstract:On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps by RLVR and opening a promising pathway for dLLM post-training. The code is available at this https URL.

9. 【2606.18193】A Red-Team Study of Anthropic Fable 5 Opus 4.8 Models

Link: https://arxiv.org/abs/2606.18193

Authors: Nicola Franco

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: developed by Anthropic, large language models, ten-category harm taxonomy, frontier large language, automated jailbreak attack

Comments: White paper

Abstract:We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.

10. 【2606.18158】The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

Link: https://arxiv.org/abs/2606.18158

Authors: Michèle Finck

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, legal-AI evaluations measure, current legal-AI evaluations, doctrinal legal reasoning, perform doctrinal legal

Abstract:Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.

11. 【2606.18142】Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Link: https://arxiv.org/abs/2606.18142

Authors: Jasmine Brazilek, Oliver Tulio, Joel Christoph, Miles Tidmarsh, Carol Kline, Arturs Kanepajs

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Travel Agent Compassion, planning menus, advisors to actors, moving from advisors, running procurement

Abstract:AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

12. 【2606.18124】Unintended Effects of Geographic Conditioning in Large Language Models

Link: https://arxiv.org/abs/2606.18124

Authors: Naz Col, David M. Chan

Categories: Computation and Language (cs.CL)

Keywords: remain poorly understood, systems frequently rely, unintended regional biases, regional biases introduced, hidden context remain

Comments: To appear at the Second Workshop on Customizable NLP (CustomNLP4U) at ACL 2026

Abstract:Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate location leakage: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended QA prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder "Unknown" still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.

13. 【2606.18120】Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Link: https://arxiv.org/abs/2606.18120

Authors: Mohammadreza Rashidi

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Microsoft Semantic Kernel, Semantic Kernel, Large language model, Microsoft Semantic, default prompt-template format

Comments: 7 pages, 6 figures

Abstract:Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application's exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
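
The model-free part of this argument is easy to reproduce in a few lines. The sketch below approximates the HTML escaping that Handlebars applies to double-brace expressions with a plain character map, then checks which role delimiters pass through unchanged. The delimiter strings and the helper names `escape_like_handlebars` and `survives` are illustrative assumptions, not the paper's test set or code.

```python
# Character map approximating Handlebars' HTML escaping of double-brace values.
ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "`": "&#x60;",
    "=": "&#x3D;",
}


def escape_like_handlebars(value: str) -> str:
    """Escape each character that appears in the map; leave the rest alone."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def survives(delimiter: str) -> bool:
    """True when escaping leaves the delimiter byte-for-byte intact."""
    return escape_like_handlebars(delimiter) == delimiter


# Example delimiters chosen for illustration (one per family).
DELIMITERS = {
    "ChatML": "<|im_start|>system",
    "Llama-3": "<|start_header_id|>system<|end_header_id|>",
    "XML": "<system>",
    "Llama-2": "[INST]",
    "Legacy": "Human:",
    "Markdown": "### System",
}

report = {family: survives(text) for family, text in DELIMITERS.items()}
```

Only the bare `[INST]` token is checked for the Llama-2 family here; any delimiter that also contains angle brackets would be rewritten by the same map.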

14. 【2606.18103】HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

Link: https://arxiv.org/abs/2606.18103

Authors: Noah J. Kim-Baumann, Torsten Hiltmann

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: default configurations remain, configurations remain oriented, dominant evaluation paradigms, grounding language model, language model outputs

Comments: 25 pages, 6 figures. Companion preprint to a Journal of Digital History notebook article (under review)

Abstract:Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual question-answering. For interpretive disciplines such as historical studies, RAG embeds assumptions that conflict with scholarly practice. We introduce HistoRAG, a framework that translates historiographical principles into concrete architectural interventions. Separated retrieval and generation decouples source discovery from interpretation, temporal windowing enforces balanced source representation across the research period as a methodological requirement of historical inquiry, and LLM-as-judge evaluation makes relevance judgments transparent and contestable. We evaluate these interventions using SPIEGELragged, applied to 102,189 articles from Der Spiegel (1950-1979). Each intervention addresses a measurable deficiency in standard RAG: era-specific vocabulary retrieves zero chunks from the 1950s when using 1970s terminology, evidence of the temporal skew that motivates windowing; vector similarity and LLM-assessed relevance correlate only weakly (Spearman rho = 0.275), motivating post-retrieval evaluation; and keyword-based and semantic retrieval surface largely disjoint source pools, motivating an architecture in which both operate as complementary retrieval layers under a shared LLM evaluation filter. We also introduce the concept of Zwischentexte (intermediate texts that function as interpretive proposals rather than findings) as a framework for responsible integration of LLM-generated text into scholarly practice. The architecture offers a model for how domain-specific epistemological commitments can be translated into RAG design decisions, and may transfer to other interpretive disciplines working with large corpora.
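
Temporal windowing, as described in this abstract, amounts to selecting results per time window instead of taking one global top-k. A minimal sketch follows, assuming chunks arrive as `(year, score, text)` tuples and that windows are fixed-width spans of years; the function name `windowed_top_k` and the sample data are hypothetical and do not come from the HistoRAG code.

```python
from collections import defaultdict


def windowed_top_k(chunks, window_years, k):
    """Return the k best-scoring chunks from each time window, oldest window first."""
    by_window = defaultdict(list)
    for year, score, text in chunks:
        window_start = year - (year % window_years)  # e.g. 1974 -> 1970 for 10-year windows
        by_window[window_start].append((score, year, text))
    selected = []
    for window_start in sorted(by_window):
        best = sorted(by_window[window_start], reverse=True)[:k]
        selected.extend((year, score, text) for score, year, text in best)
    return selected


# Made-up scores: the 1970s chunks dominate a global ranking.
chunks = [
    (1952, 0.41, "a"),
    (1965, 0.55, "e"),
    (1971, 0.93, "b"),
    (1974, 0.90, "c"),
    (1978, 0.88, "d"),
]
balanced = windowed_top_k(chunks, window_years=10, k=1)
global_top3 = sorted(chunks, key=lambda c: c[1], reverse=True)[:3]
```

With these sample scores a global top-3 contains only 1970s chunks, while the windowed selection keeps one chunk per decade, which is the kind of skew the abstract says windowing is meant to counter.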

15. 【2606.18062】Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

Link: https://arxiv.org/abs/2606.18062

Authors: Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, Nicolas Christin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)

Keywords: Large language models, Large language, pose educational questions, fulfill users' information, pose educational

Abstract:Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (SP), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the SP questions users ask LLMs; prior research on LLM response quality relied on expert-authored SP misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 SP prompts and categorizes them into nine categories covering a wide range of SP topics. From the SP prompts, we sampled 450 and performed a thematic analysis to characterize the SP questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking SP prompts, where users ask for recommendations, guidance, or specific SP information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

16. 【2606.18060】PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

Link: https://arxiv.org/abs/2606.18060

Authors: Xinyang Liao, Lingyu Li, Huacan Liu, Tianle Gu, Yang Yao, Tong Zhu, Yan Teng, Yingchun Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Language Model based, Model based agents, Model based, Large Language

Comments: 26 pages, 21 figures

Abstract:As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that contaminate academic literature and erode trust in science. We present PseudoBench, an adversarial benchmark for evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives. PseudoBench contains 200 curated pseudoscientific claim-evidence pairs across five domains and evaluates agents through an end-to-end research pipeline from experiments to writing. Testing seven state-of-the-art agents, we find that current systems readily produce persuasive reports that align with pseudoscientific premises with near-zero refusal rates and the highest resistance of only 27.4%. Stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. These findings reveal an alarming capacity to fuel pseudoscience, calling for scientific alignment before widespread deployment.

17. 【2606.18057】When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

Link: https://arxiv.org/abs/2606.18057

Authors: Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)

Keywords: lived experience, support, lived, experience, peer-like

Abstract:Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

18. 【2606.18056】ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

Link: https://arxiv.org/abs/2606.18056

Authors: Yao Chen,Yinqi Yang,Junyuan Shang,Xiangzhao Hao,Simeng Zhang,Yilong Chen,Tingwen Liu,Shuohuan Wang,Dianhai Yu

Categories: Computation and Language (cs.CL)

Keywords: Hybrid architectures combining, architectures combining full, efficient LLM inference, combining full attention, Hybrid architectures

Notes:

Abstract:Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.
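The abstract names L0 regularization and an augmented Lagrangian constraint but gives no formulas. The sketch below is one common way to combine the two, assuming hard-concrete gates with the usual default constants and a linear-plus-quadratic penalty on the sparsity gap. The function names, the constants, and the example gate values are all illustrative assumptions, not ConSA's actual implementation.

```python
import math

# Hard-concrete stretch limits and temperature (widely used defaults, assumed here).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def prob_full_attention(log_alpha):
    """Probability that a unit's gate is non-zero, i.e. the unit keeps FA."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return 1.0 / (1.0 + math.exp(-(log_alpha - shift)))


def expected_sparsity(log_alphas):
    """Expected fraction of units whose gate is zero, i.e. switched to SWA."""
    keep = sum(prob_full_attention(a) for a in log_alphas)
    return 1.0 - keep / len(log_alphas)


def sparsity_penalty(log_alphas, target, lam, rho):
    """Augmented-Lagrangian-style term: lam * gap + (rho / 2) * gap^2."""
    gap = expected_sparsity(log_alphas) - target
    return lam * gap + 0.5 * rho * gap * gap


# Hypothetical gate parameters, one per attention unit (layer or KV head).
gates = [3.0, 2.5, -4.0, -3.5]
print(round(expected_sparsity(gates), 3))
print(round(sparsity_penalty(gates, target=0.75, lam=1.0, rho=10.0), 4))
```

In a training loop this penalty would be added to the language modeling loss, with `lam` updated by gradient ascent so that the expected sparsity is pushed toward the user-specified target.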

19. 【2606.18051】Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

Link: https://arxiv.org/abs/2606.18051

Authors: Xueping Gao

Categories: Computation and Language (cs.CL)

Keywords: reusable tool specifications, agents increasingly rely, require composing multiple, LLM agents increasingly, composing multiple skills

Notes:

Abstract:LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub-tasks, retrieve the appropriate skill for each sub-task, and compose an executable plan. We present SkillWeaver, a decompose-retrieve-compose framework combining an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, we propose Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that iteratively aligns decomposition with available skills. SAD improves decomposition accuracy from 51.0% to 67.7% (+32.7%, Wilcoxon p < 10^-6) in a single iteration; DA-conditioned analysis confirms that correct granularity is the prerequisite for effective retrieval (CatR@1 rises from 34% to 41% when DA=1). SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).
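The "+32.7%" quoted for SAD is easy to misread as percentage points. A quick check of the two numbers from the abstract shows it is a relative gain; the variable names below are just for illustration.

```python
# Decomposition accuracy (%) before and after SAD, as quoted in the abstract.
before, after = 51.0, 67.7

absolute_gain = after - before                # in percentage points
relative_gain = absolute_gain / before * 100  # relative to the starting accuracy

print(f"absolute: +{absolute_gain:.1f} points, relative: +{relative_gain:.1f}%")
# prints "absolute: +16.7 points, relative: +32.7%"
```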

20. 【2606.18037】ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Link: https://arxiv.org/abs/2606.18037

Authors: Ander Alvarez,Santhiya Rajan,Samuel Mugel,Román Orús

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Model Context Protocol, Tool-using LLM agents, Tool-using LLM, Context Protocol, Model Context

Notes: 20 pages, 4 figures

Abstract:Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.

21. 【2606.18033】When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Link: https://arxiv.org/abs/2606.18033

Authors: Fred Philippy,Siwen Guo,Jacques Klein,Tegawendé F. Bissyandé

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: linguistic similarity largely, similarity largely determine, supervised fine-tuning contexts, multilingual NLP, determine transfer quality

Notes: Accepted at 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), co-located with ACL 2026

Abstract:Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.

22. 【2606.18021】LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Link: https://arxiv.org/abs/2606.18021

Authors: Lalit Yadav,Akshaj Gurugubelli

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: leaving compliance officers, Risk Direction Index, legal workflows hallucinate, aggregate metrics report, leaving compliance

Notes: 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

Abstract:AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.

23. 【2606.17999】VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

Link: https://arxiv.org/abs/2606.17999

Authors: Chunyu Liu,Zhengyang Fan,Kaisen Yang,Alex Lamb

Categories: Computation and Language (cs.CL)

Keywords: making response-length modeling, response-length modeling central, MDLMs generate text, EOS

Notes:

Abstract:MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at this https URL.

24. 【2606.17973】Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

Link: https://arxiv.org/abs/2606.17973

Authors: Olivier Tieleman,Ziyi Zhu,Ting Su,Samuel J. Bell,Thomas D. Hull,Caitlin A. Stamatis

Categories: Computation and Language (cs.CL)

Keywords: disability worldwide, timely intervention, early detection, change is essential, essential for timely

Notes: 12 pages, 1 figure

Abstract:Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head and augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 = 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 = 3 to PHQ-9 = 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.

25. 【2606.17967】Learning task-specific subspaces via interventional post-training of speech foundation models

Link: https://arxiv.org/abs/2606.17967

Authors: Jack Cox,Jon Barker

Categories: Computation and Language (cs.CL)

Keywords: unlabelled speech data, produce general-purpose representations, pre-trained on large, produce general-purpose, large corpora

Notes: Accepted to Interspeech 2026; 6 pages (4 main body), 2 figures

Abstract:Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.

26. 【2606.17910】Non-negative Elastic Net Decoding for Information Retrieval

Link: https://arxiv.org/abs/2606.17910

Authors: Koki Okajima,Yasutoshi Ida,Tsukasa Yoshida,Yasuaki Nakamura

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: NNN decoding, Dense retrieval, retrieval, NNN, Dense

Notes: 19 pages, 4 figures

Abstract:Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the entire corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To this end, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure which optimizes the embeddings for NNN decoding, producing significant performance gains that surpass dense retrieval on all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.
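To make the contrast with inner-product scoring concrete, here is a minimal sketch that assumes the textbook non-negative elastic net objective, 0.5·||q − Σ w_i d_i||² + l1·Σ w_i + 0.5·l2·Σ w_i² with w ≥ 0, solved by projected gradient descent. The abstract does not state the exact objective or solver, so the penalty weights, the toy corpus, and the function names are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_top_k(query, docs, k):
    """Standard dense retrieval: rank documents by inner product with the query."""
    scores = [dot(d, query) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]


def nnn_decode(query, docs, l1=0.05, l2=0.01, iters=5000):
    """Non-negative elastic net weights via projected gradient descent."""
    n, dim = len(docs), len(query)
    # Conservative step size: 1 / (upper bound on the gradient's Lipschitz constant).
    step = 1.0 / (sum(dot(d, d) for d in docs) + l2)
    w = [0.0] * n
    for _ in range(iters):
        recon = [sum(w[i] * docs[i][j] for i in range(n)) for j in range(dim)]
        resid = [recon[j] - query[j] for j in range(dim)]
        # Gradient step on the smooth part plus the L1 term, then clip to w >= 0.
        w = [max(0.0, w[i] - step * (dot(docs[i], resid) + l1 + l2 * w[i]))
             for i in range(n)]
    return w


# Toy corpus: documents 0 and 1 are near-duplicates, document 2 covers a different aspect.
docs = [[1.0, 0.0], [0.99, 0.141], [0.0, 1.0]]
query = [0.8, 0.6]

dense = dense_top_k(query, docs, k=2)
weights = nnn_decode(query, docs)
nnn = sorted(range(len(docs)), key=lambda i: weights[i], reverse=True)[:2]
print("dense top-2:", dense)
print("NNN top-2:  ", nnn)
```

On this toy corpus, inner-product scoring returns the two near-duplicates, while the joint reconstruction gives most of its weight to one of them plus the complementary document, which is the redundancy effect the abstract describes.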

27. 【2606.17905】ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Link: https://arxiv.org/abs/2606.17905

Authors: Peixian Zhou,Yuxu Chen,Chaorui Zhang,Wei Han,Bo Bai,Xueyan Niu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, ability remains robust, Large language, General aligned set, Difficult aligned set

Notes:

Abstract:Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English--Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

28. 【2606.17890】Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Link: https://arxiv.org/abs/2606.17890

Authors: Zihao Wei,Wenjie Shi,Liang Pang,Jingcheng Deng,Shicheng Xu,Shasha Guo,Zenghao Duan,Jiahao Liu,Jingang Wang,Huawei Shen,Xueqi Cheng

Categories: Computation and Language (cs.CL)

Keywords: improve LLM performance, improve LLM, LLM performance, performance on complex, generating unnecessary reasoning

Notes: 21 pages, 10 figures, 2 tables

Abstract:Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
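The credit-assignment issue described in this abstract can be seen in a few lines. The sketch below assumes the common GRPO formulation in which each rollout's advantage is its reward standardized within the group and then broadcast to every token. The rewards and lengths are made-up values, and the snippet does not implement DRE itself.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantage: standardize each reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def token_advantages(rewards, lengths):
    """Sequence-level credit: every token in a rollout gets the same advantage."""
    return [[adv] * n for adv, n in zip(group_advantages(rewards), lengths)]


# Hypothetical group of four rollouts for one prompt: two correct, two wrong.
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [120, 400, 150, 90]  # the second rollout keeps "thinking" after its answer

per_token = token_advantages(rewards, lengths)
print(per_token[1][0], per_token[1][-1])  # first and last token share one value
```

Because the 400-token rollout is rewarded uniformly, the tokens produced after the answer emerged are reinforced exactly as strongly as the tokens that reached it, which is the imbalance the paper's rollout editing is meant to weaken.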

29. 【2606.17861】GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Link: https://arxiv.org/abs/2606.17861

Authors: Tongxu Luo,Rongsheng Wang,Jiaxi Bi,Chenming Xu,Zhengyang Tang,Jianlong Chen,Juhao Liang,Ke Ji,Shuqi Guo,Yuhao Du,Fan Bu,Wenyu Du,Xiaotong Zhang,Kyle Li,Shaobo Wang,Linfeng Zhang,Yuxuan Liu,Xin Lai,Chenxin Li,Yiduo Guo,Zhexin Zhang,Xinyuan Wang,Tianyi Bai,Ziniu Li,Benyou Wang

Categories: Computation and Language (cs.CL)

Keywords: playable interactive systems, transform natural-language specifications, Game generation, requiring models, emerging application

Notes:

Abstract:Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See this https URL for demos, code, and data.

30. 【2606.17838】Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Link: https://arxiv.org/abs/2606.17838

Authors: Rean Clive Fernandes,Lukas Fehring,Theresa Eimer,Marius Lindauer,Matthias Feurer

Categories: Computation and Language (cs.CL)

Keywords: prompt engineering remains, task-specific process, remains a manual, highly sensitive, engineering remains

Notes:

Abstract:LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.

31. 【2606.17835】Perceptual compensation for tonal context in self-supervised speech models

Link: https://arxiv.org/abs/2606.17835

Authors: James Kirby,Ioana Krehan,Michele Gubian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Keywords: architecture exhibits evidence, Mandarin Chinese tones, architecture exhibits, study examines, examines the extent

Notes: Accepted for publication at Interspeech 2026

Abstract:This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

32. 【2606.17826】When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Link: https://arxiv.org/abs/2606.17826

Authors: Jean Seo,Minkyu Kim,Jeonguk Lee,Jisoo Jung,Wooseok Han,Eunho Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automatic speech recognition, Automatic speech, valid orthographic forms, multiple valid orthographic, non-English clinical settings

Notes: Interspeech 2026

Abstract:Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: this https URL.

33. 【2606.17820】Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Link: https://arxiv.org/abs/2606.17820

Authors: Reihaneh Amooie,Yun Hao,Wietse de Vries,Jelske Dijkstra,Matt Coler,Martijn Wieling

Categories: Computation and Language (cs.CL)

Keywords: automatic speech recognition, affects automatic speech, fine-tuning affects automatic, language identification, language identification token

Notes:

Abstract:This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of language families and writing systems. To distinguish the two languages, during training, we pre-pend each input text with a language identification token. At inference, the model jointly predicts both the language and transcription from the speech input alone. As texts for which the language is incorrectly determined show low ASR performance, we also conduct a follow-up experiment in which the language identification token is provided both during training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that in cases where language identification performance is low, including the language identification token at inference helps to improve ASR performance.
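The language identification token described above can be illustrated with a small sketch. The token strings and language codes below are placeholders chosen for this example; the abstract does not say which token format or languages the authors use.

```python
# Hypothetical language-ID tokens; the real token format is not given in the abstract.
LANG_TOKENS = {"lang_a": "<|lang_a|>", "lang_b": "<|lang_b|>"}


def add_language_token(text, lang):
    """Training side: prepend the language-ID token to the target transcript."""
    return f"{LANG_TOKENS[lang]} {text}"


def split_prediction(prediction):
    """Inference side: recover (language, transcript) from the model output."""
    for lang, token in LANG_TOKENS.items():
        if prediction.startswith(token):
            return lang, prediction[len(token):].strip()
    return None, prediction.strip()  # no recognizable language token


target = add_language_token("an example transcript", "lang_a")
print(target)
print(split_prediction(target))
```

In the follow-up setting from the abstract, the token would instead be supplied to the model at inference time, so only the transcript has to be predicted.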

34. 【2606.17819】A Framework for Evaluating Agentic Skills at Scale

Link: https://arxiv.org/abs/2606.17819

Authors: Maksim Shaposhnikov,Nicolas Fortuin,Simon Stipcich,Maria I. Gorinova,Amy Heineike,Rob Willoughby

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: reusable knowledge artifacts, reusable methodology exists, models remain under-studied, LLM agent capabilities, augment LLM agent

Notes:

Abstract:Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

35. 【2606.17815】Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors

Link: https://arxiv.org/abs/2606.17815

Authors: Kunlan Xiang,Haomiao Yang,Wenbo Jiang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Contrastive Language-Image Pre-training, Language-Image Pre-training models, including feature extraction, Contrastive Language-Image, Language-Image Pre-training

Notes:

Abstract:Contrastive Language-Image Pre-training models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor studies, however, usually validate attacks on a small attack-native task, leaving unclear whether the same poisoned checkpoint remains exposed, weakens, or becomes not applicable when reused through other interfaces. We introduce DIFE, a Deployment-Interface Footprint Evaluation framework that audits backdoored CLIP checkpoints across deployment interfaces. DIFE makes various evaluations comparable by specifying each interface's component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable CLIP component or component combination that carries exposure and explains where risk transfers. Auditing reproduced CLIP backdoors with DIFE reveals a structured landscape: native success is not a checkpoint-level risk certificate, exposure follows component footprints, text-side poisoning does not yield textual-encoder control, and some coupled attacks remain mechanism-bound. This audit reveals an important gap in existing CLIP backdoors: a textual encoder that itself becomes a reusable carrier of adversarial behavior. We therefore introduce BadTextTower to fill this gap. BadTextTower produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean.

36. 【2606.17799】Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

链接：https://arxiv.org/abs/2606.17799

作者：Maria I. Gorinova,Macey Baker,Amy Heineike,Maksim Shaposhnikov,Rob Willoughby,Dru Knox

类目：Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：software engineering, agentic software engineering, pre-agent era, typically computed, major mode

备注：

点击查看摘要

Abstract:Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

37. 【2606.17791】The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

链接：https://arxiv.org/abs/2606.17791

作者：Samar Ansari

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：tools increasingly summarize, reformat radiology reports, large language models, documentation tools increasingly, increasingly summarize

备注：

点击查看摘要

Abstract:AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
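
The "six to seven times" comparison in the abstract can be checked directly from the reported alignment drops. The short sketch below only redoes that arithmetic; the variable names are illustrative and nothing here reproduces the paper's measurement pipeline.

```python
# Alignment drops (in percent) as quoted in the abstract above.
ehr_summarization_drop = 2.5
cleaner_task_drops = [14.9, 16.5]  # standardized rewriting / teaching case preparation range

# Ratio of each reported drop to the EHR summarization drop.
ratios = [round(drop / ehr_summarization_drop, 2) for drop in cleaner_task_drops]

print(ratios)
```

The lower ratio rounds to six and the upper one sits between six and seven, which matches the abstract's wording.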

38. 【2606.17786】Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

链接：https://arxiv.org/abs/2606.17786

作者：Pascal Riachi,Sofie Kamber,Stella Brogna,Andrew Gloster,Rafael Wampfler

类目：Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词：Acceptance and Commitment, requires repeated practice, Commitment Therapy, requires repeated, opportunities for safe

备注：

点击查看摘要

Abstract:Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designed to support ACT-oriented psychotherapy training through spoken dialogue with an embodied virtual patient. The system uses large language models to simulate patient behavior conditioned on profiles derived from real therapy sessions and configurable clinical scenarios, while a separate automated evaluator provides turn-by-turn feedback on therapist responses based on established ACT fidelity criteria. Rather than aiming to replace supervision, the system is intended to support deliberate practice by enabling experimentation, reflection, and immediate feedback in low-risk settings. Expert evaluation with practicing psychologists confirmed high realism in patient behavior and demonstrated that immediate turn-by-turn ACT feedback increased therapists' awareness of intervention choices and enabled effective experimentation with alternative responses. Quantitative evaluation across 49 therapy transcripts identified GPT-4o-mini as the optimal feedback model, achieving the lowest mean absolute error (MAE = 6.12) in replicating human supervisor ACT fidelity ratings with statistically significant agreement. This work demonstrates the potential of fidelity-aware simulated patients as a scalable complement to psychotherapy training.

39. 【2606.17710】Vision-language models for chest radiography do not always need the image

链接：https://arxiv.org/abs/2606.17710

作者：Mahshad Lotfinia,Sebastian Ziegelmayer,Lisa Adams,Daniel Truhn,Andreas Maier,Soroosh Tayebi Arasteh

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Medical vision-language models, report strong chest, strong chest radiograph, Medical vision-language, vision-language models report

备注：

点击查看摘要

Abstract:Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

40. 【2606.17698】EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

链接：https://arxiv.org/abs/2606.17698

作者：Zeyao Du,Tong Li,Haibo Zhang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：existing benchmarks fail, agents enter production, shopper requirements arrive, enter production, stated implicitly

备注：

点击查看摘要

Abstract:As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.

41. 【2606.17688】LLMs Infer Cultural Context but Fail to Apply It When Responding

链接：https://arxiv.org/abs/2606.17688

作者：Yisong Miao,Jian Zhu,Vered Shwartz

类目：Computation and Language (cs.CL)

关键词：Recent work, overrepresent dominant cultures, LLMs overrepresent dominant, Pragmatic Response Inference, work has shown

备注： 9 pages, 7 figures, 2 tables (24 pages, 12 figures, 8 tables including references and appendices)

点击查看摘要

Abstract:Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

42. 【2606.17687】SuCo: Sufficiency-guided Continuous Adaptive Reasoning

链接：https://arxiv.org/abs/2606.17687

作者：Jiahao Wang,Bingyu Liang,Chenhao Hu,Longhui Zhang,Xuebo Liu,Min Zhang,Jing Li,Xuelong Li

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：inflating computational costs, generate excessively long, Large Reasoning Models, Large Reasoning, complex tasks

备注： Accepted to ICML 2026. 18 pages

点击查看摘要

Abstract:Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
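
The MSC definition in the abstract (the shortest prefix of a CoT trajectory that is adequate for producing the correct answer) can be illustrated with a small search over prefixes. This is a minimal sketch, not the authors' procedure: the function name `minimal_sufficient_prefix` and the `is_sufficient` callback are hypothetical, and in practice the sufficiency check would involve querying a model rather than the toy string test used here.

```python
from typing import Callable, List, Optional


def minimal_sufficient_prefix(
    steps: List[str],
    is_sufficient: Callable[[List[str]], bool],
) -> Optional[int]:
    """Return the smallest k such that steps[:k] passes the sufficiency check.

    A linear scan is used because sufficiency is not assumed to be monotone
    in the prefix length. Returns None when no prefix is sufficient.
    """
    for k in range(len(steps) + 1):
        if is_sufficient(steps[:k]):
            return k
    return None


# Toy trajectory: the answer is already derivable after the second step.
cot_steps = [
    "Restate the problem: compute 2 + 3.",
    "Add the numbers: 2 + 3 = 5.",
    "Double-check by counting up from 2.",
    "Restate the final answer.",
]


def toy_check(prefix: List[str]) -> bool:
    # Hypothetical stand-in for "the model answers correctly given this prefix".
    return any("= 5" in step for step in prefix)


msc_length = minimal_sufficient_prefix(cot_steps, toy_check)
print(msc_length)
```

Here the last two steps are redundant under the toy check, so the minimal sufficient prefix keeps only the first two.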

43. 【2606.17683】Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

链接：https://arxiv.org/abs/2606.17683

作者：Longhui Zhang,Jiahao Wang,Chenhao Hu,Bingyu Liang,Jing Li,Min Zhang

类目：Computation and Language (cs.CL); Programming Languages (cs.PL)

关键词：large language models, language models, comparatively little attention, code translation systems, large language

备注： Accepted to ICML 2026

点击查看摘要

Abstract:While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore's law, runtime efficiency has become increasingly important for program quality, alongside functional correctness. Our preliminary study reveals that LLM-translated programs often run slower than human-written ones, and this issue cannot be remedied through prompt engineering alone. Therefore, our work proposes SwiftTrans, a code translation framework comprising two key stages: (1) Multi-Perspective Exploration, where MpTranslator leverages parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, where DiffSelector identifies the optimal candidate by explicitly comparing differences between translations. We further introduce Hierarchical Guidance for MpTranslator and Ordinal Guidance for DiffSelector, enabling LLMs to better adapt to these two core components. To support the evaluation of runtime efficiency in translated programs, we extend existing benchmarks, CodeNet and F2SBench, and introduce a new benchmark, SwiftBench. Experimental results across all three benchmarks show that SwiftTrans achieves consistent improvements in both correctness and runtime efficiency.

44. 【2606.17682】From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

链接：https://arxiv.org/abs/2606.17682

作者：Chao Chen,Chengzu Li,Zhiwei Li,Yinhong Liu,Zhijiang Guo

类目：Computation and Language (cs.CL)

关键词：Large Language Model, Large Language, Reinforcement learning pipelines, pipelines for Large, manually redesigned environments

备注：

点击查看摘要

Abstract:Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.

45. 【2606.17680】EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

链接：https://arxiv.org/abs/2606.17680

作者：Zhitong Wang,Songze Li,Hao Peng,Shuzheng Si,Yi Wang,Maosong Sun,Juanzi Li

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：training Large Language, Large Language Models, Large Language, training Large, Language Models

备注：

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.

46. 【2606.17650】MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

链接：https://arxiv.org/abs/2606.17650

作者：Hao-Yuan Ma,Li Zhang,Minjie Qiang,Jie Gao

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Text-guided Open-vocabulary Object, Open-vocabulary Object Counting, Text-guided Open-vocabulary, large scale variations, Object Counting

备注：

点击查看摘要

Abstract:Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.

47. 【2606.17645】Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

链接：https://arxiv.org/abs/2606.17645

作者：Shiqi He,Yue Cui,Feijie Wu,Xinyu Ma,Jiaheng Lu,Yaliang Li,Bolin Ding,Mosharaf Chowdhury

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Large language model, structured tool action, Large language, model reads, policy-facing LLM completions

备注：

点击查看摘要

Abstract:Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.

48. 【2606.17634】Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

链接：https://arxiv.org/abs/2606.17634

作者：Dong Huang,Jianbo Sun,Pengkun Yang

类目：Computation and Language (cs.CL)

关键词：Evaluating large language, comparing competing systems, large language models, Evaluating large, language models

备注： 42 pages, 8 figures

点击查看摘要

Abstract:Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
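
The cyclic preference $A \succ B \succ C \succ A$ mentioned in the abstract can be spotted mechanically in a set of pairwise outcomes. The sketch below only illustrates detecting such 3-cycles in a comparison graph; it is not the paper's perturbation or filtering method, and the function name `cyclic_triples` and the `(winner, loser)` encoding are assumptions made for this example.

```python
from itertools import combinations
from typing import List, Set, Tuple


def cyclic_triples(beats: Set[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return every unordered triple of items whose pairwise outcomes form a cycle.

    `beats` holds (winner, loser) pairs. A triple (a, b, c) is cyclic when the
    outcomes run a > b > c > a or, in the other orientation, a > c > b > a.
    """
    nodes = sorted({item for pair in beats for item in pair})
    cycles = []
    for a, b, c in combinations(nodes, 3):
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (a, c) in beats and (c, b) in beats and (b, a) in beats
        if forward or backward:
            cycles.append((a, b, c))
    return cycles


# Toy outcomes: A, B, C form a cycle, while D loses to everyone consistently.
outcomes = {
    ("A", "B"), ("B", "C"), ("C", "A"),
    ("A", "D"), ("B", "D"), ("C", "D"),
}

found = cyclic_triples(outcomes)
print(found)
```

Triples flagged this way are the kind of structurally inconsistent pattern that prevents a coherent global ranking; how such patterns are filtered before aggregation is specific to the paper.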

49. 【2606.17628】OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

链接：https://arxiv.org/abs/2606.17628

作者：Guibin Zhang,Xun Xu,Yanwei Yue,Zikun Su,Wangchunshu Zhou,Xiaobin Hu,Shuicheng Yan

类目：Computation and Language (cs.CL)

关键词：standard substrate, substrate for self-evolving, Memory, self-evolving agents, retaining experience

备注：

点击查看摘要

Abstract:Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

50. 【2606.17609】The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

链接：https://arxiv.org/abs/2606.17609

作者：Rui Wen,Lu Sun,Jiayang Liu,Zesheng Xu,Tianshuo Cong,Zheng Li

类目：Computation and Language (cs.CL)

关键词：Compressing large language, Compressing large, standard benchmarks miss, models reduces memory, inference cost

备注：

点击查看摘要

Abstract:Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example. Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.

51. 【2606.17579】LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

链接：https://arxiv.org/abs/2606.17579

作者：Zhongyuan Wang,Pratyusha Vemuri

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词：Adding LLM-generated node, graph neural networks, Adding LLM-generated, LLM-generated node features, neural networks

备注： 29 pages, 8 figures

点击查看摘要

Abstract:Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig = tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.

52. 【2606.17542】Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

链接：https://arxiv.org/abs/2606.17542

作者：Ryo Fukuda,Takatomo Kano,Siddhant Arora,Marc Delcroix,Naohiro Tawara,Atsunori Ogawa,Yuya Chiba,Atsushi Ando,William Chen,Shinji Watanabe

类目：Computation and Language (cs.CL)

关键词：multimodal multi-party conversations, large language models, investigate turn-taking, multi-party conversations, conversations using large

备注： Accepted to INTERSPEECH 2026

点击查看摘要

Abstract:We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

53. 【2606.17522】An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

链接：https://arxiv.org/abs/2606.17522

作者：Vinoth Nandakumar,Qiang Qu,Pramod Thebe,Sakshi Khachariya,Tongliang Liu

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Deep neural networks, ability to form, neural networks, networks are widely, widely believed

备注：

点击查看摘要

Abstract:Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.

54. 【2606.17519】Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

链接：https://arxiv.org/abs/2606.17519

作者：Kellen Gillespie,Robyn Perry

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：LLM assistants route, Production LLM assistants, assistants route user, routing accuracy degrade, route user requests

备注： 10 pages (6 main + 4 appendix), 4 figures, 6 tables

点击查看摘要

Abstract:Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp despite 10-15pp lower absolute performance.

55. 【2606.17506】Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

链接：https://arxiv.org/abs/2606.17506

作者：Ramaravind Kommiya Mothilal,Terry Jingchen Zhang,Raiyan Ahmed,Zhijing Jin,Shion Guha,Syed Ishtiaque Ahmed

类目：Computation and Language (cs.CL)

关键词：LLMs largely focus, imply biased content, bias, biased content, largely focus

备注： 20 pages, 13 tables, 2 figures

点击查看摘要

Abstract:Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at this https URL.

56. 【2606.17478】Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

链接：https://arxiv.org/abs/2606.17478

作者：Kexin Chen,Yi Liu,Haonan Zhang,Yanhui Li,Xinyu Deng,Dongxia Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：stronger reasoning capabilities, LLMs acquire stronger, acquire stronger reasoning, safety concern, acquire stronger

备注： Under review

点击查看摘要

Abstract:As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.

57. 【2606.17474】AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

链接：https://arxiv.org/abs/2606.17474

作者：Jiahui Niu,Huizi Yu,Wenkong Wang,Guangxin Dai,Jingxian He,Xiang Li,Zhiying Liang,Xinxin Lin,Kent CY So,Bryan YP Yan,Yun Kwok Wing,Yanqiu Xing,Xin Ma,Lizhou Fan

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large language models, Large language, evaluations remain static, remain static, narrowly outcome-based

备注： 49 pages, 12 figures, 11 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observe that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

58. 【2606.17467】PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

链接：https://arxiv.org/abs/2606.17467

作者：Aaditya Pai

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词：interleave legitimate authority, legitimate authority language, Prompt injection defenses, Federal Register rules, Prompt injection

备注： 7 pages, 3 figures, 2 tables. Under submission at EMNLP 2026 Industry Track

点击查看摘要

Abstract:Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real-document benchmark of 122 tasks across five professional domains (financial, legal, medical, scientific, DevOps) using actual SEC filings, Federal Register rules, PubMed abstracts, arXiv papers, and GitHub postmortems. Paraphrasing, the strongest defense on synthetic benchmarks, shows no statistically significant attack success rate reduction on real documents (p=0.500) while degrading utility from 91.8% to 82.8%. We introduce PARSE (Provenance-Aware Retrieval Sanitization), a domain-aware, fact-preserving sanitization pipeline that classifies each sentence by injection likelihood, extracts structured facts before rewriting, and verifies fact preservation via a consistency-checking loop. A directiveness gate routes 59% of real enterprise documents to a lightweight path, concentrating computational cost on high-risk documents. PARSE achieves 15.6% attack success rate -- a 38% reduction versus the 25.4% baseline -- at 86.9% utility, the only condition that is both statistically significant (p=0.014, adequately powered) and maintains near-baseline utility. Practitioners should evaluate defenses on domain-matched real documents, not synthetic proxies.
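
The "38% reduction" quoted above is a relative figure, while the gap between the two attack success rates is about 9.8 percentage points in absolute terms. A minimal sketch of that arithmetic, using only the numbers stated in the abstract (the function and variable names are illustrative and not taken from the paper):

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Return the relative reduction of `treated` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - treated) / baseline


# Attack success rates (percent) as quoted in the abstract.
baseline_asr = 25.4
parse_asr = 15.6

# Absolute change is measured in percentage points (pp).
absolute_drop_pp = baseline_asr - parse_asr
# Relative change is measured against the baseline rate.
relative_drop_pct = relative_reduction(baseline_asr, parse_asr)

print(f"absolute drop: {absolute_drop_pp:.1f} pp")
print(f"relative drop: {relative_drop_pct:.1f} %")
```

Keeping the two quantities separate matters when comparing defenses: the same absolute drop reads as a much larger relative reduction when the baseline rate is low.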

59. 【2606.17449】MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

链接：https://arxiv.org/abs/2606.17449

作者：Zehang Wei,Jiaxin Dai,Jiamin Yan,Xiang Xiang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词：Large Vision-Language Models, enhances Large Vision-Language, Multimodal Retrieval-Augmented Generation, remains highly susceptible, enhances Large

备注： To be presented at ACL 2026

点击查看摘要

Abstract:While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

60. 【2606.17443】Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

链接：https://arxiv.org/abs/2606.17443

作者：Xi Chu,Yupeng Hou

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：Large language models, Large language, Claude Sonnet, brands compete, multi-brand GEO competition

备注： 16 pages, 4 figures, 11 tables

点击查看摘要

Abstract:Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

61. 【2606.17391】NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

链接：https://arxiv.org/abs/2606.17391

作者：Logan Mann,Abdur Rahman,Mohammad Saifullah,Taaha Kazi,Vasu Sharma

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Long-form serialized audio, serialized audio drama, major creative medium, Long-form serialized, frontier large language

备注： 10 pages. Accepted to the ICML 2026 Workshops on High-dimensional Learning Dynamics (HiLD) and Culture x AI

点击查看摘要

Abstract:Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 = 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

62. 【2606.17389】Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

链接：https://arxiv.org/abs/2606.17389

作者：Logan Mann,Yi Xia,Ajit Saravanan,Ishan Dave,Saadullah Ismail,Shikhar Shiromani,Emily Huang,Ruizhe Li,Kevin Zhu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Multimodal Foundation Models, Multimodal Foundation, Foundation Models, Multimodal, reliability

备注： 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: [this https URL](https://github.com/itsloganmann/VLM-Reliability-Probe)

点击查看摘要

Abstract:Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and track its evolution (Delta H_s) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation (R approx 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
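
The two kinds of signal contrasted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact definition of H_s, so Shannon entropy over a normalized attention map is assumed here, and self-consistency is taken literally as the share of sampled answers that agree with the majority answer. All function names are hypothetical.

```python
import math
from collections import Counter


def spatial_entropy(attention_map):
    """Shannon entropy (in nats) of a 2D attention map, normalized to sum to 1.

    Assumed definition: low values mean tightly focused attention,
    high values mean attention spread over many locations.
    """
    total = sum(sum(row) for row in attention_map)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [v / total for row in attention_map for v in row if v > 0]
    return -sum(p * math.log(p) for p in probs)


def self_consistency(answers):
    """Return (majority_answer, agreement_rate) over sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so that "Cat" and "cat " count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


focused = [[1.0, 0.0], [0.0, 0.0]]
diffuse = [[0.25, 0.25], [0.25, 0.25]]
h_focused = spatial_entropy(focused)
h_diffuse = spatial_entropy(diffuse)

samples = ["cat", "Cat", "dog", "cat ", "cat"]
answer, rate = self_consistency(samples)
print(h_focused, h_diffuse, answer, rate)
```

The first function looks only at where the model attends; the second looks only at what it generates across samples, which is the kind of generation-time signal the abstract reports as the stronger predictor.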

63. 【2606.17372】Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

链接：https://arxiv.org/abs/2606.17372

作者：Peter Zeng,Amie J. Paige,Weiling Li,Susan E. Brennan,Owen Rambow,Cameron R. Jones

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Jones, Zeng, efficient referring expressions, recent studies

备注：

点击查看摘要

Abstract:Two recent studies (Jones et al. (2026); Zeng et al. (2026)) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.

64. 【2606.17354】Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

链接：https://arxiv.org/abs/2606.17354

作者：Jacob Bremerman,Brihi Joshi,Hirona Arai,Xiang Ren,Jonathan May

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：underexplored in NLP, preserved across languages, directly preserved, well-studied in linguistics, linguistics but underexplored

备注：

点击查看摘要

Abstract:Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.

65. 【2606.17350】Do Large Language Models Always Tell The Same Stories?

链接：https://arxiv.org/abs/2606.17350

作者：Thennal DK,Hans Ole Hatzel

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：outputs remains contested, generating diverse outputs, diverse outputs remains, Recent advances, large language models

备注：

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

66. 【2606.17339】SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

链接：https://arxiv.org/abs/2606.17339

作者：Sejal Bhalla,Larry Kieu,Aina Merchant,Eyal de Lara,Alex Mariakakis

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

关键词：simultaneously engaging neurological, uniquely informative window, engaging neurological, vocal systems, offers a uniquely

备注：

点击查看摘要

Abstract:Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

67. 【2606.17299】Examining the Limits of Word2Vec with Toki Pona

链接：https://arxiv.org/abs/2606.17299

作者：Daniel Zhenhan Huang,Hongchen Wu

类目：Computation and Language (cs.CL)

关键词：large vocabulary inventories, Toki Pona, widely validated, tested almost exclusively, Toki Pona community

备注： 10 pages, 4 figures, 3 tables. Accepted to the Society for Computation in Linguistics (SCiL) 2026

点击查看摘要

Abstract:Word2Vec's effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance -- a topic rarely addressed in word embedding literature -- we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec's effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.

68. 【2606.17289】Nothing from Something: Can a Language Model Discover 0?

链接：https://arxiv.org/abs/2606.17289

作者：Phoebe Zeng,Thomas L. Griffiths,Brenden M. Lake

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：artificial neural networks, based on artificial, developed with aspirations, aspirations of pushing, pushing the boundary

备注：

点击查看摘要

Abstract:AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new (and potentially logically more powerful) mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

69. 【2606.17281】Are you speaking my languages? On spoken language adherence in multimodal LLMs

链接：https://arxiv.org/abs/2606.17281

作者：Hyungwon Kim,Kandarp Joshi,Lillian Zhou,Pavel Golik,Petar Aleksic

类目：Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词：Automatic Speech Recognition, based Automatic Speech, Large Language Model, Speech Recognition, enables seamless multilingual

备注： 7 pages, 3 tables in the main body

点击查看摘要

Abstract:While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

70. 【2606.17255】MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

链接：https://arxiv.org/abs/2606.17255

作者：Jorge Iranzo-Sánchez,Gerard Mas-Mollà,Adrià Giménez,Jorge Civera,Albert Sanchis,Alfons Juan

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Simultaneous Speech Translation, MLLP-VRAIN research group, Speech Translation track, Simultaneous Speech, Speech Translation

备注： IWSLT 2026 System Description

点击查看摘要

Abstract:This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate in all language directions. In addition, for the En$\rightarrow${De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En$\rightarrow$De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.

71. 【2606.17250】Rethinking Groups in Critic-Free RLVR

链接：https://arxiv.org/abs/2606.17250

作者：Yihong Wu,Liheng Ma,Lingfeng Xiao,Muzhi Li,Xinyu Wang,Yingxue Zhang,Jian-Yun Nie

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：large language models, post-training large language, Reinforcement learning, language models, central paradigm

备注：

点击查看摘要

Abstract:Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
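
For readers unfamiliar with the group-based design the abstract revisits, the baseline step can be sketched as follows. This is a minimal illustration that assumes a GRPO-style normalization (group mean as baseline, group standard deviation as scale); the abstract does not name a specific method, and the sketch does not cover the paper's negative token filtering. The function name and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to the other rollouts for one question.

    The group mean serves as the value baseline, so no learned critic is needed.
    Dividing by the group standard deviation is an assumed normalization choice.
    """
    if not rewards:
        raise ValueError("need at least one rollout reward")
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]


# Four rollouts for one question with verifiable 0/1 rewards.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# When every rollout gets the same reward, all advantages are zero.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed, uniform)
```

In this sketch, every rollout in a group must finish before any advantage can be computed, which is one way to picture the synchronization barrier the abstract mentions.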

72. 【2606.17234】Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

链接：https://arxiv.org/abs/2606.17234

作者：Ali Marashian,Alexis Palmer,Katharina von der Wense

类目：Computation and Language (cs.CL)

关键词：large language models, rapid rise, rise in popularity, popularity of large, large language

备注：

点击查看摘要

Abstract:The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.

73. 【2606.17229】Rift: A Conflict Signature for Deception in Language Models

链接：https://arxiv.org/abs/2606.17229

作者：Petr Nyoma

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：central case ELK, case ELK, ELK cannot handle, AUC, central case

备注： 13 pages, 4 figures. Code and experiment logs: [this https URL](https://github.com/Omibranch/Rift)

点击查看摘要

Abstract:A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

74. 【2606.17213】Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

链接：https://arxiv.org/abs/2606.17213

作者：Vanshali Sharma,Andrea M. Bejar,Halil Ertugrul Aktas,Quoc-Huy Trinh,Debesh Jha,Gorkem Durak,Ulas Bagci

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：large language models, including large language, Recent advances, demonstrated strong adaptability, language models

备注：

点击查看摘要

Abstract:Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

75. 【2606.17188】Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

链接：https://arxiv.org/abs/2606.17188

作者：Prabhjot Singh,Bhushan Pawar,Madhu Reddiboina,Rajvee Sheth

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Multimodal Visual Reasoning, Punjabi Multimodal Visual, overlooking billions, billions of users, Punjabi Multimodal

备注：

点击查看摘要

Abstract:Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: this https URL.
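The McNemar test mentioned in the abstract compares two paired conditions (here, the same items posed in two scripts) by looking only at the items where the outcomes disagree. The sketch below is a minimal illustration of that test, not the authors' evaluation code: the function name `mcnemar` and the input layout (two parallel lists of per-item correctness) are assumptions, and it uses the chi-squared approximation without continuity correction.

```python
import math
from typing import Sequence, Tuple


def mcnemar(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[float, float]:
    """Return (statistic, p_value) for paired per-item outcomes.

    correct_a[i] and correct_b[i] are the outcomes for the same item
    under two conditions (hypothetically, two scripts of one language).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same items")
    # Discordant pairs: right under A only, and right under B only.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0, 1.0  # no disagreement, nothing to test
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

With 30 items solved only in the first script and 10 solved only in the second, the statistic is 10.0 and the p-value falls below 0.01. For small discordant counts an exact binomial version is usually preferred.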

76. 【2606.17175】Self-Generated Error Training for Token Editing in Diffusion Language Models

链接：https://arxiv.org/abs/2606.17175

作者：Lin Yao

类目：Computation and Language (cs.CL)

关键词：revise committed tokens, revise committed, block-diffusion decoding, committed tokens, random vocabulary corruptions

备注：

点击查看摘要

Abstract:Token-to-token (T2T) editing lets LLaDA2.1 revise committed tokens during block-diffusion decoding. The released recipe trains this editor on random vocabulary corruptions, but at inference the editor sees the model's own fluent, high-confidence draft errors instead. We study this training-inference mismatch and propose self-generated T2T, which performs a no-gradient draft pass, fills masked positions with predicted tokens, and supervises recovery in a second pass under these self-generated corruptions. We implement the update as a short LoRA continued-pretraining pass on LLaDA2.1-mini and evaluate on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. The method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.

77. 【2606.17174】From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

链接：https://arxiv.org/abs/2606.17174

作者：Mohammadsadegh Abolhasani,Hamid Reza Firoozfar,Reza Mousavi,Paul Jen-Hwa Hu

类目：Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

关键词：conventional media settings, parasocial interactions, parasocial relationships, media settings, studied in conventional

备注： Submitted for review in ARR for EMNLP 2026

点击查看摘要

Abstract:While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

78. 【2606.17168】RepSelect: Robust LLM Unlearning via Representation Selectivity

链接：https://arxiv.org/abs/2606.17168

作者：Filip Sondej,Yushi Yang,Adam Mahdi

类目：Computation and Language (cs.CL)

关键词：Making large language, deeply forget specific, large language models, large language, remains a central

备注：

点击查看摘要

Abstract:Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.

79. 【2606.17164】PromptMN: Pseudo Prompting Language

链接：https://arxiv.org/abs/2606.17164

作者：Enkhzol Dovdon

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Software Engineering (cs.SE)

关键词：left implicit, primary interface, interface between humans, humans and generative, buried in prose

备注： 32 pages, 2 figures

点击查看摘要

Abstract:Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.

80. 【2606.17162】MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

链接：https://arxiv.org/abs/2606.17162

作者：Ye Jin,Yangyang Xu,Jun Zhu,Yibo Yang

类目：Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

关键词：retain newly introduced, local edits reliably, preserve stable user, personalized presentation agents, user profile memory

备注： Code, website, project page, and video are linked in the paper

点击查看摘要

Abstract:Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory's ability to carry over preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.

81. 【2606.17113】The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

链接：https://arxiv.org/abs/2606.17113

作者：Csaba Kiss,Roland Molontay,Gabriele Pergola

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：Distinguishing causal adverse, adverse drug events, spurious correlations remains, causal adverse drug, Distinguishing causal

备注： 10 pages, 5 figures

点击查看摘要

Abstract:Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with Do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, assessing whether simpler models suffice, if domain-specific pre-training helps, whether scaling to LLMs improves causal detection, and the effect of post-hoc calibration. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated: XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM), using 5-fold cross-validation repeated over 20 runs. We measured accuracy, Expected Calibration Error (ECE) pre- and post-isotonic regression, and Jaccard concordance of causal terms with PRR, ROR, and EBGM; significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT's superiority also yielded the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.
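Two of the measures named in the abstract, Expected Calibration Error and Jaccard concordance, have standard textbook definitions that are easy to state in code. The sketch below illustrates those definitions only. The equal-width binning with 10 bins, the function names, and the example term sets are assumptions, since the abstract does not give the study's exact settings.

```python
from typing import List, Sequence, Set, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> float:
    """Equal-width ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, four predictions made at confidence 0.9 of which three are correct give an ECE of about 0.15. Two hypothetical term sets `{"a", "b", "c"}` and `{"b", "c", "d"}` share two of four distinct terms, so their Jaccard concordance is 0.5.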

82. 【2606.17092】Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

链接：https://arxiv.org/abs/2606.17092

作者：Kyle Gao,Pranavi Kotta,Linlin Xu,Jonathan Li,David A. Clausi

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词：coordination enables complex, enables complex conversational, multi-agent coordination enables, geographic information systems, multi-agent GIS system

备注： Kyle Gao and Pranavi Kotta contributed equally to this work

点击查看摘要

Abstract:Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security-oriented framework for risk identification, evaluation, and mitigation in a multi-agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state-machine-based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red-teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi-turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.

83. 【2606.17057】Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

链接：https://arxiv.org/abs/2606.17057

作者：Tingchao Fu,Wenkai Wang,Fanxiao Li,Huadong Zhang,Jinhong Zhang,Dayang Li,Yunyun Dong,Renyang Liu,Wei Zhou

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Large Language Models, Multimodal Large Language, image query pairs, Large Language, remain underexplored issue

备注： 18 pages, 11 figures

点击查看摘要

Abstract:Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs) but often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

84. 【2606.18019】Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews

链接：https://arxiv.org/abs/2606.18019

作者：Franziska Braun,Alea Rüggeberg,Thomas Ranzenberger,Hartmut Lehfeld,Thomas Hillemacher,Tobias Bocklet,Korbinian Riedhammer

类目：Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词：pose major challenges, Large Language Models, overlapping symptoms pose, symptoms pose major, prevalent neuropsychiatric disorders

备注： Accepted for publication in Text, Speech and Dialogue (TSD 2026). The final authenticated publication will be available online via Springer LNCS/LNAI

点击查看摘要

Abstract:Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weights Large Language Models (LLMs) for predicting dementia and depression severity from speech samples collected during standardized history taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for Support Vector Regression, using human and pause-enriched transcripts. Results show that LLMs effectively predict depression severity in zero-shot settings (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), reducing errors by up to 35% over zero-shot baselines. Pause-enriched transcripts achieve competitive performance with human transcriptions, demonstrating the viability of fully automatic screening pipelines for differential neuropsychiatric assessment.

85. 【2606.17537】Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition

链接：https://arxiv.org/abs/2606.17537

作者：Hiroyuki Deguchi,Takatomo Kano,Katsuki Chousa,Marc Delcroix

类目：Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

关键词：making speech recognition, generates output tokens, decoding generates output, NAR decoding, speech recognition faster

备注： Accepted at Interspeech2026

点击查看摘要

Abstract:Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is degraded because NAR decoding cannot resolve uncertainty by conditioning on previously generated tokens. To address this issue, we propose a novel NAR decoding framework based on minimum Bayes' risk (MBR) decoding, termed NAR-MBR decoding, that maximizes the expected utility calculated from samples drawn from the output probability of an NAR model rather than maximizing the output probability. Notably, by leveraging the nature of NAR models, multiple samples are obtained efficiently with a single forward computation. Our experiments across LibriSpeech, Switchboard, AMI, and a web presentation corpus demonstrated that our NAR-MBR decoding outperformed previous NAR decoding and ran faster than AR decoding.
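The selection step of sampling-based MBR decoding can be sketched in a few lines: each sampled hypothesis is scored by its average utility against all samples, and the highest-scoring one is returned. This is a generic illustration of MBR, not the paper's NAR-MBR implementation. The utility (negative word error rate), the function names, and the toy samples are assumptions, and how the samples are drawn from the NAR model is left out.

```python
from typing import Callable, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def negative_wer(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Assumed utility: higher is better, so the error rate is negated."""
    return -edit_distance(hyp, ref) / max(len(ref), 1)


def mbr_select(
    samples: List[List[str]],
    utility: Callable[[Sequence[str], Sequence[str]], float] = negative_wer,
) -> List[str]:
    """Return the sample with the highest average utility against all samples."""
    if not samples:
        raise ValueError("need at least one sample")

    def expected_utility(hyp: List[str]) -> float:
        # Monte Carlo estimate: the samples double as pseudo-references.
        return sum(utility(hyp, ref) for ref in samples) / len(samples)

    return max(samples, key=expected_utility)
```

Given the toy samples `the cat sat` (twice), `the cap sat`, and `a cat sat`, the selected output is `the cat sat`, because it is closest on average to everything that was sampled. The pairwise comparison costs a number of utility evaluations quadratic in the sample count, which is why obtaining the samples cheaply matters.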

信息检索

1. 【2606.18181】IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

链接：https://arxiv.org/abs/2606.18181

作者：Henry Bodwell,Hong Yang,John C. Simeone,Kelvin Gorospe,Bella Sullivan,Lana Huang,Jessica Gephart,Sandy Aylesworth,Molly Masterton,Naren Ramakrishnan

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词：violate applicable laws, lack applicable laws, applicable laws, violate applicable, lack applicable

备注：

点击查看摘要

Abstract:Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws. We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors. Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity, remains difficult to obtain. We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity. The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis. Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk assessments for industry, and provide support for policy implementation and targeted enforcement efforts to government agencies.

2. 【2606.18103】HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

链接：https://arxiv.org/abs/2606.18103

作者：Noah J. Kim-Baumann,Torsten Hiltmann

类目：Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：default configurations remain, configurations remain oriented, dominant evaluation paradigms, grounding language model, language model outputs

备注： 25 pages, 6 figures. Companion preprint to a Journal of Digital History notebook article (under review)

点击查看摘要

3. 【2606.17910】Non-negative Elastic Net Decoding for Information Retrieval

链接：https://arxiv.org/abs/2606.17910

作者：Koki Okajima,Yasutoshi Ida,Tsukasa Yoshida,Yasuaki Nakamura

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：NNN decoding, Dense retrieval, retrieval, NNN, Dense

备注： 19 pages, 4 figures

点击查看摘要

4. 【2606.17721】Understanding and Debugging Failures in N-Gram-Based Generative Retrieval

链接：https://arxiv.org/abs/2606.17721

作者：Richard Takacs,Adrian Bracher,Svitlana Vakulenko

类目：Information Retrieval (cs.IR)

关键词：emerging Information Retrieval, Generative Retrieval, Information Retrieval, increasingly capable language, capable language models

备注： Work in progress

点击查看摘要

Abstract:Generative Retrieval (GR) is an emerging Information Retrieval (IR) paradigm that is motivated by increasingly capable language models. In GR, a model directly generates identifiers for relevant documents. While these systems offer unique advantages, they also introduce distinct failure mechanisms. We explore these failure modes in three contributions: (1) We present a taxonomy of GR failure modes based on GR literature. (2) We empirically investigate failure in a subset of GR: ngram-based methods, more specifically, SEAL and MINDER. Our analysis reveals common issues, such as ambiguous docids, low identifier diversity, and the disproportionate impact of specific identifiers. (3) We introduce a new web-based tool that helps the IR community analyze generated ngrams and their respective contribution to the final ranking, providing an intuitive interface to identify where such GR methods go wrong.

5. 【2606.17707】Do Generative Recommenders Deepen the Information Cocoon? A Closed-Loop Simulation with LLM-powered User Simulators

链接：https://arxiv.org/abs/2606.17707

作者：Jiyuan Yang,Gengxin Sun,Mengqi Zhang,Lingjie Wang,Yuanzi Li,Hongxi Cui,Xin Xin,Pengjie Ren

类目：Information Retrieval (cs.IR)

关键词：reinforce existing preferences, Recommender systems alleviate, alleviate information overload, systems alleviate information, narrow users' exposure

备注：

点击查看摘要

Abstract:Recommender systems alleviate information overload, yet repeated feedback between recommendations and user interactions can reinforce existing preferences and narrow users' exposure, forming information cocoons. While this phenomenon has been widely studied in traditional sequential recommendation, its impact on generative recommendation remains unclear. By replacing atomic item IDs with Semantic ID (SID) sequences, generative recommenders introduce a different recommendation mechanism whose role in information cocoon formation is not yet understood. To investigate whether generative recommenders deepen information cocoons, we propose RecLoop, a closed-loop simulation framework with LLM-driven user agents. We compare two generative recommenders and two traditional sequential baselines on two Amazon datasets across multiple feedback cycles. In addition to standard exposure-level metrics, we introduce Code-Space Structural Cocoon, a model-level metric that measures concentration in the generated SID space. Experimental results show that generative recommenders are generally less prone to exposure-level cocoon formation than traditional baselines, preserving broader exposure diversity and slowing cross-user homogenization. However, feedback loops can still induce concentration within the generated SID space. We further find that cocoon severity depends strongly on tokenization strategy and model scale: collaborative-signal tokenization produces stronger cocoon effects than semantic tokenization, whereas larger models maintain greater code-space diversity and better retain access to niche content. These findings suggest that information cocoons in generative recommendation are shaped not only by recommendation behavior, but also by item tokenization and model capacity. Our code is available at this https URL.

6. 【2606.17664】Temporal Preference Optimization for Unsupervised Retrieval

Link: https://arxiv.org/abs/2606.17664

Authors: HyunJin Kim, Jaejun Shim, Young Jin Kim, JinYeong Bak

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: retrieving semantically related, misaligned documents-an important, documents-an important aspect, collection spans multiple, introduces temporal ambiguity

Comments: Accepted to ICML 2026

Abstract:Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance, retrieving semantically related but temporally misaligned documents. This matters when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for "Who is the president in 2019?" introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which are not always available. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. In experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries. We provide our code at this https URL.

7. 【2606.17468】RSRank: Learning Relevance from Representational Shifts

Link: https://arxiv.org/abs/2606.17468

Authors: Archit Gupta, Sai Sundaresan, Debabrata Mahapatra

Categories: Information Retrieval (cs.IR)

Keywords: enterprises deploy RAG-based, deploy RAG-based systems, final filtering step, provide grounded responses, user queries

Comments: Under Peer Review

Abstract:As enterprises deploy RAG-based systems to provide grounded responses to user queries, reranking has become a critical component for the final filtering step that separates relevant from distracting or irrelevant documents. Existing rerankers often rely on heuristic thresholds to achieve optimal filtering. Moreover, for relevance scoring, state-of-the-art methods use a language model's logit signals, which are designed for next-token prediction, not for assessing relevance. To address these limitations, we identify a principled signal for relevance: the representational shift (RS) induced in a query's internal state when conditioned on a document. We observe that the alignment between (a) RS induced by a candidate document and (b) RS induced by an oracle document-set provides a robust indicator of relevance. Building on this insight, we introduce a lightweight training framework that learns projections mapping RS to calibrated relevance scores. Our training objectives naturally filter irrelevant content at a zero threshold, reducing dependence on heuristic tuning. Across diverse retrieval datasets, our method delivers gains over SOTA rerankers.
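
The alignment signal described in this abstract can be illustrated with a toy sketch. It assumes that hidden states are plain vectors, that a representational shift is the difference between the document-conditioned and unconditioned query state, and that alignment is measured with cosine similarity. The paper instead learns projections that map RS to calibrated scores, which this sketch omits; all names and numbers below are hypothetical.

```python
import math

def shift(conditioned, unconditioned):
    """Representational shift: conditioned state minus unconditioned state (assumed definition)."""
    return [c - u for c, u in zip(conditioned, unconditioned)]

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional states; real models would use hidden activations.
query_state = [0.2, 0.1, -0.3]
oracle_state = [0.7, 0.5, -0.1]          # query conditioned on an oracle document set
candidate_states = {
    "doc_a": [0.6, 0.4, -0.2],           # moves the query the same way as the oracle
    "doc_b": [-0.1, -0.2, -0.4],         # moves the query the opposite way
}

oracle_shift = shift(oracle_state, query_state)
scores = {
    name: cosine(shift(state, query_state), oracle_shift)
    for name, state in candidate_states.items()
}

# Zero threshold: keep only candidates whose shift aligns with the oracle shift.
kept = [name for name, score in scores.items() if score > 0.0]
```

With a signed score like this, zero is a natural cut-off, which is the property the abstract highlights as reducing reliance on heuristic thresholds.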

8. 【2606.17276】On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

Link: https://arxiv.org/abs/2606.17276

Authors: Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Generative recommendation, recommender systems, promising direction, direction for recommender, Generative

Comments:

Abstract:Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs' well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models; in fact, the vast majority of their gains over GR baselines are on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.
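
One way to read the one-hop memorization check is sketched below. It assumes a target counts as one-hop predictable when it directly follows some item of the user's history in at least one training sequence; the paper's exact criterion may differ, and the function names and data are hypothetical.

```python
def one_hop_transitions(train_sequences):
    """Collect every (item, direct successor) pair seen in the training sequences."""
    transitions = set()
    for sequence in train_sequences:
        for previous, following in zip(sequence, sequence[1:]):
            transitions.add((previous, following))
    return transitions

def is_one_hop_predictable(history, target, transitions):
    """True if the target directly follows any history item somewhere in training."""
    return any((item, target) in transitions for item in history)

# Hypothetical interaction data.
train_sequences = [["a", "b", "c"], ["b", "d"]]
transitions = one_hop_transitions(train_sequences)

covered = is_one_hop_predictable(["x", "b"], "d", transitions)      # "b" -> "d" was seen
not_covered = is_one_hop_predictable(["x", "a"], "c", transitions)  # only "a" -> "b" was seen
```

Splitting test users by this flag gives the two groups the abstract compares: those whose targets are covered by train-time one-hop transitions and those whose targets are not.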

9. 【2606.17209】Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

Link: https://arxiv.org/abs/2606.17209

Authors: Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: typically increases depth, agentic search typically, search typically increases, Test-time scaling, increases depth

Comments: 15 pages, 8 figures; under review at EMNLP 2026

Abstract:Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redundancy at the first turn. When models issue similar first queries across rollouts, the threads retrieve overlapping evidence, and subsequent turns are conditioned on this shared retrieval. We address this limitation with DivInit, a training-free intervention at the first turn. Rather than sampling k independent first queries, DivInit draws n candidates from a single call, picks k of them as diverse seeds, and runs them as parallel trajectories. Across five open-weight models and eight benchmarks, DivInit consistently improves over standard parallel sampling, with average gains of five to seven points on multi-hop QA at matched compute. Code available at this https URL.
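
The seed-selection step can be sketched as follows. The abstract does not say how diversity is measured, so this sketch assumes a greedy farthest-point selection over Jaccard distance between query token sets; both choices, and all names, are illustrative assumptions rather than the authors' method.

```python
def jaccard_distance(query_a, query_b):
    """1 minus the Jaccard overlap of the lowercase whitespace tokens of two queries."""
    tokens_a = set(query_a.lower().split())
    tokens_b = set(query_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)

def pick_diverse(candidates, k):
    """Greedily pick k candidates, each maximizing its distance to the closest chosen seed."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda cand: min(jaccard_distance(cand, seed) for seed in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical first-turn candidates drawn from a single model call (n = 4).
candidates = [
    "who directed the film Inception",
    "who directed Inception film",
    "Inception release year box office",
    "Christopher Nolan early career",
]
seeds = pick_diverse(candidates, k=2)
```

Each selected seed would then start its own parallel trajectory, so the rollouts begin from different retrieved evidence instead of near-duplicate first queries.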

10. 【2606.17397】Designing Recommendation Exposure and Favorite Lists: A Field Experiment in a Spot-Work Platform

Link: https://arxiv.org/abs/2606.17397

Authors: Kazuki Sekiya, Suguru Otani, Yuki Komatsu, Shunsuke Ozeki, Shunya Noda

Categories: General Economics (econ.GN); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)

Keywords: recommendations shape access, short-lived opportunities, access to scarce, systems be designed, shape access

Comments:

Abstract:How should recommender systems be designed when recommendations shape access to scarce, short-lived opportunities? We study this question in a production setting: Timee, Japan's largest platform for spot work, where workers favorite job templates and receive notifications when firms post shifts from those templates. Maximizing predicted favoriting can generate misdirected concentration: recommendations accumulate on popular templates that create few viable job openings, while templates with unmet labor demand receive too little exposure. We design exposure-control mechanisms for favorite-list management, reallocating template exposure based on posting activity and unfilled capacity. The proposed recommender, thresholded eligibility control (TEC), is fully parallelizable and suitable for large-scale digital platforms. In simulations calibrated to Timee data, TEC raises the per-round job-finding rate from 57.6% to 70.0%. A prefecture-level randomized field experiment increases realized matches and exposure per active template, reduces the share of low-exposure templates, and improves impression-level favoriting and downstream matching.

Computer Vision

1. 【2606.18250】Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion

Link: https://arxiv.org/abs/2606.18250

Authors: Nils Morbitzer, Jonathan Evers, Artem Savkin, Thomas Stauner, Nassir Navab, Federico Tombari, Stefano Gasperini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: environments is crucial, crucial for autonomous, Forecasting the evolution, dynamic environments, future dynamic

Comments: ICML 2026. Project page: [this https URL](https://fr3d-wm.github.io)

Abstract:Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: this https URL.

2. 【2606.18249】Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Link: https://arxiv.org/abs/2606.18249

Authors: Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Modeling aims, Unified Multimodal Modeling, Modeling aims, aims to integrate, visual

Comments: Accepted by ICML 2026. Project page: [this https URL](https://sharelab-sii.github.io/uniar-web)

Abstract:Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at this https URL.
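
A generic lookup-free bitwise quantizer can be sketched as below. It assumes the common sign-based formulation, where each latent dimension becomes one bit and the token index is read off the bit pattern; UniAR's actual scheme (multi-level fusion, spatial grouping) is not specified here and may differ.

```python
def bitwise_quantize(latent):
    """Quantize each latent dimension to one bit by its sign; no codebook lookup is needed."""
    bits = [1 if value > 0 else 0 for value in latent]
    # Quantized vector uses +1.0 / -1.0 per dimension.
    codes = [1.0 if bit else -1.0 for bit in bits]
    # Token index: bit i contributes 2**i.
    index = sum(bit << position for position, bit in enumerate(bits))
    return bits, codes, index

# Hypothetical 4-dimensional latent for one visual token.
latent = [0.7, -1.2, 0.1, -0.4]
bits, codes, index = bitwise_quantize(latent)

# With d bits per token the implicit vocabulary has 2**d entries.
vocabulary_size = 2 ** len(latent)
```

Because the vocabulary grows as 2^d while only d bits are stored or predicted per token, widening the latent scales the effective vocabulary without a larger embedding table, which matches the "minimal cost" claim in spirit.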

3. 【2606.18243】MOCHI: Motion Enhancement of Collaborative Human-object Interactions

Link: https://arxiv.org/abs/2606.18243

Authors: Jiye Lee, Yonghun Choi, Jungdam Won

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: require mutual anticipation, Collaborative human-object interaction, Collaborative Human-object Interactions, Collaborative human-object, MHOI

Comments: SIGGRAPH 2026 Journal (ACM TOG); Project page: [this https URL](https://jiyewise.github.io/projects/MOCHI/)

Abstract:Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose; these optimized grasps are then extended into complete hand-object interaction sequences. Consequently, the full-body motion for all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

4. 【2606.18242】EventDrive: Event Cameras for Vision-Language Driving Intelligence

Link: https://arxiv.org/abs/2606.18242

Authors: Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin, Lai Xing Ng, Benoit R. Cottereau, Camille Simon Chane, Wei Tsang Ooi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dynamic range, offering motion fidelity, Event cameras sense, capturing temporal structure, dynamic range

Comments: CVPR 2026, 34 pages, 15 figures, 15 tables, project page: [this https URL](https://dylanorange.github.io/projects/eventdrive)

Abstract:Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

5. 【2606.18231】Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

Link: https://arxiv.org/abs/2606.18231

Authors: Rishit Dagli, Donglai Xiang, Vismay Modi, Xuning Yang, Gavriel State, David I.W. Levin, Maria Shugrina

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Young modulus, Poisson ratio, reliable physics simulation, digital worlds, lack this information

Comments: Project page and hi-res paper: [this https URL](https://research.nvidia.com/labs/sil/projects/adavomp/). ICML 2026

Abstract:Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate dense spatially-varying ($E$, $\nu$, $\rho$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

6. 【2606.18208】Looped World Models

Link: https://arxiv.org/abs/2606.18208

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current world models, faithful long-horizon simulation, Current world, demands deep computation, world models face

Comments: Technical Report

7. 【2606.18198】Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

Link: https://arxiv.org/abs/2606.18198

Authors: Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo, Yebo Feng, Aishan Liu, Yang Liu

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: important attack surface, LLM-based systems, surface in LLM-based, skill, existing skill scanners

Comments:

Abstract:Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.

8. 【2606.18180】EgoCS-400K: An Egocentric Gameplay Dataset for World Models

Link: https://arxiv.org/abs/2606.18180

Authors: Rongjin Guo, Dong Liang, Yuhao Liu, Fang Liu, Tianyu Huang, Gerhard P. Hancke, Rynson W. H. Lau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: require temporally aligned, models require temporally, world models require, temporally aligned, world modeling places

Comments:

Abstract:The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.

9. 【2606.18156】ReAge3D: Re-Aging 3D Faces with View Consistency

Link: https://arxiv.org/abs/2606.18156

Authors: Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: produces highly detailed, identity-preserving results, realistic and controllable, highly detailed, framework for realistic

Comments:

Abstract:We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.

10. 【2606.18153】Neural Tree Reconstruction for the Open Forest Observatory

Link: https://arxiv.org/abs/2606.18153

Authors: Marissa Ramirez de Chanlatte, Arjun Rewari, Trevor Darrell, Derek J. N. Young

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open Forest Observatory, make low-cost forest, low-cost forest mapping, forest mapping accessible, Forest Observatory

Comments: Published as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2024

Abstract:The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO's forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor where the input data (overhead imagery) has limited visibility. These reconstruction errors can potentially propagate to the downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.

11. 【2606.18123】Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology

Link: https://arxiv.org/abs/2606.18123

Authors: Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey, Lorraine Colón-Cartagena, Emily Ling-Lin Pai, Kenneth Tou En Chang, Mohamed Kahila, Jonathan Chong Kai Liew, Tinglin Huang, Rex Ying, Kaize Ding, Faisal Mahmood, Wengong Jin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advancing precision oncology, single image modalities, Predicting immune biomarkers, precision oncology, biological information

Comments: 5 figures

Abstract:Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image-only (UNIv2), image-text (CONCHv1.5), and image-transcriptomic (STPath) representations for pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (HE) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein-gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
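
The router-weighted combination of experts can be illustrated with a minimal sketch. It assumes a standard softmax gate over per-expert logits and a weighted sum of expert outputs; how MixTIME computes its router logits and what its experts emit is not given in the abstract, so every value and name below is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def combine_experts(expert_outputs, router_logits):
    """Weight each expert's output vector by its softmax gate and sum them."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [
        sum(weight * output[i] for weight, output in zip(weights, expert_outputs))
        for i in range(dim)
    ]

# Hypothetical 2-dimensional features from three experts
# (standing in for image-only, image-text, and image-transcriptomic encoders).
expert_outputs = [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]]

uniform_mix = combine_experts(expert_outputs, [0.0, 0.0, 0.0])   # equal weights
skewed_mix = combine_experts(expert_outputs, [10.0, 0.0, 0.0])   # dominated by expert 0
```

In a trained model the logits would come from a learned function of the input, so the mix can shift toward whichever expert is most informative for a given region or slide.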

12. 【2606.18115】HLS-GPT: A Generative Pretrained Transformer (GPT) for Continental-Scale NASA Harmonized Landsat and Sentinel-2 (HLS) Reflectance Reconstruction Across All Bands on Arbitrary Dates

Link: https://arxiv.org/abs/2606.18115

Authors: Junjie Li, Hankui K. Zhang, David P. Roy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent deep learning, short temporal contexts, NASA Harmonized Landsat, Recent deep, limited geographic scalability

Comments:

Abstract:Recent deep learning methods for Landsat and Sentinel-2 reflectance time series reconstruction remain limited by restricted spectral coverage, limited geographic scalability, or patch-based designs with short temporal contexts. We present HLS-GPT, a large-scale generative pretrained Transformer model for reconstructing NASA Harmonized Landsat Sentinel-2 30 m surface reflectance for all bands, any date, and any pixel location. HLS-GPT uses a hierarchical Transformer architecture to handle the different spectral band configurations of Landsat and Sentinel-2 and operates on single-pixel 12-month time series. To capture geographic and seasonal variability, the model was trained with nine years of HLS time series from more than 0.25 million training pixels across the conterminous United States. A random cropping and masking strategy extracts 12-month periods with varying start dates across epochs, masks 50% of valid observations, and trains the model to reconstruct the masked reflectance values from the remaining observations. Evaluation using more than 62,000 independent test pixels shows robust reconstruction under diverse land surface conditions, including complex crop phenology and sparse, irregular observations. Leave-one-observation-out evaluation achieved reconstruction RMSE below 0.026 for all HLS spectral bands, with relative RMSE below 35% for visible bands and below 13% for other bands. Red-edge band errors were comparable to red and near-infrared errors despite the absence of red-edge bands on Landsat. Sensitivity analyses that randomly masked 10% to 90% of test observations showed only modest degradation when 10% to 50% of observations were masked, with all-band RMSE below 0.028. Image reconstruction over nine independent 109 by 109 km CONUS HLS tiles further demonstrates that HLS-GPT outperforms two conventional methods and the NASA-IBM Prithvi model.
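
The cropping and masking strategy, together with the RMSE used for evaluation, can be sketched on a toy series. The sketch assumes one value per time step with `None` marking invalid observations and a 50% mask over the valid ones; real HLS series are irregularly dated and multi-band, and all names and values here are hypothetical.

```python
import math
import random

def crop_window(series, length, rng):
    """Pick a contiguous window of `length` steps with a random start position."""
    start = rng.randrange(len(series) - length + 1)
    return series[start:start + length]

def split_masked(window, rng, fraction=0.5):
    """Mask a fraction of the valid (non-None) observations; return visible and masked indices."""
    valid = [i for i, value in enumerate(window) if value is not None]
    masked = set(rng.sample(valid, int(len(valid) * fraction)))
    visible = [i for i in valid if i not in masked]
    return visible, sorted(masked)

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length lists."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical 24-step reflectance series; every fifth step is missing (e.g. cloud).
series = [0.10 + 0.01 * i if i % 5 else None for i in range(24)]

rng = random.Random(0)                       # seeded for repeatability
window = crop_window(series, 12, rng)        # a different start is drawn on each call
visible, masked = split_masked(window, rng)  # the model would see `visible`, predict `masked`

example_error = rmse([0.0, 0.0], [0.3, 0.4])
```

Redrawing the window start and the mask every epoch means the same pixel yields many different reconstruction problems, which is the augmentation effect the abstract attributes to its random cropping and masking.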

13. 【2606.18112】Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Link: https://arxiv.org/abs/2606.18112

Authors: Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving share, object search, target tracking, inference time, base navigation model

Comments:

Abstract:Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

14. 【2606.18069】Blended Chart Surfaces: A Seamless Explicit Representation for Smooth Surface Fitting

Link: https://arxiv.org/abs/2606.18069

Authors: Romy Williamson,Niloy Mitra

Categories: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Blended Chart Surfaces, provide global smoothness, global smoothness guarantees, Blended Chart, Chart Surfaces

Comments: 17 pages, 16 figures

Abstract:A surface representation suitable for geometry processing should be compact and explicit, provide global smoothness guarantees, support a wide range of surface topologies, and offer reliable access to differential quantities such as normals and surface energies, while remaining compatible with modern differentiable optimization. Existing neural representations typically sacrifice one or more of these properties: implicit fields typically require iso-surfacing for downstream use, while explicit neural maps are constrained by canonical-domain parametrizations or exhibit seam artifacts between local charts. We introduce Blended Chart Surfaces, a compact, network-free, explicit representation that is smooth by construction and anchored to user-provided topology. Given a coarse proxy mesh encoding the intended surface topology and approximate geometry, Blended Chart Surfaces jointly optimize for a polynomial map at each proxy vertex using an off-the-shelf optimizer to fit to an implicit target shape, avoiding the need for an input parametrization. Neighboring maps are fused using a smooth 'one-ring coordinate' blending scheme, decoupling topology and coarse geometry (carried by the proxy) from geometric details (carried by the local patches). The surface is globally smooth, fully differentiable, and enables stable evaluation of derivatives, making differential quantities and surface energies directly accessible. Additionally, our construction is equivariant to rigid motions and scaling of the proxy mesh. We evaluate Blended Chart Surfaces on various topologies and geometric complexity, and compare against explicit alternatives including interpolating-function baselines and mesh-displacement MLPs. Across these, Blended Chart Surfaces achieve a favorable trade-off among compactness, simplicity, access to differential quantities, and expressivity while remaining smooth across patch boundaries.

15. 【2606.18063】When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

Link: https://arxiv.org/abs/2606.18063

Authors: Ruman Wang,Hangting Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: models achieve remarkable, real-world clinical scenarios, achieve remarkable performance, fundamental dilemma, annotation costs

Comments:

Abstract:Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers under limited data conditions, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

16. 【2606.18008】PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution

Link: https://arxiv.org/abs/2606.18008

Authors: Zihan Gu,Ruoyu Chen,Junchi Zhang,Li Liu,Xiaochun Cao,Hua Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interpreting modern vision, fundamental tool, tool for interpreting, interpreting modern, modern vision

Comments: 26 pages, 29 figures

Abstract:Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model's decision depends on local regions of the visual input, typically by assigning an importance ordering over candidate image regions. Given an image partitioned into $n$ regions, faithful attribution can be cast as an ordered subset-search problem, in which progressively inserting the selected regions should recover the target model response as early as possible. Exhaustive search over region subsets incurs exponential cost, while the widely used greedy search still requires a quadratic number of model evaluations, because every selection step rescores all remaining candidates. We propose PhaseWin, an efficient subset-search algorithm for faithful visual attribution. PhaseWin reorganizes greedy region selection into a phased window-search procedure: rather than re-evaluating the full candidate set at every step, it alternates between global candidate screening, adaptive pruning, and localized window refinement, while preserving the essential region-ranking behavior of greedy search. We analyze PhaseWin under monotone evidence-accumulation conditions and show that, under feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees. Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the predicted reduction from $O(n^2)$ to $O(n)$. The code is available at this https URL.
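
The quadratic cost that the abstract attributes to plain greedy search can be made concrete with a small sketch. The code below is not PhaseWin and is not taken from the paper; it only illustrates the greedy baseline as the abstract describes it, where every selection step rescores all remaining candidates. The function name `greedy_insertion_order`, the `score` callback, and the additive toy weights are hypothetical, chosen purely for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple


def greedy_insertion_order(
    n: int, score: Callable[[FrozenSet[int]], float]
) -> Tuple[List[int], int]:
    """Greedy ordered subset search over n regions.

    `score` stands in for one model evaluation on the image restricted to
    the given regions (a hypothetical callback, not an API from the paper).
    Returns the region ordering and the number of score evaluations used.
    """
    selected: List[int] = []
    remaining = set(range(n))
    evaluations = 0
    while remaining:
        best_region, best_value = -1, float("-inf")
        # Every step rescores all remaining candidates: n, n-1, ..., 1 calls.
        for region in sorted(remaining):
            value = score(frozenset(selected + [region]))
            evaluations += 1
            if value > best_value:
                best_region, best_value = region, value
        selected.append(best_region)
        remaining.remove(best_region)
    return selected, evaluations


# Toy additive "model response": each region contributes a fixed weight.
weights = [0.1, 0.7, 0.2, 0.5]


def toy_score(subset: FrozenSet[int]) -> float:
    return sum(weights[i] for i in subset)


order, evals = greedy_insertion_order(len(weights), toy_score)
print(order, evals)
```

For n regions this loop performs n(n+1)/2 evaluations, which is the quadratic growth the abstract refers to.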

17. 【2606.17998】AIGS-Net: Compact Illumination Field Modeling via 2D Gaussian Splatting for Fast Low-Light Image Enhancement

Link: https://arxiv.org/abs/2606.17998

Authors: Yuhan Chen,Kunyang Huang,Fuchen Li,Zhuohan Qin,Guofa Li,Wenbo Chu,Keqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing low-light image, Gaussian Splatting Network, Illumination Gaussian Splatting, Gaussian Splatting illumination, Existing low-light

Comments:

Abstract:Existing low-light image enhancement methods often face a bottleneck between the representation capacity of illumination-field modeling and computational complexity. To address this issue, this paper proposes an Adaptive Illumination Gaussian Splatting Network (AIGS-Net), an ultra-lightweight architecture for fast low-light enhancement. Unlike conventional static priors, AIGS-Net constructs an input-adaptive 2D Gaussian Splatting illumination field. The opacity of Gaussian basis functions is dynamically modulated by relative luminance statistics of the input image, and spatially varying illumination compensation is rendered through ordered alpha compositing. To guide adaptive illumination compensation efficiently, a zero-parameter nonlinear multiscale contextual encoding module is introduced to extract low-frequency structures and local contrast cues without additional convolutional weights. To suppress noise amplification and sensor-induced color bias, AIGS-Net integrates noise-mask estimation, locked single-channel Gamma mapping, cross-channel consistency regularization, and target color-alignment constraints. Experiments on LOL and LSRW benchmarks show that AIGS-Net improves detail recovery and color fidelity while requiring only approximately 40 learnable parameters, achieving an effective trade-off between enhancement quality and extreme inference efficiency.

18. 【2606.17989】Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

Link: https://arxiv.org/abs/2606.17989

Authors: Yonghao Chen,Sicheng Yang,Rui Tang,Lei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: magnetic resonance imaging, Multi-contrast magnetic resonance, resonance imaging, clinical diagnosis, magnetic resonance

Comments: Code: [this https URL](https://github.com/script-Yang/RSF)

Abstract:Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at this https URL.

19. 【2606.17985】Gaussian Light Field Splatting: A Physical Prior-Driven Vision Transformer for Unsupervised Low-Light Image Enhancement

Link: https://arxiv.org/abs/2606.17985

Authors: Yuhan Chen,Wenxuan Yu,Guofa Li,Fuchen Li,Kunyang Huang,Yicui Shi,Ying Fang,Wenbo Chu,Keqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing unsupervised low-light, Existing unsupervised, encounter local exposure, local exposure imbalance, Vision Transformers lack

Comments:

Abstract:Existing unsupervised low-light image enhancement methods often encounter local exposure imbalance and color distortion under complex non-uniform illumination. In addition, most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation. To address these limitations, we propose GLFS, a Gaussian light field splatting-based Vision Transformer that integrates continuous physical illumination modeling from Gaussian splatting into the Transformer architecture. In GLFS, scene illumination is represented by a superposition of anisotropic Gaussian basis functions. Physics-guided biases are introduced into self-attention to adaptively infer a spatial gain field, enabling accurate and uniform restoration under complex illumination. To reduce color bias and structural degradation during enhancement, a color-vector angular loss and a luminance-edge loss are further developed. These losses enforce hue consistency and improve the structural fidelity of local details. Extensive ablation studies and quantitative evaluations show that GLFS provides clear advantages in illumination correction and detail preservation. It achieves state-of-the-art performance and offers a new representation paradigm for low-light image enhancement.

20. 【2606.17972】SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation

Link: https://arxiv.org/abs/2606.17972

Authors: Sicheng Yang,Hongqiu Wang,Zhaohu Xing,Sixiang Chen,Qiuxia Yang,Yize Mao,Guang Yang,Lei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Self-supervised DINO models, transferable visual representations, models provide strong, provide strong transferable, strong transferable visual

Comments: Code: [this https URL](https://github.com/script-Yang/segdino_v2)

Abstract:Self-supervised DINO models provide strong transferable visual representations, yet applying them directly to image segmentation remains challenging. Existing approaches commonly rely on heavy decoders with complex upsampling, introducing substantial parameter and computational overhead. We observe that introducing scale into DINO features is far more critical than increasing decoder capacity. In this work, we present SegDINO, an efficient segmentation framework that integrates a DINOv3 backbone with lightweight scale modeling. SegDINO introduces Token Pyramid Adaptation (TPA) to reorganize intermediate DINO features into a pseudo multi-scale hierarchy, and Scale-Aware Decoding (SAD) for efficient intra-scale refinement and top-down multi-scale propagation. We further curate PanCT, a new CT dataset containing 284 patients with expert-annotated pancreatic tumors, to assess SegDINO's ability to handle difficult small-lesion cases. Extensive experiments on PanCT and three public benchmarks demonstrate that SegDINO achieves state-of-the-art results with high efficiency. The code is available at this https URL.

21. 【2606.17966】Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

Link: https://arxiv.org/abs/2606.17966

Authors: Sheng-Wei Chan,Hsin-Jui Pan,Jen-Shiun Chiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mamba-based state space, state space models, space models offer, models offer linear-time, offer linear-time long-range

Comments: 23 pages, 4 figures, 17 tables. Code will be released soon

Abstract:Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.
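
The per-pixel class entropy mentioned in design (ii) is a standard quantity, and a minimal sketch may help readers picture it. The code below computes Shannon entropy of the softmax distribution at each pixel; how Reload-Mamba actually feeds this signal into its Reload Gate is not specified in the abstract, so nothing here should be read as the authors' implementation. The function names and the nested-list layout (rows, then columns, then class logits) are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_map(logit_map: List[List[List[float]]]) -> List[List[float]]:
    """Apply class_entropy at every pixel of an H x W x C logit grid."""
    return [[class_entropy(pixel) for pixel in row] for row in logit_map]


# A 1 x 2 "image" with 4 classes: one uncertain pixel, one confident pixel.
toy_logits = [[[0.0, 0.0, 0.0, 0.0], [20.0, 0.0, 0.0, 0.0]]]
toy_entropy = entropy_map(toy_logits)
print(toy_entropy)
```

A uniform prediction over C classes reaches the maximum entropy log C, while a confident prediction is close to zero, so the map highlights pixels where the auxiliary head is unsure between classes.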

22. 【2606.17961】Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

Link: https://arxiv.org/abs/2606.17961

Authors: Andrea Santomauro,Luigi Portinale,Giorgio Leonardi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Transformer architectures, Positional encoding, arrangement of inputs, injects information, spatial or sequential

Comments:

Abstract:Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.

23. 【2606.17958】Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2606.17958

Authors: Yuming Chen,Yuxin Xie,Tao Zhou,Yi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Semi-supervised medical image, medical image analysis, mitigating annotation scarcity, medical image segmentation, dominant research problem

Comments: Accepted to MICCAI 2026

Abstract:Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily on pixel-level similarities. This visual-centric dependency often falters in clinical scenarios characterized by the visual-semantic mismatch, where visually similar lesions warrant distinct diagnostic conclusions, thus failing to capture the underlying diagnostic logic used by experts. To address this, we move beyond visual cues and propose CERS (CoT-Enhanced Reasoning Segmentation), a framework that integrates Chain-of-Thought (CoT) reasoning to distinguish pathologically distinct cases. Specifically, we construct a knowledge pool enriched with linguistic reasoning descriptions generated by large language models (LLMs). A semantic-aware reference selection strategy is introduced to identify historical evidence, filtering candidates first by morphology, and then refining them via CoT consistency to eliminate hard negatives. Furthermore, a multi-scale coordinate attention module (MCAM) is designed to effectively fuse this reasoning-derived context into the decoding process. Extensive experiments demonstrate the superiority of CERS against state-of-the-art approaches, particularly in resolving boundary ambiguities and semantic inconsistencies. The code is available at this https URL.

24. 【2606.17953】MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

Link: https://arxiv.org/abs/2606.17953

Authors: Xingming Li,Ao Cheng,Qiyao Sun,Xixiang He,Xuanyu Ji,Runke Huang,Qingyong Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, images provide clear, provide clear evidence, consistently favor text, large language models

Comments: Accepted at IJCAI 2026. 16 pages, 10 figures

Abstract:When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they are correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.

25. 【2606.17950】Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

Link: https://arxiv.org/abs/2606.17950

Authors: Jinghan Wu,Jing Li,Ivor W. Tsang,Xuetao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multi-modal Coreference Resolution, notable performance gains, coreference resolution, existing Multi-modal Coreference, leading to notable

Comments:

Abstract:Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model to MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31% and 2.12% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.

26. 【2606.17935】MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

Link: https://arxiv.org/abs/2606.17935

Authors: Guo Pu,Yixuan Han,Haofeng Li,Yao Zhang,Hui Zhou,Zhouhui Lian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ongoing research topic, monocular image sequences, research topic, monocular image, image sequences

Comments: SIGGRAPH 2026

Abstract:Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds great promise for a wide range of applications such as robotics and AR/VR. However, existing online 3DGS methods still suffer from some key challenges: fragile camera pose estimation due to the lack of global optimization, and low optimization efficiency in large-scale or long-sequence scenarios. To address these issues, we propose a robust and efficient online voxelized 3DGS reconstruction framework integrated with global $\text{Sim}(3)$ optimization, which enables reliable camera tracking and efficient global loop closure for both camera poses and voxelized 3DGS. To accelerate the convergence of the voxelized 3DGS, we further introduce a color residual learning strategy, which not only boosts optimization speed but also enhances rendering quality. Extensive experiments on diverse indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance in both camera pose estimation accuracy and rendering quality, while retaining real-time efficiency. Additionally, we develop and deploy a real-world UAV-based active reconstruction system grounded on our proposed method, validating its robustness and generalizability for practical online 3D reconstruction tasks. Our code and data are available at this https URL.

27. 【2606.17874】Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations

Link: https://arxiv.org/abs/2606.17874

Authors: Takaya Kawakatsu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multi-task table recognition, jointly addresses table, table structure prediction, recognition jointly addresses, addresses table structure

Comments: ICDAR 2026

Abstract:Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.

28. 【2606.17867】A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

Link: https://arxiv.org/abs/2606.17867

Authors: Antonio Scardace,Daniele Ravì

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain poorly understood, modalities remain poorly, approaches in Alzheimer, enhance disease characterization, Alzheimer Disease

Comments: Accepted to ICTS4eHealth 2026

Abstract:Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: this https URL.

29. 【2606.17846】Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Link: https://arxiv.org/abs/2606.17846

Authors: Haoqi Yuan,Zhixuan Liang,Anzhe Chen,Ye Wang,Haoyang Li,Pei Lin,Yiyang Huang,Zixing Lei,Tong Zhang,Jiazhao Zhang,Jie Zhang,Jingyang Fan,Gengze Zhou,Qihang Peng,Chenxu Lv,Xiaoyue Chen,An Yang,Fei Huang,Junyang Lin,Dayiheng Liu,Jingren Zhou,Chenfei Wu,Xiong-Hui Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multimodality achieve strong, achieve strong generalization, language and multimodality, aligning heterogeneous data, achieve genuine generalization

Comments: 44 pages

Abstract:Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi$0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

30. 【2606.17836】High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach

链接：https://arxiv.org/abs/2606.17836

作者：Hui Wang,Xiaowei Li,Chenxin Zhang,Yifan Feng,Jianwei Zuo,Yumeng Tang,Xiuli Sun,Jianliu Wang,Bing Xie,Jiajia Luo

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Graphics (cs.GR)

关键词：downstream patient-specific analysis, geometry from MRI, MRI is important, patient-specific analysis, downstream patient-specific

备注：

点击查看摘要

Abstract:Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. The study introduced a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism--where iterative optimization provides supervision for deep learning during the training phase, and during inference, deep learning rapidly predicts the global organ morphology, followed by iterative optimization to refine local surfaces and mesh quality. This framework demonstrated marked superiority in geometric fidelity over current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
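
The abstract above reports results as Chamfer Distance and Dice Similarity Coefficient. As a reading aid, here is a minimal sketch of both metrics in their common textbook form. The paper's exact variants (for example squared distances, or a different normalization) are not stated in the abstract, so the definitions and the function names `chamfer_distance` and `dice_coefficient` are assumptions for illustration only.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (lists of 3D tuples).

    Assumed form: mean nearest-neighbour distance from a to b plus the same
    from b to a. Other papers use squared distances or average the two terms.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def dice_coefficient(x, y):
    """Dice similarity between two sets of occupied voxel indices."""
    if not x and not y:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))


# Toy data, not taken from the paper.
surface_a = [(0, 0, 0), (1, 0, 0)]
surface_b = [(0, 0, 0), (1, 0, 1)]
voxels_a = {1, 2, 3, 4}
voxels_b = {3, 4, 5, 6}
print(chamfer_distance(surface_a, surface_b), dice_coefficient(voxels_a, voxels_b))
```

Lower Chamfer distance means the surfaces lie closer together; a Dice score closer to 1 means the volumes overlap more.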

31. 【2606.17824】Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows

链接：https://arxiv.org/abs/2606.17824

作者：Paul Julius Kühn,Saptarshi Neil Sinha,Jakob Hansen,Robin Horst

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：regions remains challenging, require user control, meaningful regions remains, assets into meaningful, remains challenging

备注：

点击查看摘要

Abstract:Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model's UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
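
The view-selection step mentioned in this abstract (a greedy set cover over sampled surface points) can be sketched as follows. This is a generic greedy set cover, not the authors' implementation: the input format (a dict from view id to the set of visible point indices), the `coverage` parameter, and the function name `greedy_view_selection` are all hypothetical.

```python
def greedy_view_selection(visible, coverage=1.0):
    """Pick views greedily until `coverage` of the coverable points are seen.

    visible: dict mapping a view id to the set of surface-point indices
             visible from that view (hypothetical input format).
    Returns the chosen view ids in selection order and the covered points.
    """
    coverable = set().union(*visible.values())
    target = coverage * len(coverable)
    covered, chosen = set(), []
    while len(covered) < target:
        # Take the view that adds the most not-yet-covered points.
        best = max(visible, key=lambda v: len(visible[v] - covered))
        gain = visible[best] - covered
        if not gain:
            break  # no remaining view adds anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Toy visibility table: 8 sampled points, 4 candidate views.
views = {
    "front": {0, 1, 2, 3},
    "back": {4, 5, 6},
    "top": {2, 3, 4},
    "side": {0, 6, 7},
}
chosen, covered = greedy_view_selection(views)
print(chosen)  # ['front', 'back', 'side']
```

Greedy selection does not guarantee the smallest possible view set, but it is simple and gives the classic logarithmic approximation bound for set cover, which is usually enough when each extra view costs manual annotation time.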

32. 【2606.17809】Million-scale multimodal pollen microscopy with expert-guided foundation models

链接：https://arxiv.org/abs/2606.17809

作者：András Biricz,Björn Gedda,Donát Magyar,Antonio Spanu,János Fillinger,Péter Pollner,István Csabai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Automated pollen identification, retaining palynological interpretability, Automated pollen, scanner settings, bottleneck in aerobiology

备注： 31 pages, 5 main figures, supplementary information included. Submitted to Scientific Reports

点击查看摘要

Abstract:Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address this gap, we present a million-scale multimodal pollen microscopy resource, Pollen AI Atlas, assembled from pure-species whole-slide bright-field images spanning four geographic origins, four scanner settings and 46 taxon labels across 31 botanical families. Seeded by one manually selected exemplar per source slide, token-level mining and filtering produced 1,511,390 released grain detections with 99.6% proposal precision in expert-curated test regions. Each detection was paired with machine-generated grain-level morphological captions from five open-weight vision-language models, guided by expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape and size. Among the evaluated models, Gemma4 provided the most controlled primary caption set, combining tight length control, no leakage and the strongest text-retrieval performance. Baseline benchmarks with frozen visual features reached 88.16% top-1 accuracy, while cross-regional retrieval showed that caption-derived text embeddings remained robust when image similarity degraded (mAP@20 0.811 versus 0.262). Released data, annotations, captions, splits, code, and weights provide a benchmark for pollen recognition, cross-regional domain adaptation and domain-specific multimodal microscopy learning.

33. 【2606.17800】MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

链接：https://arxiv.org/abs/2606.17800

作者：Lichen Bai,Tianhao Zhang,Shitong Shao,Dingwei Tan,Qiyu Zhong,Zhengpeng Xie,Haopeng Li,Qinghao Huang,Dandan Shen,Tengjiao Ji,Wei Wang,Peicheng Wu,Yuxuan Zhao,Xiangyu Zhu,Welly Luo,Shurui Yang,Zeke Xie

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：global video content, interactive social purposes, global video, video content, increasing majority

备注： 32 pages, 13 figures, 3 tables

点击查看摘要

Abstract:As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

34. 【2606.17798】LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

链接：https://arxiv.org/abs/2606.17798

作者：Zhenyu Yang,Kairui Zhang,Bing Wang,Shengsheng Qian,Changsheng Xu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Video Large Language, Large Language Models, Large Language, simultaneously process continuous, process continuous video

备注：

点击查看摘要

Abstract:Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at this https URL.

35. 【2606.17791】The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

链接：https://arxiv.org/abs/2606.17791

作者：Samar Ansari

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：tools increasingly summarize, reformat radiology reports, large language models, documentation tools increasingly, increasingly summarize

备注：

点击查看摘要

36. 【2606.17742】BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics

链接：https://arxiv.org/abs/2606.17742

作者：Junfeng Xia,Wenhao Ye,Junxiang Zhang,Xuanye Pan,Mo Wang,Quanying Liu

类目：Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

关键词：conditional predictive generation, existing fMRI foundation, conditional predictive, fMRI foundation models, functional brain dynamics

备注：

点击查看摘要

Abstract:Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.

37. 【2606.17739】ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

链接：https://arxiv.org/abs/2606.17739

作者：Lina Magoula,Nikolaos Koursioumpas,Nancy Alonistioti,Ramin Khalili

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

关键词：natural disaster management, strict operational constraints, support environmental monitoring, resource limitations, disaster management

备注： 14 pages, 9 figures

点击查看摘要

Abstract:Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.

38. 【2606.17730】ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

链接：https://arxiv.org/abs/2606.17730

作者：Zhexiao Xiong,Yizhi Song,Hao Kang,Qing Yan,Liming Jiang,Jenson Yang,Zhoujie Fu,Stathi Fotiadis,Angtian Wang,Zichuan Liu,Bo Liu,Yiding Yang,Xin Lu,Nathan Jacobs

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：simulate environment dynamics, real-time user actions, aim to simulate, simulate environment, environment dynamics

备注： Project page: [this https URL](https://interactwm.github.io/ActWorld)

点击查看摘要

Abstract:Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at this https URL.

39. 【2606.17722】GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening

链接：https://arxiv.org/abs/2606.17722

作者：Fangyi Li,Xiaoyuan Yang,Yixiao Li,Zongyang Sui,Kangqing Shen,Gemine Vivone

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：generate high-resolution multispectral, fusing low-resolution multispectral, high-resolution multispectral, low-resolution multispectral, aims to generate

备注：

点击查看摘要

Abstract:Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.

40. 【2606.17713】Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

链接：https://arxiv.org/abs/2606.17713

作者：Jiangong Xu,Weibao Xue,Xiaoyu Yu,Jun Pan,Xinlian Lianga,Mi Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Optical remote sensing, remote sensing imagery, land cover, remote sensing, frequently degraded

备注：

点击查看摘要

Abstract:Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions. The project is publicly available at: this https URL

41. 【2606.17711】Structured Adversarial Camouflage via Voronoi Diagrams

链接：https://arxiv.org/abs/2606.17711

作者：Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Pixel-wise adversarial patches, Pixel-wise adversarial, visually detectable, limiting utility, present adversarial Voronoi

备注：

点击查看摘要

Abstract:Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria - COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code: this https URL This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026.
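
To make the "soft assignment" idea in this abstract concrete, here is a minimal sketch of one common way to relax a Voronoi diagram so that seed locations become differentiable: each pixel blends palette colors with a softmax over negative squared distances to the seeds. The abstract does not give the paper's formulation, so the softmax form, the `temperature` parameter, the one-color-per-seed pairing, and the name `soft_voronoi_color` are assumptions.

```python
import math


def soft_voronoi_color(pixel, seeds, palette, temperature=0.05):
    """Blend palette colors by a softmax over negative squared seed distances.

    pixel:   (x, y) in normalized image coordinates.
    seeds:   list of (x, y) seed points, one per palette entry (assumption).
    palette: list of fixed RGB tuples; only the seeds would be optimized.
    """
    logits = [
        -((pixel[0] - sx) ** 2 + (pixel[1] - sy) ** 2) / temperature
        for sx, sy in seeds
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    return tuple(
        sum(w * color[k] for w, color in zip(weights, palette)) for k in range(3)
    )


# Two seeds with a black and a white palette entry (toy values).
seeds = [(0.25, 0.5), (0.75, 0.5)]
palette = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
midpoint = soft_voronoi_color((0.5, 0.5), seeds, palette)
near_first = soft_voronoi_color((0.25, 0.5), seeds, palette)
print(midpoint, near_first)
```

As `temperature` shrinks, the blend approaches a hard Voronoi partition in which each pixel takes the color of its nearest seed.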

42. 【2606.17710】Vision-language models for chest radiography do not always need the image

链接：https://arxiv.org/abs/2606.17710

作者：Mahshad Lotfinia,Sebastian Ziegelmayer,Lisa Adams,Daniel Truhn,Andreas Maier,Soroosh Tayebi Arasteh

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Medical vision-language models, report strong chest, strong chest radiograph, Medical vision-language, vision-language models report

备注：

点击查看摘要

43. 【2606.17702】SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

链接：https://arxiv.org/abs/2606.17702

作者：Wan Siti Halimatul Munirah Wan Ahmad,Faris Syahmi Samidi,Mohammad Badal Ahmmed,Vimal Angela Thiviyanathan,Selvam James Thavaraj,Anwar P.P. Abdul Majeed

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：routine HE-stained histology, requires simultaneous cell, Characterising the tumour, simultaneous cell segmentation, interpretable clinical reporting

备注：

点击查看摘要

Abstract:Characterising the tumour microenvironment (TME) from routine HE-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on 100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality: Stage 1: Uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 um/pixel). Stage 2: Uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 um/pixel). Stage 3: Uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 um/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.

44. 【2606.17678】See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

链接：https://arxiv.org/abs/2606.17678

作者：Yilian Liu,Sicong Leng,Guoshun Nan,Junyi Zhu,Jiayu Huang,Minghao Sun,Xuancheng Zhu,Yisong Chen,Zexian Wei,Xiaofeng Tao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Multimodal large language, integrate strong text, indicating ineffective utilization, strong text reasoning, Multimodal large

备注：

点击查看摘要

Abstract:Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that our VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.

45. 【2606.17675】Do We Really Need Diffusion? A Fast U-Net for Paired Medical Image Translation

链接：https://arxiv.org/abs/2606.17675

作者：Alicia Pirwass,Birte Glimm,Michael Munz,Hans-Joachim Wilke

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Magnetic resonance imaging-signal, imaging-signal fat fraction, quantifies tissue fat, resonance imaging-signal fat, Magnetic resonance

备注：

点击查看摘要

Abstract:Magnetic resonance imaging-signal fat fraction (MRI-SFF) quantifies tissue fat and serves as an established biomarker for metabolic and musculoskeletal disorders. The acquisition requires, however, specialized MRI sequences, which are not available routinely. We investigate whether SFF can be estimated from widely available T2-weighted (T2w) MRI via image-to-image translation (I2I). We further compare a lightweight 4-level U-Net to a state-of-the-art Denoising Diffusion Probabilistic Model (DDPM) using a dataset of 230 048 paired 2D images (183 517 train, 23 621 val, 22 910 test) from the German National Cohort (NAKO). Both models clearly outperform the identity baseline (Pearson correlation r = 0.769, mean absolute error MAE = 0.070 +/- 0.054), which confirms that the models learn a non-trivial cross-modal mapping. Interestingly, the lightweight U-Net outperforms the DDPM in both correlation (r = 0.975 vs. 0.962) and error (MAE = 0.014 +/- 0.015 vs. 0.019 +/- 0.019), while reducing inference time by a factor of 208 (25.2 ms vs. 5 227.2 ms per image using 50 Denoising Diffusion Implicit Model (DDIM) steps). The strong clinical performance at substantially reduced computational cost enables real-time clinical use.

46. 【2606.17650】MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

链接：https://arxiv.org/abs/2606.17650

作者：Hao-Yuan Ma,Li Zhang,Minjie Qiang,Jie Gao

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Text-guided Open-vocabulary Object, Open-vocabulary Object Counting, Text-guided Open-vocabulary, large scale variations, Object Counting

备注：

点击查看摘要

47. 【2606.17644】Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

链接：https://arxiv.org/abs/2606.17644

作者：Nick Jochum,Tobias Alt-Veit,Christian Schön,Alexander Lück,René Schuster,Didier Stricker

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：undergo continuous refinement, scenarios typically grow, processing scenarios typically, annotations undergo continuous, Label Propagation

备注： 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria

点击查看摘要

Abstract:Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to come up with a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. In the D4LA layout analysis dataset, it achieves a mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
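
For readers unfamiliar with label propagation, the sketch below shows the classic normalized-graph iteration (often called label spreading) on a handful of embedding vectors. It is a generic illustration, not BBLP itself: the fully connected Gaussian affinity graph, the parameters `sigma`, `alpha`, and `iters`, and the function name `propagate_labels` are assumptions, and the paper's object encoder is replaced here by hand-written 2D points.

```python
import math


def propagate_labels(embeddings, labels, num_classes, sigma=1.0, alpha=0.9, iters=50):
    """Spread known class labels over an affinity graph of embeddings.

    embeddings: list of equal-length tuples (at least two items).
    labels:     class index per item, or None where the item is unlabelled.
    Returns one predicted class index per item.
    """
    n = len(embeddings)
    # Gaussian affinities between embeddings, zero on the diagonal.
    w = [
        [
            0.0 if i == j
            else math.exp(-math.dist(embeddings[i], embeddings[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]
    deg = [sum(row) for row in w]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    s = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # One-hot rows for labelled items, all-zero rows for unlabelled ones.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(num_classes)] for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(iters):
        # F <- alpha * S F + (1 - alpha) * Y
        f = [
            [
                alpha * sum(s[i][j] * f[j][c] for j in range(n)) + (1 - alpha) * y[i][c]
                for c in range(num_classes)
            ]
            for i in range(n)
        ]
    return [max(range(num_classes), key=lambda c: f[i][c]) for i in range(n)]


# Two well-separated toy clusters with one labelled item each.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
known = [0, None, None, 1, None, None]
predicted = propagate_labels(points, known, num_classes=2)
print(predicted)  # [0, 0, 0, 1, 1, 1]
```

In practice the graph is usually restricted to each item's k nearest neighbours so that the affinity matrix stays sparse on large datasets.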

48. 【2606.17639】ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Link: https://arxiv.org/abs/2606.17639

Authors: Hong Yang,Basura Fernando

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: situated visual observations, Generalist embodied agents, Generalist embodied, environmental constraints, object recognition

Comments: under review at NeurIPS

Abstract:Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at this https URL, and the project page is at this https URL.

49. 【2606.17627】Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

Link: https://arxiv.org/abs/2606.17627

Authors: Alessandro Sottovia,Alessandro Torcinovich,Oswald Lanz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: small visual cues, Fine-grained action recognition, visual cues, Fine-grained action, VLM orchestrator chunks

Comments:

Abstract:Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step: the gain stems from decorrelated model priors rather than from additional compute.
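Editorial illustration (not from the paper): step (iii) aggregates agent rankings with a Borda count. A minimal sketch of that voting rule follows; the scoring convention (a list of length k awards k - 1 points to its top entry), the alphabetical tie-break, and the example labels are assumptions, since the abstract does not specify them.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate ranked candidate lists with a Borda count.

    Each ranking lists labels from most to least preferred. A label at
    position p of a list of length k earns k - 1 - p points; labels an
    agent did not rank earn nothing from that agent.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += k - 1 - position
    # Descending score, ties broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))

# Hypothetical rankings from three specialist models for one video segment.
specialist_rankings = [
    ["cut onion", "peel onion", "wash onion"],
    ["peel onion", "cut onion", "wash onion"],
    ["cut onion", "wash onion", "peel onion"],
]
consensus = borda_aggregate(specialist_rankings)
print(consensus)  # ['cut onion', 'peel onion', 'wash onion']
```

Here "cut onion" collects 5 points, "peel onion" 3, and "wash onion" 1, so one dissenting specialist does not overturn the majority preference.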

50. 【2606.17619】RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

Link: https://arxiv.org/abs/2606.17619

Authors: Qiwei Yan,Zhiqiang Yuan,Chongyang Li,Jiapei Zhang,Ying Deng,Jinchao Zhang,Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains poorly understood, made rapid progress, subjects remains poorly, Reference-driven image generation, reliable viewpoint control

Comments:

Abstract:Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
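Editorial illustration (not from the paper): the abstract mentions a LogDet-based subset selection that keeps references both view-consistent and complementary. One common way to get that trade-off is greedy maximisation of the log-determinant of a quality-weighted similarity kernel, as in determinantal point processes. The sketch below assumes that formulation; the kernel, the `quality` scores, and all function names are hypothetical, and the paper's actual objective may differ.

```python
import math

def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(s)
                total += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = s / lower[j][j]
    return total

def greedy_logdet_select(features, quality, budget, eps=1e-6):
    """Greedily pick `budget` items that maximise log det of the kernel.

    features: unit-norm vectors; their dot product measures redundancy.
    quality: per-item relevance (e.g. agreement with the anchor viewpoint).
    """
    def kernel(i, j):
        sim = sum(a * b for a, b in zip(features[i], features[j]))
        # Small diagonal jitter keeps the matrix positive definite.
        return quality[i] * sim * quality[j] + (eps if i == j else 0.0)

    selected = []
    for _ in range(budget):
        best, best_score = None, -math.inf
        for cand in range(len(features)):
            if cand in selected:
                continue
            idx = selected + [cand]
            score = log_det([[kernel(a, b) for b in idx] for a in idx])
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
    return selected

# Item 1 nearly duplicates item 0; item 2 is less relevant but complementary.
features = [(1.0, 0.0), (0.99, 0.141), (0.0, 1.0)]
quality = [1.0, 0.95, 0.8]
chosen = greedy_logdet_select(features, quality, budget=2)
print(chosen)  # [0, 2]
```

The near-duplicate is skipped even though its quality is higher than that of item 2, because adding it barely increases the determinant.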

51. 【2606.17615】SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

Link: https://arxiv.org/abs/2606.17615

Authors: Edoardo Bianchi,Antonio Liotta

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Estimating human proficiency, Estimating human, automated skill assessment, music pedagogy, surgical training

Comments:

Abstract:Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
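Editorial illustration (not from the paper): stage (i) describes a soft router over expert MLPs. The sketch below shows generic soft mixture-of-experts routing, where every expert runs and the outputs are blended by softmax gates. It uses three toy experts rather than twelve MLPs, and the linear router, the weights, and all names are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_route(view_feature, router_weights, experts):
    """Soft mixture-of-experts: every expert runs, outputs are gate-weighted.

    router_weights: one weight vector per expert (a linear router).
    experts: callables mapping a feature vector to an output vector.
    """
    gates = softmax([dot(w, view_feature) for w in router_weights])
    outputs = [expert(view_feature) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]
    return mixed, gates

# Three toy "experts" standing in for expert MLPs.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

mixed, gates = soft_route([3.0, 0.5], router_weights, experts)
print([round(g, 3) for g in gates])
```

Because the gates depend only on the input feature, features from different cameras can end up preferring different experts without any camera-identity label, which is the property the abstract highlights.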

52. 【2606.17606】Flux-Guard: Facial Identity Protection using diffusion models

Link: https://arxiv.org/abs/2606.17606

Authors: Jie Wang,Tao Wang,Ru Zhang,Jianyi Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems exposes personal, exposes personal images, personal images shared, face editing, widespread deployment

Comments:

Abstract:The widespread deployment of face recognition (FR) systems exposes personal images shared on social media and public platforms to identity linkage and privacy risks. Existing adversarial privacy protection methods can degrade unauthorized FR performance but are not compatible with generative face editing. Artificial intelligence-driven face editing tools are gaining popularity, which has significantly increased user demand for personalized portrait generation and social sharing. However, current editing methods often preserve identity features, making the edited images still susceptible to tracking by malicious FR systems. Thus, this paper proposes Flux-Guard, a privacy-preserving face editing framework based on adversarial attacks, which integrates face editing and privacy protection within a unified generative process. Specifically, we design a flow trajectory control method to align semantic manipulations with the generative process and introduce latent-space adversarial optimization with an adaptive perceptual-loss-driven weighting strategy, dynamically adjusting adversarial strength to maximize attack effectiveness while preserving visual quality. Extensive experiments demonstrate that Flux-Guard supports face editing while significantly improving attack success rates against cross-domain face recognition models on the CelebA-HQ and LADN datasets. Furthermore, evaluation results for commercial APIs have confirmed its effectiveness in real-world applications. The code is released at this https URL.

53. 【2606.17601】Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting

Link: https://arxiv.org/abs/2606.17601

Authors: Hao-Yuan Ma,Yuda Zou,Li Zhang,Yongchao Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text-guided Open-vocabulary Object, Open-vocabulary Object Counting, arbitrary object categories, Open-vocabulary Object, offering substantially greater

Comments:

Abstract:Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.

54. 【2606.17598】MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

Link: https://arxiv.org/abs/2606.17598

Authors: Xingyuming Liu,Ruichun Ma,Heyu Guo,Qixiu Li,Qingwen Yang,Lin Luo,Shiqi Jiang,Chenren Xu,Jiaolong Yang,Baining Guo

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans naturally leverage, Humans naturally, robotics rely solely, naturally leverage diverse, RGB observations

Comments:

Abstract:Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.

55. 【2606.17590】TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

Link: https://arxiv.org/abs/2606.17590

Authors: Weiliang Chen,Yuanhui Huang,Xuebo Wang,Yueqi Duan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scalable video generation, tokens directly determines, TIV tokens, fundamental to scalable, directly determines

Comments:

Abstract:Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard $16 \times 256 \times 256$ benchmark and improves compression efficiency by 2.91$\times$ for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.
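Editorial illustration (not from the paper): the attention scopes described for Scope-Induced Factorization can be written as a boolean mask. The sketch below assumes a simplified token layout, with TIV tokens first and then each frame's TV tokens, and treats "the full clip" as every token in that sequence. The layout, the function name, and the sizes are assumptions.

```python
def build_scope_mask(num_tiv, num_frames, tv_per_frame):
    """Boolean attention mask: mask[q][k] is True when query q may attend to key k.

    Assumed layout: [TIV tokens | frame 0 TV tokens | frame 1 TV tokens | ...].
    """
    total = num_tiv + num_frames * tv_per_frame

    def frame_of(index):
        # None for TIV tokens, otherwise the frame the TV token belongs to.
        if index < num_tiv:
            return None
        return (index - num_tiv) // tv_per_frame

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            q_frame, k_frame = frame_of(q), frame_of(k)
            if q_frame is None:
                mask[q][k] = True  # TIV query: full-clip scope
            else:
                # TV query: TIV tokens plus tokens of its own frame only.
                mask[q][k] = k_frame is None or k_frame == q_frame
    return mask

mask = build_scope_mask(num_tiv=2, num_frames=3, tv_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

TV tokens never see other frames directly, so anything shared across frames has to travel through the TIV tokens, which is the intuition behind the factorization described in the abstract.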

56. 【2606.17584】Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

Link: https://arxiv.org/abs/2606.17584

Authors: Semin Kim,Jihwan Yoon,Seunghoon Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Finding the initial, training-free image editing, data sample, initial noise, noise that generates

Comments:

Abstract:Finding the initial noise that generates a given data sample, known as inversion, is a key component for downstream applications such as training-free image editing. Existing fixed-point inversion methods improve inversion accuracy by formulating each inversion step as a fixed-point problem, but they lack a principled mechanism for selecting among multiple fixed-point solutions that can arise in practice. We observe that different selections induce different inversion trajectories, leading to substantial variation in reconstruction and editing quality. For rectified flows, we further find that this variation is closely associated with trajectory straightness, motivating straightness as a principled selection criterion. We propose SelFix, a fixed-point inversion method that selects fixed-point solutions inducing straighter inverse trajectories while retaining convergence to an exact inverse root under standard local assumptions. Experiments on FLUX.1-dev and PIE-Bench show that SelFix improves fixed-point inversion, achieving stronger real-image reconstruction and better source-preserving prompt-based editing than prior inversion baselines. The code is available at this https URL.
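Editorial illustration (not the authors' SelFix method): the abstract builds on formulating each inversion step as a fixed-point problem. For an explicit Euler step x_next = x + dt * v(x, t), the earlier state x satisfies x = x_next - dt * v(x, t), which can be solved by iteration. The sketch below shows only this plain iteration on a scalar toy problem; the velocity field is a hypothetical stand-in for a learned model, and the paper's straightness-based root selection is not implemented.

```python
import math

def velocity(x, t):
    # Hypothetical smooth velocity field standing in for a learned model.
    return math.sin(x) + 0.5 * t

def forward_step(x, t, dt):
    # One explicit Euler step of the generative ODE dx/dt = v(x, t).
    return x + dt * velocity(x, t)

def invert_step(x_next, t, dt, iters=50):
    """Solve x = x_next - dt * v(x, t) for x by fixed-point iteration."""
    x = x_next  # initial guess: the known later state
    for _ in range(iters):
        x = x_next - dt * velocity(x, t)
    return x

x_t = 0.7
x_next = forward_step(x_t, t=0.3, dt=0.1)
recovered = invert_step(x_next, t=0.3, dt=0.1)
print(abs(recovered - x_t))  # close to machine precision
```

In this toy case the update is a contraction (dt times the slope of v is at most 0.1), so the iteration has a single fixed point. The abstract's concern is the case where several fixed points exist and the one reached affects reconstruction and editing quality.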

57. 【2606.17564】Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

Link: https://arxiv.org/abs/2606.17564

Authors: Qiyan Luo,Jie Yang,Yingdong Pi,Lekang Wen,Mi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: complex imaging geometries, Standardized evaluation protocols, Standardized evaluation, remote sensing, imaging geometries

Comments: Accepted as an oral presentation at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Abstract:Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.

58. 【2606.17561】RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

Link: https://arxiv.org/abs/2606.17561

Authors: Hao-Yuan Ma,Li Zhang,Zhiwei Zhu,Jie Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text-guided open-vocabulary object, count objects belonging, natural language descriptions, Text-guided open-vocabulary, open-vocabulary object counting

Comments:

Abstract:Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model's counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4$\times$ faster and over 4$\times$ more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: this https URL.

59. 【2606.17557】Universal Image Restoration via Internalized Chain-of-Thought Reasoning

Link: https://arxiv.org/abs/2606.17557

Authors: Yu Guo,Zhengru Fang,Shengfeng He,Senkang Hu,Yihang Tao,Phone Lin,Yuguang Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recover high-quality images, ill-posed under complex, seeks to recover, recover high-quality, degraded inputs

Comments:

Abstract:Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at this https URL.

60. 【2606.17540】TaFD: Threat-Aware Frequency Decoupling for Adversarial Robustness against Heterogeneous Attacks

Link: https://arxiv.org/abs/2606.17540

Authors: Mengda Xie,Yiling He,Meie Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-threat robustness remains, deep learning, remains a fundamental, fundamental challenge, challenge in deep

Comments:

Abstract:Multi-threat robustness remains a fundamental challenge in deep learning. Although joint adversarial training (JAT) is widely adopted, it suffers from negative transfer under heterogeneous threats, particularly between $\ell_p$-bounded and semantic attacks. Through first-order gradient analysis, we formalize this as gradient incompatibility and theoretically establish the necessity of decoupled optimization. We further reveal that these conflicting threats exhibit separable spectral characteristics in the frequency domain. Motivated by this observation, we propose Threat-aware Frequency Decoupling (TaFD), a two-stage defense framework that reformulates JAT as a frequency-domain divide-and-conquer paradigm. TaFD first discovers latent threat domains via unsupervised clustering of attack spectral prototypes and trains a lightweight classifier for inference-time threat domain identification. Conditioned on the prediction, TaFD employs a Frequency-Conditional Convolution that learns threat-domain-specific spectral masks and routes each sample to the corresponding expert, enforcing structural parameter separation and alleviating optimization conflicts. We validate TaFD on three representative image-classification benchmarks (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and on two representative architectures (the convolutional ResNet and the hybrid-transformer MobileViT). Extensive results demonstrate that TaFD achieves more balanced robustness against heterogeneous attacks than existing JAT and frequency-domain baselines, improving average robust accuracy by approximately 11% over the strongest baseline while maintaining leading clean accuracy.

61. 【2606.17539】Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Link: https://arxiv.org/abs/2606.17539

Authors: Yatai Ji,An-Chieh Cheng,Yang Fu,Yukang Chen,Han Zhang,Zhaojing Yang,Wei Huang,Ka Chun Cheung,Song Han,Vidya Nariyambut Murali,Pavlo Molchanov,Jan Kautz,Simon See,Hongxu Yin,Ping Luo,Sifei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: relations remains challenging, made substantial progress, scene relations remains, reasoning requiring multi-step, requiring multi-step inference

Comments:

Abstract:Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.

62. 【2606.17536】OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

Link: https://arxiv.org/abs/2606.17536

Authors: Zijie Meng,Yufei Liu,Chengqian Ma,Zhiyu Li,Jiyuan Liu,Wenhua Nie,Bingcai Wei,Shuqin Chen,Weichen Xu,Jiquan Yuan,Miao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: heterogeneous control injection, incompatible representational spaces, autonomous driving face, camera poses reside, Generative world models

Comments: 24 pages, 10 figures

Abstract:Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

63. 【2606.17520】GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

Link: https://arxiv.org/abs/2606.17520

Authors: Jiawei Zhang,Yiming Yan,Chao Liang,Nuo Xu,Seson Sun,Qichen Zhang,Yuhao Xu,Yantai Yang,Yingqiao Wang,Qin Jin,Zhipeng Zhang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real world requires, world requires skilled, requires skilled operators, Training embodied agents, expensive hardware

Comments:

Abstract:Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

64. 【2606.17511】MagicSim: A Unified Infrastructure for Executable Embodied Interaction

Link: https://arxiv.org/abs/2606.17511

Authors: Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: substrate linking control, execution substrate linking, fixed task environment, linking control, require simulation

Comments:

Abstract: Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command-Skill-Planner-Robot-Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

65. 【2606.17482】SPHINX: First Explain, Then Explore

Link: https://arxiv.org/abs/2606.17482

Authors: Nguyen Do, Tue M. Cao, Tien Van Do, András Hajdu, Tamás Bérczes, My T. Thai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating adversarial driving, vehicle decision-making systems, Large Language Models, Generating adversarial, systems in simulation

Comments: 13 pages

Abstract: Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy's weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Rather than blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts, their influence on policy outputs, and the uncertainty of the decisions. Given the interpretable evidence extracted from the policy's own decision process, we use a vision-language model to rationalize and critique failure modes of the current policy. These critiques are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can provide an interpretable account of policy failures while other adversarial scene generation methods cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.

66. 【2606.17480】GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Link: https://arxiv.org/abs/2606.17480

Authors: Haoyu Wang, Guoqing Ma, Zeyu Zhang, Yandong Guo, Boxin Shi, Hao Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: reliable robot trajectories, plan reliable robot, reusable manipulation experience, evidence and reusable, robot trajectories

Comments:

Abstract:Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: this https URL. Website: this https URL.

67. 【2606.17477】Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer

Link: https://arxiv.org/abs/2606.17477

Authors: Salimeh Sekeh, Xin Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: evolving data distributions, rejecting semantic-shifted OOD, open-world environments requires, dynamic open-world environments, dynamic OOD detection

Comments:

Abstract:Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that uses an RL-guided correction term on top of standard gradient descent (GD) and show its improvement over both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.

68. 【2606.17475】StereoFactory: A Unified Merging Framework for Robust Stereo Matching

Link: https://arxiv.org/abs/2606.17475

Authors: Xianda Guo, Pinhan Fu, Ruilin Wang, Wenke Huang, Mang Ye, Qin Zou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: costly joint retraining, Stereo matching, foundation models trained, data requires costly, requires costly joint

Comments:

Abstract: Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7-3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: this https URL.
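
To make the merging idea in this abstract concrete, here is a minimal sketch of generic task-vector merging with per-module scaling. It is not the authors' implementation: the module names, the expert names, and the hand-picked `selected` subset and `module_scale` values are all hypothetical, and the genetic-algorithm and CMA-ES searches that the abstract says choose them are not shown.

```python
from typing import Dict, List

# Hypothetical toy representation: module name -> flattened weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Task vector = fine-tuned weights minus the shared base weights."""
    return {m: [f - b for f, b in zip(finetuned[m], base[m])] for m in base}


def merge(
    base: Params,
    experts: Dict[str, Params],
    selected: List[str],
    module_scale: Dict[str, Dict[str, float]],
) -> Params:
    """Add scaled task vectors of the selected experts to the base, per module."""
    merged = {m: list(w) for m, w in base.items()}
    for name in selected:
        for module, delta in task_vector(experts[name], base).items():
            # A missing scale means this expert contributes nothing to the module.
            scale = module_scale.get(module, {}).get(name, 0.0)
            merged[module] = [w + scale * d for w, d in zip(merged[module], delta)]
    return merged


base = {"encoder": [1.0, 1.0], "decoder": [0.0, 0.0]}
experts = {
    "indoor": {"encoder": [2.0, 1.0], "decoder": [0.0, 1.0]},
    "driving": {"encoder": [1.0, 3.0], "decoder": [4.0, 0.0]},
    "noisy": {"encoder": [9.0, 9.0], "decoder": [9.0, 9.0]},
}
selected = ["indoor", "driving"]  # subset choice; "noisy" is left out
module_scale = {
    "encoder": {"indoor": 1.0, "driving": 0.5},
    "decoder": {"indoor": 0.0, "driving": 0.25},
}
merged = merge(base, experts, selected, module_scale)
print(merged)
```

In this toy setup, leaving an expert out of `selected` removes its task vector entirely, while `module_scale` lets one module draw on a different mix of experts than another.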

69. 【2606.17463】WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

Link: https://arxiv.org/abs/2606.17463

Authors: Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue, Guiliang Liu, Simo Wu, Xiangyang Xue, Taiping Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: remarkable single-step manipulation, achieved remarkable single-step, remain brittle precisely, single-step manipulation, achieved remarkable

Comments:

Abstract: Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $\pi_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
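
The "query-driven attention pooling" mentioned in this abstract can be illustrated with a small sketch of generic attention pooling, where a fixed number of query vectors compresses a variable-length segment into the same number of latent tokens. This is an assumption-laden illustration, not WeaveLA's code: the function names are hypothetical, the queries are fixed here rather than learned, and frame features serve as both keys and values without any projection.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries: List[Vector], frames: List[Vector]) -> List[Vector]:
    """Compress a segment of frame features into len(queries) latent tokens."""
    dim = len(frames[0])
    tokens = []
    for query in queries:
        # Scaled dot-product score of this query against every frame.
        scores = [
            sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
            for frame in frames
        ]
        weights = softmax(scores)
        # Each token is the attention-weighted average of the frame features.
        tokens.append(
            [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
        )
    return tokens


# A completed segment with three frames of 2-D features (toy values).
segment = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.0, 0.0], [5.0, 0.0]]  # two queries -> two latent tokens
latent_tokens = attention_pool(queries, segment)
print(latent_tokens)
```

The output length depends only on the number of queries, so a long segment and a short one both hand off the same number of tokens to the next sub-task.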

70. 【2606.17449】MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

Link: https://arxiv.org/abs/2606.17449

Authors: Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Large Vision-Language Models, enhances Large Vision-Language, Multimodal Retrieval-Augmented Generation, remains highly susceptible, enhances Large

Comments: To be presented at ACL 2026

71. 【2606.17446】AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

Link: https://arxiv.org/abs/2606.17446

Authors: Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu, Jianshu Zhang, Shang Wu, Yue Chen, Guo Ye, Jiayi Wang, Zhaoran Wang, Han Liu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Simulation enables scalable, physical knowledge needed, robot data collection, enables scalable robot, scalable robot data

Comments:

Abstract: Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: this https URL.

72. 【2606.17438】Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects

Link: https://arxiv.org/abs/2606.17438

Authors: Ingu Yeo, Hyung-Gun Chi, Jae-Sang Hyun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision-based tactile sensing, tactile sensing family, sensing family pioneered, Digital Fringe Projection, commercially successful GelSight

Comments:

Abstract:This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.
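
The abstract does not say which fringe-analysis algorithm the sensor uses. As background only, the sketch below shows N-step phase shifting, a common choice in digital fringe projection, for recovering the wrapped phase at one pixel. It assumes the intensity model `I_k = A + B*cos(phi + 2*pi*k/N)`; the function names are hypothetical, and the later steps (phase unwrapping, calibration, triangulation) are not shown.

```python
import math
from typing import List


def wrapped_phase(intensities: List[float]) -> float:
    """Recover the wrapped phase from N >= 3 equally shifted fringe samples."""
    n = len(intensities)
    if n < 3:
        raise ValueError("phase shifting needs at least three samples")
    sin_sum = sum(i * math.sin(2 * math.pi * k / n) for k, i in enumerate(intensities))
    cos_sum = sum(i * math.cos(2 * math.pi * k / n) for k, i in enumerate(intensities))
    # Under the assumed model, sin_sum = -(N/2)*B*sin(phi) and
    # cos_sum = (N/2)*B*cos(phi), so the offset A cancels out.
    return math.atan2(-sin_sum, cos_sum)


def simulate_pixel(phase: float, n: int, offset: float = 0.5, amplitude: float = 0.4) -> List[float]:
    """Generate ideal, noise-free fringe intensities for one pixel."""
    return [offset + amplitude * math.cos(phase + 2 * math.pi * k / n) for k in range(n)]


true_phase = 1.0
recovered = wrapped_phase(simulate_pixel(true_phase, 4))
print(round(recovered, 6))
```

Because the phase is computed independently at every pixel, this kind of measurement gives dense per-pixel geometry without integrating surface gradients, which is the contrast with photometric stereo that the abstract draws.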

73. 【2606.17437】Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

Link: https://arxiv.org/abs/2606.17437

Authors: Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Automated classification, efficient clinical workflow, Recurrent Neural Networks, clinical workflow, workflow but faces

Comments:

Abstract:Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at this https URL.

74. 【2606.17436】UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning

Link: https://arxiv.org/abs/2606.17436

Authors: Xiongjun Guan, Jianjiang Feng, Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: task-specific pipelines, optimized in isolation, dominated by task-specific, structural parsing, Fingerprint recognition

Comments:

Abstract: Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for "a Universal fingerprint foundation model based on large-scale Unsupervised learning," which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at this https URL.

75. 【2606.17433】LADBench: A Benchmark for Logical Fault Detection in Images

Link: https://arxiv.org/abs/2606.17433

Authors: Sahasra Kondapalli, Lara Radovanovic, Aadi Palnitkar, Mingyang Mao, Xiaomin Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, visual question answering

Comments: Accepted to the IEEE International Conference on Development and Learning (ICDL 2026)

Abstract: Large Vision Language Models (VLMs) excel at visual question answering and semantic grounding, but their capacity for autonomous logical reasoning remains underexplored. Existing anomaly benchmarks emphasize visual errors or direct prompting rather than the physical and social common sense needed for open-world deployment. To address this, we introduce LAD-Bench, a benchmark of more than 1,000 curated synthetic images with logical anomalies across four domains: Residential, Urban, Collaborative, and Nature. We further propose a Tiered Prompting Protocol based on progressive disclosure, which measures how much explicit assistance a model needs to localize and reason about a logical fault. Evaluating leading foundation models reveals substantial weaknesses: even the best achieves only 70.11% overall accuracy, showing that implicit logical fault detection remains unsolved. Crucially, models often fail to identify anomalies even after receiving explicit hints in deeper tiers. By surfacing these limitations in sequential multimodal reasoning, LAD-Bench offers a rigorous framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems. Dataset and Code: this https URL

76. 【2606.17432】Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.17432

Authors: Duy-Dat Tran, Trung-Nghia Le

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian splatting, instruction-guided diffusion, Gaussian, unified framework, Abstract

Comments: SOICT 2025

Abstract:We present Edit3DGS, a unified framework for dynamic 3D head editing that integrates 2D instruction-guided diffusion with 3D Gaussian splatting. Unlike prior approaches that separately address frame-based edits or static 3D reconstruction, our method couples semantic controllability in the image domain with photorealistic, temporally consistent 3D representations. Given an input video, editable facial regions are masked and modified using a text-conditioned diffusion model to support fine-grained operations such as expression transformation, attribute modification, and appearance refinement. The edited frames are then aggregated through 3D Gaussian splatting to produce a coherent, high-fidelity avatar that preserves both identity and motion dynamics. To enforce consistency, Edit3DGS incorporates multi-view batch editing and lightweight inpainting strategies that recover lost expressions across timesteps. Experimental results demonstrate that our framework enables controllable, artifact-free head editing with smooth temporal transitions, offering practical applications in virtual avatars, immersive communication, film production, and interactive media.

77. 【2606.17431】Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

Link: https://arxiv.org/abs/2606.17431

Authors: Quoc-Duy Tran, Anh-Tuan Vo, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interpreting ambiguous shapes, artistic images, interpreting ambiguous, advanced the ability, ability to render

Comments: SOICT 2025

Abstract: Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.

78. 【2606.17430】CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.17430

Authors: Trinh Thi Thu Hien, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Event-enriched image captioning, image captioning describes, Event-enriched image, including timing, Contextual Image-Article Narrator

Comments: SOICT 2025

Abstract:Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches captions with external narratives. CIAN retrieves relevant articles using SigLIP, summarizes them to guide a Narrative Generation stage with a LoRA-fine-tuned Qwen model, and applies N-Gram-based Refinement for fluency and coherence. On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and improves caption quality, increasing CIDEr from 0.030 to 0.094. These results highlight the effectiveness of retrieval-augmented reasoning combined with linguistic refinement for generating context-aware, human-like captions.
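
For readers unfamiliar with the retrieval metric quoted in this abstract, here is a minimal sketch of the textbook mean average precision (mAP) computation over ranked retrieval results. The exact variant used by the OpenEvents-V1 benchmark (for example, any rank cutoff) is not stated in the abstract, so this is an assumption; the query data and document IDs are made up.

```python
from typing import Iterable, List, Sequence, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """Average the per-query average precision over all queries."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Two toy queries: (ranked article IDs, ground-truth relevant IDs).
runs = [
    (["a", "b", "c"], ["a"]),  # relevant article ranked first  -> AP = 1.0
    (["x", "y", "z"], ["y"]),  # relevant article ranked second -> AP = 0.5
]
score = mean_average_precision(runs)
print(score)  # 0.75
```

With a single relevant article per image, average precision reduces to the reciprocal of the rank at which that article is retrieved.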

79. 【2606.17427】Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications

Link: https://arxiv.org/abs/2606.17427

Authors: Damian M. Manzone, Mathew Szymanowski, Olga Taran, Shuo Cai, Melissa Marquez-Chin, Tammy Zeng, Hardeep Singh, Cesar Marquez-Chin, José Zariffa

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Mixed reality applications, pose estimation, Mixed reality, pose estimation accuracy, estimation

Comments:

Abstract:Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm) and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and the dataset generated can be used to refine pose estimation methods for hand-impaired populations.

80. 【2606.17412】Enhancing Pathological VLMs with Cross-scale Reasoning

Link: https://arxiv.org/abs/2606.17412

Authors: Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: global tissue architecture, inherently multi-scale, requiring pathologists, accurate diagnosis, pathologists to integrate

Comments:

Abstract: Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.

81. 【2606.17410】Attention Alignment Between Humans and Vision-Language Models

Link: https://arxiv.org/abs/2606.17410

Authors: Isaac R. Christian,Udith Haputhanthrige,Hanna Hornfeld,Declan Campbell,Samuel Nastase,Taylor Webb,Michael Graziano

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual perception depends, bottom-up sensory mechanisms, sensory mechanisms, perception depends, Transformer decoders

Abstract:Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2×2 factorial of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs. Transformer decoders increased alignment by 40-50 percentage points (80-87% vs. 40-59% of the human noise ceiling). In contrast, CNN vs. ViT encoders contributed a secondary 5-20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85-87%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.
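
To make the "percent of the human noise ceiling" figures concrete, here is a minimal sketch of one way such a score could be computed. It assumes Pearson correlation between flattened maps as the similarity measure and treats the noise ceiling as a given number; the abstract does not say which metric the authors used, and the function names are hypothetical.

```python
import math

def flatten(grid):
    """Turn a 2D map (list of rows) into a flat list of values."""
    return [value for row in grid for value in row]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def alignment_pct(model_map, human_map, noise_ceiling):
    """Model-to-human correlation as a percentage of the noise ceiling.

    noise_ceiling is assumed to be the human-to-human correlation,
    estimated elsewhere (for example by split-half comparison).
    """
    r = pearson(flatten(model_map), flatten(human_map))
    return 100.0 * r / noise_ceiling
```

A model whose attention map correlates with the fixation heatmap at 0.4 when the ceiling is 0.5 would score 80% under this definition.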

82. 【2606.17408】Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

Link: https://arxiv.org/abs/2606.17408

Authors: Meipo Dai,Qiyuan Zhuang,He-Yang Xu,Ying-Jie Shuai,Yijun Wang,Qi Dou,Xiu-Shen Wei

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: observation-independent standard Gaussian, typically begin action, standard Gaussian distribution, action generation begin, begin action generation

Abstract:Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines -- including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy -- by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
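
The idea of a proprioception-conditioned diagonal Gaussian source can be sketched in a few lines. This is only an illustration under stated assumptions: the one-hidden-layer MLP, the toy weights in `TOY_PARAMS`, and the function names are all hypothetical, and the downstream flow-matching or diffusion generator is not shown.

```python
import math
import random

def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sample_source(proprio, params, rng):
    """Draw an initial action vector from N(mu(s), diag(sigma(s)^2)).

    A standard-Gaussian source would ignore `proprio` and return
    [rng.gauss(0, 1), ...]; here both mean and scale depend on the state.
    """
    hidden = [math.tanh(h) for h in linear(params["w1"], params["b1"], proprio)]
    mu = linear(params["w_mu"], params["b_mu"], hidden)
    log_std = linear(params["w_log_std"], params["b_log_std"], hidden)
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_std)]

# Toy, hand-picked weights (assumption): 2-D state, 2 hidden units, 3-D action.
TOY_PARAMS = {
    "w1": [[0.5, -0.2], [0.1, 0.3]],
    "b1": [0.0, 0.0],
    "w_mu": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "b_mu": [0.0, 0.0, 0.0],
    "w_log_std": [[0.2, 0.0], [0.0, 0.2], [0.1, 0.1]],
    "b_log_std": [-1.0, -1.0, -1.0],
}

if __name__ == "__main__":
    print(sample_source([0.1, -0.3], TOY_PARAMS, random.Random(0)))
```

Because only the starting sample changes, a prior of this kind can in principle be placed in front of an existing generator without touching its architecture or solver, which is the property the abstract emphasizes.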

83. 【2606.17406】Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

Link: https://arxiv.org/abs/2606.17406

Authors: Marina Chagas Bulach Gapski,Vinicius Atsushi Sato Kawai,Gustavo Rosseto Leticio,Lucas Pascotti Valem,Daniel Carlos Guimarães Pedronette,Mohand Said Allili

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Feature extraction involves, Convolutional Neural Networks, Neural Networks, Graph Neural Networks, Graph Convolutional Networks

Abstract:Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.
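
As a rough illustration of rank aggregation across extractors, the sketch below fuses per-extractor neighbor rankings with a Borda count and turns the fused ranking into k-nearest-neighbor graph edges. Borda count is an assumption chosen for simplicity; the abstract does not name the aggregation strategies evaluated, and both function names are hypothetical.

```python
def borda_aggregate(rankings):
    """Fuse ranked lists (best first) with a Borda count.

    Each list awards n points to its top item, n - 1 to the next, and so on.
    Ties are broken by item id so the result is deterministic.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    return sorted(scores, key=lambda item: (-scores[item], item))

def knn_edges(node, fused_ranking, k):
    """Connect `node` to its k best-ranked neighbors, skipping itself."""
    neighbors = [other for other in fused_ranking if other != node][:k]
    return [(node, other) for other in neighbors]

if __name__ == "__main__":
    # Neighbor rankings for image 0 from three hypothetical extractors.
    per_extractor = [[0, 1, 2, 3], [0, 2, 1, 3], [0, 2, 3, 1]]
    fused = borda_aggregate(per_extractor)
    print(fused, knn_edges(0, fused, k=2))
```

The resulting edge list could serve as the graph input to a GCN, with node features taken from one or several of the extractors.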

84. 【2606.17403】Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

Link: https://arxiv.org/abs/2606.17403

Authors: Shikha V. Chandel,Yadav Raj Ghimire,Timothy Agboada,Leila Hashemi-Beni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: effective disaster response, Rapid assessment, response and recovery, essential for effective, effective disaster

Abstract:Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
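
For readers unfamiliar with frequency-domain inputs, the following sketch computes a log-magnitude spectrum of a small grayscale patch with a naive 2D discrete Fourier transform. It is an assumption that a log-magnitude spectrum resembles what the study feeds to its frequency branch; the abstract does not specify the transform, and `dft2_log_magnitude` is a hypothetical name.

```python
import cmath
import math

def dft2_log_magnitude(image):
    """Return log(1 + |F(u, v)|) for a small 2D patch.

    Naive O(N^4) DFT, suitable only for tiny illustrative inputs;
    a real pipeline would use an FFT library.
    """
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2.0 * math.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            out_row.append(math.log1p(abs(total)))
        spectrum.append(out_row)
    return spectrum

if __name__ == "__main__":
    stripes = [[1.0, 0.0, 1.0, 0.0] for _ in range(4)]
    for row in dft2_log_magnitude(stripes):
        print([round(value, 3) for value in row])
```

A regular texture such as vertical stripes concentrates its energy in a few frequency bins, which is the kind of structural cue a frequency branch could pick up alongside spatial features.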

85. 【2606.17389】Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Link: https://arxiv.org/abs/2606.17389

Authors: Logan Mann,Yi Xia,Ajit Saravanan,Ishan Dave,Saadullah Ismail,Shikhar Shiromani,Emily Huang,Ruizhe Li,Kevin Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Multimodal Foundation Models, Multimodal Foundation, Foundation Models, Multimodal, reliability

Comments: 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: [this https URL](https://github.com/itsloganmann/VLM-Reliability-Probe)

86. 【2606.17386】rraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

Link: https://arxiv.org/abs/2606.17386

Authors: Zikang Xiong,Weixin Li,Zhouchonghao Wu,Akshay Rangesh,Saarth Bonde,Grantland Hall,Chen Tang,Yihan Hu,Wei Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: performance on benchmarks, real-world deployments, benchmarks and real-world, autonomous driving, vision backbone

Abstract:End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
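
The action KL term mentioned above can be illustrated with a small example. This sketch assumes a discrete action space and computes KL(teacher || student), with the self-play policy as teacher; the abstract states neither the action parameterization nor the direction of the divergence, so both are assumptions, as are the function names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_kl(teacher_logits, student_logits):
    """KL(teacher || student) between two categorical action distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

if __name__ == "__main__":
    # Teacher: self-play policy acting on scene state.
    # Student: vision-based policy acting on the paired image.
    print(action_kl([2.0, 0.5, -1.0], [1.5, 0.7, -0.5]))
```

Minimizing a term like this needs only paired (image, scene-state) frames, since the target distribution comes from the teacher policy and not from a logged expert trajectory.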

87. 【2606.17384】Improving and Evaluating Hand-Object Interaction Detection

Link: https://arxiv.org/abs/2606.17384

Authors: Ahmad Darkhalil,Dima Damen,David Fouhey

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstruction and robotics, Understanding hands, objects they interact, key step, step for tasks

Comments: Project page: [this https URL](https://ahmaddarkhalil.github.io/HOI-DETR/)

Abstract:Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. Our paper provides several contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions to the Co-DETR architecture to produce a state-of-the-art method; (2) a comprehensive HOI evaluation suite of 4 diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contributions of each model component.

88. 【2606.17379】MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

Link: https://arxiv.org/abs/2606.17379

Authors: Casey Meisenzahl,Jon Heiselman,Michael Holtz,Yubo Ye,Michael Miga,Linwei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Accurate intraoperative liver, substantial soft-tissue deformation, Accurate intraoperative, substantial soft-tissue, sparse intraoperative measurements

Abstract:Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation yet sparse intraoperative measurements. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as \textit{context} samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

89. 【2606.17376】Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

Link: https://arxiv.org/abs/2606.17376

Authors: Milind Rampure,Shadman Sakib,Haley Patel,Zahid Hasan,Nirmalya Roy

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: minimizing physical contact, reduce responder risk, improve operational safety, disaster recovery, emergency response

Comments: 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

Abstract:Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8 m, NIR remains effective up to 6 m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8 m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
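
To show what the final estimation step might look like once a chest-ROI intensity trace is available, here is a minimal sketch that picks the dominant frequency inside a breathing band. The band limits (0.1 to 0.7 Hz, about 6 to 42 breaths per minute) and the function name are assumptions; sensor selection, ROI extraction, and SQI filtering from the paper are not modeled.

```python
import cmath
import math

def respiratory_rate_bpm(signal, fs_hz, band_hz=(0.1, 0.7)):
    """Estimate breaths per minute from the strongest in-band DFT bin.

    signal: 1D list of chest-ROI samples; fs_hz: sampling rate in Hz.
    Returns None if no DFT bin falls inside the band.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC offset
    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs_hz / n
        if not band_hz[0] <= freq <= band_hz[1]:
            continue
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return None if best_freq is None else 60.0 * best_freq

if __name__ == "__main__":
    # Synthetic 60 s trace at 10 Hz with a 0.25 Hz breathing component.
    trace = [math.sin(2 * math.pi * 0.25 * t / 10.0) for t in range(600)]
    print(respiratory_rate_bpm(trace, fs_hz=10.0))
```

The frequency resolution is the sampling rate divided by the window length in samples, so a 60-second window resolves the rate to about one breath per minute.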

90. 【2606.17362】DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Link: https://arxiv.org/abs/2606.17362

Authors: Xinglong Sun,Kevin Xie,Jenny Schmalfuss,Despoina Paschalidou,Xiuming Zhang,Sanja Fidler,Kashyap Chitta,Jose M. Alvarez

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Autonomous driving, policy learning, interpretable policy evaluation, driving, highly context-dependent

Comments: Under Review

Abstract:Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

91. 【2606.17355】Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

Link: https://arxiv.org/abs/2606.17355

Authors: Sharva Gogawale,Iddo Hakim,Gal Grudka,Mohammad Suliman,Omer Ventura,Daria Vasyutinsky-Shapira,Berat Kurar-Barakat,Nachum Dershowitz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digitized corpora suffer, poor resolution, automatic transcription, digitized corpora, corpora suffer

Abstract:Many digitized corpora suffer from low resources because annotations may be scarce, page scans are noisy and of poor resolution, or layouts are structurally complex in ways that negatively affect the quality of automatic transcription. Developing robust classification models for low-resource languages is inhibited by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To this end, we have curated a complex-layout dataset, manually classified into eight distinct layout types based on their separator regions. To overcome data scarcity, we propose a novel training strategy in the form of a CNN-based classifier that employs strong, domain-aware augmentations to improve generalization. We utilize narrow anisotropic Gaussian masking to suppress incidental textual details while preserving essential separations, compelling the model to learn global geometric arrangements. Additionally, we implement reflection-induced label transformations to enrich the training distribution while maintaining label consistency across asymmetric categories. The results demonstrate that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.
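
A reflection-induced label transformation can be sketched as follows: when a page is mirrored, any class whose definition depends on left versus right must be remapped so the label stays correct. The three class names here are hypothetical placeholders, since the abstract does not list the paper's eight layout types.

```python
# Hypothetical label names (assumption); symmetric classes map to themselves.
MIRROR_LABEL = {
    "note_left": "note_right",
    "note_right": "note_left",
    "two_column": "two_column",
}

def reflect_horizontally(page, label):
    """Mirror a page (list of pixel rows) and remap its layout label."""
    flipped = [list(reversed(row)) for row in page]
    return flipped, MIRROR_LABEL[label]

if __name__ == "__main__":
    page = [[1, 0, 0], [1, 0, 0]]  # separator mark on the left edge
    print(reflect_horizontally(page, "note_left"))
```

Without the remapping, a flipped "note_left" page would be presented to the classifier with a label that contradicts its geometry, which is why plain flip augmentation is unsafe for asymmetric categories.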

92. 【2606.17352】MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

Link: https://arxiv.org/abs/2606.17352

Authors: Rahim Hossain,Md Tawheedul Islam Bhuian,Md Farhan Shadiq,Kyoung-Don Kang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multilayer Mahalanobis, strictly post-hoc, fully unsupervised, Multilayer, Mahalanobis

Abstract:We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.
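
The core scoring step, a Mahalanobis distance under a shrinkage-regularized covariance, can be sketched without any library. Note the simplifications: a fixed shrinkage weight `alpha` stands in for the data-driven Ledoit-Wolf estimate, a single mean replaces any per-cluster structure, and layer selection and fusion are omitted. All function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows, mu):
    """Maximum-likelihood covariance matrix (divides by n)."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / n
    return cov

def shrink(cov, alpha):
    """Blend cov with a scaled identity; alpha is fixed here (assumption)."""
    d = len(cov)
    avg_var = sum(cov[i][i] for i in range(d)) / d
    return [[(1 - alpha) * cov[i][j] + (alpha * avg_var if i == j else 0.0)
             for j in range(d)] for i in range(d)]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def mahalanobis_sq(feature, mu, cov):
    """Squared Mahalanobis distance; larger values suggest OOD inputs."""
    diff = [x - m for x, m in zip(feature, mu)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))
```

Shrinkage keeps the covariance well conditioned when the fused feature dimension is large relative to the number of in-distribution samples, which is the stabilizing role the abstract attributes to the Ledoit-Wolf estimator.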

93. 【2606.17343】Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling

Link: https://arxiv.org/abs/2606.17343

Authors: Ahmed Karam Eldaly,Matteo Figini,Daniel C. Alexander

Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

Keywords: compressed sensing magnetic, sensing magnetic resonance, resonance image reconstruction, magnetic resonance image, quantification using compressed

Abstract:We propose a novel framework for uncertainty quantification using compressed sensing magnetic resonance image reconstruction. The problem is formulated within a Bayesian framework as a linear inverse problem, with prior distributions assigned to the unknown model parameters. Specifically, the image to be reconstructed is assumed to be sparse in a given basis. We develop a general framework applicable to any basis and as examples, we test the sparsity of the image in its (1) spatial gradients using a total variation prior model, and in its (2) wavelet transform. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then employed to sample from the posterior distribution of the unknown parameters. The non-differentiable conditional distributions are efficiently sampled using a proximal MCMC method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sub-sampling patterns and ratios. The results demonstrate the superior performance of each proposed approach in reconstructing images compared to its counterpart optimisation-based method. Moreover, our framework effectively quantifies uncertainty, showing a notable correlation between estimated uncertainty maps and error maps computed using ground truth and reconstructed images, compared with existing deep learning-based methods.

94. 【2606.17342】Learning a Maximum Entropy Model for Visual Textures using Diffusion

Link: https://arxiv.org/abs/2606.17342

Authors: Xinyuan Zhao,Eero P. Simoncelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: provide important cues, visual scenes, repeated elements, field of grass, Visual textures

Abstract:Visual textures -- spatially homogeneous image regions containing repeated elements (e.g. a field of grass, the bark of a tree) -- are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.

95. 【2606.17340】Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

Link: https://arxiv.org/abs/2606.17340

Authors: Hongchao Shu,Roger D. Soberanis-Mukul,Hao Ding,Morgan Ringel,Mali Shen,Saif Iftekar Sayed,Hedyeh Rafii-Tari,Mathias Unberath

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: weak tissue texture, substantial appearance variation, Accurate vision-based navigation, non-rigid deformation, weak tissue

Abstract:Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.

96. 【2606.17334】FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Link: https://arxiv.org/abs/2606.17334

Authors: Md Tawheedul Islam Bhuian,Kyoung-Don Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering inherent advantages, asynchronously capture logarithmic, capture logarithmic intensity, offering inherent, cameras are bio-inspired

Abstract:Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.
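
The projection idea behind Pillar Encoding can be illustrated for a single pixel. The sketch below treats a pillar's events as a signed impulse train over one accumulation window and projects it onto Legendre polynomials, which are orthogonal on [-1, 1]. Choosing Legendre polynomials is an assumption, as the abstract only says "orthogonal polynomial basis", and the function names are hypothetical.

```python
def legendre_values(tau, order):
    """P_0..P_order evaluated at tau in [-1, 1] via Bonnet's recurrence."""
    values = [1.0, tau]
    for k in range(1, order):
        values.append(((2 * k + 1) * tau * values[k] - k * values[k - 1]) / (k + 1))
    return values[: order + 1]

def encode_pillar(events, window, order):
    """Project one pillar's events onto the first order + 1 Legendre modes.

    events: list of (timestamp, polarity) with 0 <= timestamp <= window.
    Coefficient k is (2k + 1) / 2 * sum_i polarity_i * P_k(tau_i), the
    least-squares projection of the impulse train onto P_k.
    """
    coeffs = [0.0] * (order + 1)
    for timestamp, polarity in events:
        tau = 2.0 * timestamp / window - 1.0  # map [0, window] to [-1, 1]
        for k, p_k in enumerate(legendre_values(tau, order)):
            coeffs[k] += polarity * p_k * (2 * k + 1) / 2.0
    return coeffs

if __name__ == "__main__":
    # One positive event in the middle of a 1-unit window, order 2.
    print(encode_pillar([(0.5, 1)], window=1.0, order=2))
```

Stacking the coefficient vectors of all pixels gives a dense pseudo-image with order + 1 channels, and no timestamps are quantized into sub-bins along the way.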

97. 【2606.17321】ProCUA-SFT Technical Report

Link: https://arxiv.org/abs/2606.17321

Authors: Jaehun Jung,Ximing Lu,Brandon Cui,Muhammad Khalifa,Shaokun Zhang,Hao Zhang,Jin Xu,Amala Sanjay Deshmukh,Karan Sapra,Andrew Tao,Yejin Choi,Jan Kautz,Mingjie Liu,Yi Dong

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: full desktop environments, mouse actions, requires large-scale, screenshots and keyboard, Training computer-use agents

Comments: 15 pages, 5 figures

Abstract:Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35 percentage points above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
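
The expansion of a trajectory into step-prefix samples can be sketched as below. The sample layout (task, prior actions, current observation, target action) is an assumption made for illustration; the abstract does not describe the actual context format, and `step_prefix_samples` is a hypothetical name.

```python
def step_prefix_samples(task, steps):
    """Expand one trajectory into one training sample per step.

    steps: list of (observation, action) pairs in execution order.
    Sample i contains the actions taken before step i as history, the
    observation at step i, and the action at step i as the target.
    """
    samples = []
    for i, (observation, action) in enumerate(steps):
        samples.append({
            "task": task,
            "history": [past_action for _, past_action in steps[:i]],
            "observation": observation,
            "target_action": action,
        })
    return samples

if __name__ == "__main__":
    trajectory = [
        ("screenshot_0", "click(File)"),
        ("screenshot_1", "click(Save As)"),
        ("screenshot_2", "type(report.xlsx)"),
    ]
    for sample in step_prefix_samples("Save the sheet as report.xlsx", trajectory):
        print(sample["history"], "->", sample["target_action"])
```

An n-step trajectory yields n samples under this scheme, which is consistent with the reported scale: 3.1M samples from 93K trajectories is roughly 33 samples per trajectory on average.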

98. 【2606.17310】SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Link: https://arxiv.org/abs/2606.17310

Authors: Suttisak Wizadwongsa,Hyelin Nam,Supasorn Suwajanakorn,Jeong Joon Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating novel renderings, user-defined camera trajectories, single monocular video, visual effects, scene along user-defined

Comments: 20 pages, 13 figures

Click to view abstract

Abstract:Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: this https URL.

99. 【2606.17298】Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

Link: https://arxiv.org/abs/2606.17298

Authors: Yiqing Shen,Hao Ding,Mathias Unberath

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: operating rooms, enabling technology, stakeholders to retrieve, retrieve and inspect, inspect recordings

Comments:

Click to view abstract

Abstract:Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, to unlock its full potential text-to-video retrieval must be able to handle implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). However, existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.

100. 【2606.17296】Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

Link: https://arxiv.org/abs/2606.17296

Authors: Xiwen Wei,Mark Nutter,Madhusudhanan Srinivasan,Radu Marculescu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single autoregressive transformer, Unified multimodal models, integrating multimodal understanding, Unified multimodal, autoregressive transformer

Comments:

Click to view abstract

Abstract:Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.

101. 【2606.17279】Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Link: https://arxiv.org/abs/2606.17279

Authors: Yiqing Shen,Han Zhang,Mathias Unberath

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: question answering requires, video question answering, answering requires multi-step, Surgical video question, requires multi-step reasoning

Comments:

Click to view abstract

Abstract:Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.

102. 【2606.17257】Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Link: https://arxiv.org/abs/2606.17257

Authors: Rohit Kundu,Arindam Dutta,Sarosij Bose,Athula Balachandran,Amit K. Roy-Chowdhury

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: degrades general capability, Open-weight video diffusion, apply external filters, video diffusion models, generate photorealistic unsafe

Comments:

Click to view abstract

Abstract:Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, to our knowledge, the broadest safety evaluation suite in the video generation literature.

103. 【2606.17256】Contrastive Action-Image Pre-training for Visuomotor Control

Link: https://arxiv.org/abs/2606.17256

Authors: Yuvan Sharma,Dantong Niu,Anirudh Pai,Zekai Wang,Zhuoyang Liu,Baifeng Shi,Stefano Saravalle,Boning Shao,Ruijie Zheng,Jing Wang,Konstantinos Kallidromitis,Yusuke Kato,Fabio Galasso,Yuke Zhu,Danfei Xu,Linxi "Jim" Fan,Jitendra Malik,Trevor Darrell,Roei Herzig

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing vision encoders, Existing vision, robotic datasets lack, fundamental bottleneck, face a fundamental

Comments:

Click to view abstract

Abstract:Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.

104. 【2606.17246】GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

Link: https://arxiv.org/abs/2606.17246

Authors: Maram Hasan,Aman Verma,Savitra Roy,Hariseetharam Gunduboina,Daksh Jain,Muhammad Haris Khan,Subhasis Chaudhuri,Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: Remote-sensing vision-language models, advanced Earth-observation analysis, demands tool-grounded spatial, tool-grounded spatial reasoning, Remote-sensing vision-language

Comments: 28 pages, 11 figures

Click to view abstract

Abstract:Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence -- optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers -- spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

105. 【2606.17242】Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples

Link: https://arxiv.org/abs/2606.17242

Authors: Thainara Lima,Vitor Martins

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: monitoring requires frequent, spatially detailed, requires frequent, Coastal algal bloom, globally consistent observations

Comments:

Click to view abstract

Abstract:Coastal algal bloom monitoring requires frequent, spatially detailed, and globally consistent observations, provided by Landsat-8/9 and Sentinel-2 A/B/C. Together, these missions offer over a decade of medium-resolution multispectral imagery with near-global coverage every 2-3 days, enabling the detection of fragmented bloom structures not resolvable by coarse ocean-color sensors. However, their use in aquatic environments remains challenging due to limited spectral coverage and a lack of harmonized reflectance products. As an alternative to traditional bio-optical methods, deep learning-based image classification offers a data-driven approach that can overcome many of these limitations. This study presents the first successful implementation of vision transformer-based coastal algal bloom mapping using 30-m Landsat-Sentinel-2 images. A globally distributed bloom patch dataset was generated across bloom-prone coastal hotspots worldwide. Four transformer-based architectures were compared against a standard convolutional baseline for fine-scale bloom detection, and assessed under different optical water types and atmospheric and surface conditions. All deep learning models showed strong capabilities in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint stress in a time series, the Swin Transformer outperformed traditional spectral-index approaches, which produced widespread false positives, effectively avoiding cloud- and glint-affected pixels. Comparisons with MODIS-derived products further highlighted the benefits of higher spatial resolution in detecting fragmented and irregularly affected blooms. Our findings support deep learning as a reliable tool for medium-resolution, consistent monitoring of floating algal blooms in dynamic coastal environments.

106. 【2606.17241】Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception

Link: https://arxiv.org/abs/2606.17241

Authors: Aditya Mishra,Haroon Lone

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)

Keywords: Jetson Orin Nano, NVIDIA Jetson Orin, including temporal instability, resource-constrained edge hardware, edge hardware introduces

Comments:

Click to view abstract

Abstract:Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.

107. 【2606.17222】Quantum Enchanced Multi-Scale CNN with Bi-directional Mamba for Crop Field Analysis

Link: https://arxiv.org/abs/2606.17222

Authors: Mohammad Salman Khan,Ehsan Atoofian,Saad B. Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurate crop monitoring, monitoring and assessment, captures rich spectral, HSI classification remains, essential for precision

Comments:

Click to view abstract

Abstract:Hyperspectral image (HSI) crop analysis is essential for precision agriculture because it captures rich spectral and spatial information for accurate crop monitoring and assessment. However, HSI classification remains challenging due to high spectral dimensionality, spatial complexity, class imbalance, and limited labeled samples. To address these challenges, this paper proposes a BiSpectral Mamba-based framework that combines multi-scale convolutional feature extraction, spectral attention, bidirectional state-space modeling, and quantum-inspired learning. A multi-scale CNN backbone first extracts hierarchical spatial-spectral representations through feature fusion across multiple resolutions. A spectral attention mechanism then emphasizes informative bands while suppressing redundant and noisy channels. The refined features are processed by a BiSpectral Mamba module that captures long-range dependencies in both forward and backward directions by modeling hyperspectral feature maps as sequential tokens. In addition, class-weighted optimization and feature fusion strategies are incorporated to improve training stability and mitigate class imbalance. Experimental evaluation on the UAVHSI-Crop dataset demonstrates the effectiveness of the proposed framework, achieving an overall accuracy of 84.83%. The results show that integrating convolutional, attention-based, and state-space modeling components enables robust spatial-spectral feature learning for crop classification. The proposed framework also shows potential for broader agricultural and remote sensing applications, including crop disease detection, yield prediction, and soil moisture estimation, while highlighting the effectiveness of structured state-space and quantum-inspired architectures for hyperspectral image analysis.

108. 【2606.17213】Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Link: https://arxiv.org/abs/2606.17213

Authors: Vanshali Sharma,Andrea M. Bejar,Halil Ertugrul Aktas,Quoc-Huy Trinh,Debesh Jha,Gorkem Durak,Ulas Bagci

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, including large language, Recent advances, demonstrated strong adaptability, language models

Comments:

Click to view abstract

109. 【2606.17188】Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Link: https://arxiv.org/abs/2606.17188

Authors: Prabhjot Singh,Bhushan Pawar,Madhu Reddiboina,Rajvee Sheth

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Visual Reasoning, Punjabi Multimodal Visual, overlooking billions, billions of users, Punjabi Multimodal

Comments:

Click to view abstract

110. 【2606.17080】HRDX: A Large-Scale Vector HD-Map Dataset

Link: https://arxiv.org/abs/2606.17080

Authors: Sahith Reddy Chada,Isht Dwivedi,Nirav Savaliya

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reliable autonomous driving, autonomous driving requires, driving requires vectorized, Reliable autonomous, semantically rich

Comments: [this https URL](https://usa.honda-ri.com/hrdx)

Click to view abstract

Abstract:Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://usa.honda-ri.com/hrdx

111. 【2606.17504】Two-Stage Fine-Tuning of ResNet50 for High-Sensitivity Melanoma Detection on Dermoscopic Images

Link: https://arxiv.org/abs/2606.17504

Authors: Aryan Bhagat

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: survival rates exceeding, five-year survival rates, disease spreads, dangerous form, form of skin

Comments: 13 pages, 4 figures, 4 tables. Code available at [this https URL](https://github.com/Aryanbhagat23/melanoma-detection)

Click to view abstract

Abstract:Melanoma is the most dangerous form of skin cancer with five-year survival rates exceeding 99% when detected early but falling sharply once the disease spreads. This paper proposes and evaluates a two-stage fine-tuning approach for ResNet50 applied to binary melanoma classification on dermoscopic images. The core challenges addressed are class imbalance and suboptimal transfer learning from single-stage fine-tuning. After stratified train/validation/test splitting, random oversampling was applied exclusively to the training set to achieve a 1:1 class balance. Stage 1 trained only the classification head with the ResNet50 base frozen, while Stage 2 fine-tuned all layers jointly at a low learning rate of 1e-5 to prevent catastrophic forgetting of learned visual features. On an independent test set of 3,826 images, the model achieved an AUC-ROC of 0.9559, accuracy of 88.34%, sensitivity of 87.56%, specificity of 89.13%, and F1-score of 88.29%. An ablation study confirms the two-stage protocol significantly outperforms single-stage fine-tuning, with sensitivity gains of over 4%. Grad-CAM visualizations demonstrate correct lesion localization. A fully deployable Streamlit detection application is provided alongside all training code.

112. 【2606.17295】Phenotyping TPF via Self-Supervised Learning: A Label-Agnostic Framework with Expert Validation

Link: https://arxiv.org/abs/2606.17295

Authors: Miral Elnakib,Muhammad Saad,Ahmad Al-Kabbany

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: characterisation remains unrealised, causing supervised models, learn human disagreement, stable fracture morphology, plateau fracture characterisation

Comments:

Click to view abstract

Abstract:The full potential of artificial intelligence in tibial plateau fracture characterisation remains unrealised, constrained by a fundamental dependency on labelled datasets whose consistency cannot be guaranteed: conventional classification schemes such as Schatzker and AO/OTA suffer from inter-observer variability, causing supervised models to learn human disagreement rather than stable fracture morphology. We design, implement, and validate a label-agnostic framework that eliminates this constraint by learning fracture representations directly from imaging data without observer-assigned labels. A RadImageNet-pretrained ResNet-50 encoder is fine-tuned on 154 cleaned knee radiographs using the SimCLR contrastive objective, preceded by a data cleaning protocol and followed by UMAP dimensionality reduction and k-means clustering to discover four imaging-derived phenotypes. Phenotype validity is assessed through a blinded expert review protocol administered to two independent clinicians. The four phenotypes demonstrate robust stability (bootstrap ARI = 0.319 +/- 0.041), strong internal cohesion (silhouette = 0.511), and coherence ratings of 3-5/5 from both reviewers under blinded conditions; one phenotype was unanimously identified as exhibiting comminution -- a high-complexity feature isolated without any supervisory signal. Inter-partition comparison against Schatzker labels yields ARI = 0.013, confirming orthogonality to conventional classification boundaries. Notably, expert reviewers anchored to established classification vocabularies perceived imaging-derived groups as heterogeneous precisely where Schatzker alignment was lowest, suggesting that Schatzker-trained perception and label-agnostic embedding geometry measure orthogonal dimensions. These findings establish label-agnostic SSL phenotyping as a reproducible and clinically interpretable complement to conventional classification.