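This blog post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

666 papers were updated today, including:

Natural Language Processing: 90 papers
Information Retrieval: 18 papers
Computer Vision: 124 papers

Natural Language Processing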

1. [2606.20529] LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Link: https://arxiv.org/abs/2606.20529

Authors: Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Policy-adherent tool-calling agents, obeying domain policies, Policy-adherent tool-calling, task states, turns while calling

Comments: Work in Progress

Abstract:Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
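
The ledger idea in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Ledger` class, the `refund_within_limit` rule, and the `guarded_call` helper are hypothetical names invented here, assuming only what the abstract states (observed state kept outside the prompt, rendered into it, and checked before a state-changing tool call runs).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Ledger:
    """Observed task state, kept separately from the prompt (hypothetical)."""
    facts: dict = field(default_factory=dict)

    def record(self, key: str, value: Any) -> None:
        self.facts[key] = value

    def render(self) -> str:
        # Text form that would be placed into the prompt on each turn.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))


# A policy returns an error message, or None when the call is allowed.
Policy = Callable[[Ledger, dict], Optional[str]]


def refund_within_limit(ledger: Ledger, args: dict) -> Optional[str]:
    # Hypothetical rule: a refund may not exceed the amount observed as paid.
    paid = ledger.facts.get("amount_paid")
    if paid is None:
        return "amount_paid not yet observed"
    if args["amount"] > paid:
        return "refund exceeds amount paid"
    return None


def guarded_call(ledger: Ledger, policies, tool, args: dict) -> dict:
    """Run a state-changing tool only if every policy check passes."""
    for policy in policies:
        error = policy(ledger, args)
        if error is not None:
            return {"blocked": True, "reason": error}
    return {"blocked": False, "result": tool(**args)}


def issue_refund(amount):
    # Stand-in for an environment-changing tool.
    return f"refunded {amount}"


ledger = Ledger()
ledger.record("amount_paid", 40)
print(ledger.render())
print(guarded_call(ledger, [refund_within_limit], issue_refund, {"amount": 50}))
```

The check reads from the ledger rather than from the prompt text, so a syntactically valid call is still blocked when it conflicts with the recorded state.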

2. [2606.20527] StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Link: https://arxiv.org/abs/2606.20527

Authors: Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: societally consequential settings, remain poorly understood, judge people remain, people remain poorly, Multimodal large language

Comments: Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

Abstract:Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: this https URL and this https URL.

3. [2606.20487] Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Link: https://arxiv.org/abs/2606.20487

Authors: Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

Subjects: Computation and Language (cs.CL)

Keywords: Real-world computer-use tasks, span multiple applications, coordinate heterogeneous environments, Real-world computer-use, dynamic runtime failures

Abstract:Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.

4. [2606.20482] Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

Link: https://arxiv.org/abs/2606.20482

Authors: Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Large Language Model, Large Language, align a Large, Language Model, existing methods collect

Abstract:To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, users rarely provide explicit feedback on LLM responses, which makes high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from 59 Mechanical Turk workers, together with their mouse trajectories and webcam-recorded eye-gaze points on the LLMs' responses. IFLLM shows that users have very diverse types of gazing behavior and mouse trajectories. Our reward model based on the implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and code can be found at this https URL.

5. [2606.20477] Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

Link: https://arxiv.org/abs/2606.20477

Authors: Yusuf Salcan (1 and 4), Simon Ging (1 and 2), Robin Schirrmeister (3), Philipp Arnold (3), Elmar Kotter (3), Behzad Bozorgtabar (2), Thomas Brox (1) ((1) Computer Vision Group, University of Freiburg, Germany, (2) Adaptive & Agentic AI (A3) Lab, Aarhus University, Denmark, (3) Department of Radiology, Medical Center - University of Freiburg, Germany, (4) CRIION-AI Lab, Freiburg, Germany)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: train visually grounded, visually grounded vision-language, manual spatial annotations, grounded vision-language models, train visually

Comments: Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

Abstract:We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

6. [2606.20369] CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

Link: https://arxiv.org/abs/2606.20369

Authors: Helena Bonaldi, Genoveffa Martone, Marco Guerini

Subjects: Computation and Language (cs.CL)

Keywords: Online hate speech, NLP research, Online hate, misinformation frequently overlap, zero-shot models frequently

Abstract:Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

7. [2606.20295] Token-Operations-Oriented Inference Optimization Techniques for Large Models

Link: https://arxiv.org/abs/2606.20295

Authors: Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: highly stable operation, inference optimization serves, large model services, model inference optimization, Large model inference

Comments: 62 pages, 36 figures

Abstract:Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

8. [2606.20287] PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Link: https://arxiv.org/abs/2606.20287

Authors: Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

Subjects: Computation and Language (cs.CL)

Keywords: Effective Automated Essay, Automated Essay Scoring, Effective Automated, Automated Essay, Large Language Model

Abstract:Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer, which incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy, which assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
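
For readers unfamiliar with partial credit models, the sketch below computes score-category probabilities. It assumes that GPCM here follows the standard generalized partial credit model formulation from item response theory; the abstract does not give the parameterization used inside PsyScore, so the function name and arguments are illustrative only.

```python
import math


def gpcm_probs(theta, discrimination, steps):
    """Probabilities of scores 0..m under a generalized partial credit model.

    theta: latent ability; discrimination: item slope a;
    steps: step difficulties b_1..b_m (assumed parameterization).
    """
    # Cumulative logits: z_0 = 0 and z_k = sum over v <= k of a * (theta - b_v).
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + discrimination * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    return sum(k * p for k, p in enumerate(probs))


low = gpcm_probs(theta=-1.0, discrimination=1.2, steps=[-0.5, 0.5, 1.5])
high = gpcm_probs(theta=2.0, discrimination=1.2, steps=[-0.5, 0.5, 1.5])
print(expected_score(low), expected_score(high))
```

Under this formulation a higher ability shifts probability mass toward higher score categories, which is what keeps the ability estimate interpretable.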

9. [2606.20255] The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

Link: https://arxiv.org/abs/2606.20255

Authors: Celestine Achi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: separates surface sentiment, true communicative intent, Meaning Intelligence Framework, Meaning Intelligence Score, Nigerian public discourse

Comments: Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author

Abstract:We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.

10. [2606.20225] Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Link: https://arxiv.org/abs/2606.20225

Authors: Abdul Rafay Syed

Subjects: Computation and Language (cs.CL)

Keywords: Fine-tuning language models, poorly understood internal, Fine-tuning language, induces emergent misalignment, understood internal structure

Comments: 12 pages, 2 figures

Abstract:Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
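
A difference-in-means direction and a simple steering step can be sketched in a few lines. This is a toy illustration on plain Python lists, not the paper's code: the abstract only says the direction is subtracted, so removing the projection onto the unit direction (with a `strength` factor) is one common variant assumed here, and all function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(misaligned, aligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    mu_m, mu_a = mean_vector(misaligned), mean_vector(aligned)
    diff = [m - a for m, a in zip(mu_m, mu_a)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(activation, direction, strength=1.0):
    """Subtract `strength` times the projection onto the unit direction."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - strength * proj * d for a, d in zip(activation, direction)]


# Toy 2-dimensional activations standing in for final-layer hidden states.
misaligned = [[2.0, 0.0], [4.0, 0.0]]
aligned = [[0.0, 0.0], [0.0, 0.0]]
direction = difference_in_means(misaligned, aligned)
print(direction)
print(steer([5.0, 7.0], direction))
```

With `strength=1.0` the component along the direction is removed entirely while the orthogonal component is left untouched.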

11. [2606.20212] CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

Link: https://arxiv.org/abs/2606.20212

Authors: Josef Jon, Ondřej Bojar

Subjects: Computation and Language (cs.CL)

Keywords: Ukrainian and English, Czechia-primarily Ukrainian, covering Czech, portions of Vietnamese, Czech and minority

Abstract:We present CzechDocs, a multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) covering Czech and minority languages used in Czechia, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian and other languages. The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation. We provide a comparison of the most common approaches to format-preserving machine translation on a validation subset of the dataset. This validation split, together with the evaluation toolkit, is publicly released for further research. A held-out test split will be reserved for a future shared task focused on document-level translation with formatting preservation.

12. [2606.20205] Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

Link: https://arxiv.org/abs/2606.20205

Authors: Jelena Meyer, David Garcia, Dirk U. Wulff

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: assign large language, affect their usability, participants in research, stable psychological profiles, Psychological instruments designed

Abstract:Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
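
The abstract defines response orthogonality as the proportion of items for which trait and bias point in opposite directions. A minimal sketch of that proportion follows, under simplifying assumptions that are not from the paper: each item is keyed +1 or -1 toward the trait, and the response bias is a single direction shared by all items.

```python
def response_orthogonality(trait_keys, bias_direction):
    """Share of items whose trait-keyed direction opposes the bias direction.

    trait_keys: +1 if a high response indicates more of the trait,
                -1 if the item is reverse-keyed (assumed encoding).
    bias_direction: +1 if responses drift toward the high end of the scale,
                    -1 if they drift toward the low end (assumed encoding).
    """
    opposed = sum(1 for key in trait_keys if key * bias_direction < 0)
    return opposed / len(trait_keys)


balanced = [+1, +1, -1, -1]      # half the items are reverse-keyed
one_sided = [+1, +1, +1, +1]     # no reverse-keyed items
print(response_orthogonality(balanced, bias_direction=+1))
print(response_orthogonality(one_sided, bias_direction=+1))
```

In this toy encoding, an instrument with no reverse-keyed items scores zero, so a directional bias would be indistinguishable from the trait itself.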

13. [2606.20198] Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores

Link: https://arxiv.org/abs/2606.20198

Authors: Augustin Bouquillard (X), Florent Jacquemard (CEDRIC - VERTIGO)

Subjects: Computation and Language (cs.CL)

Keywords: Key Signature, global Key Signature, key estimation, algorithm for pitch, pitch spelling

Abstract:We present an algorithm for pitch spelling and key estimation. Given an input in MIDI-like format, containing information on note pitches (expressed in semitones relative to the lowest reference note) and bar boundaries, it estimates the appropriate note names, a global Key Signature, and a local scale for each bar. These related pieces of information are evaluated jointly during two stages of optimisation. During an initial 'modal' stage, a probable scale is proposed for each bar, minimising the number of accidentals to be printed in the score with a shortest-path search. Then, during a second stage called 'tonal', these local scales are used to estimate the Key Signature and note names that would result in the best musical notation for the entire piece. We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores for piano and monophonic instruments. Our procedure was originally designed for use in music transcription, specifically for building digital collections of jazz solos transcribed from audio recordings, for the purposes of music analysis, teaching and the preservation of cultural heritage. This method should also prove useful for other tasks related to the processing of musical notation. Furthermore, to this end, we have defined new distances between various common jazz scales, which may be of some interest to musicological studies.

14. [2606.20179] ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

Link: https://arxiv.org/abs/2606.20179

Authors: Maxim Melichov, Yakov Kolani, Morris Alper

Subjects: Computation and Language (cs.CL)

Keywords: creating substantial ambiguity, vowels largely unwritten, International Phonetic Alphabet, conversion for Modern, leaves vowels largely

Abstract:Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.

15. [2606.20164] MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

Link: https://arxiv.org/abs/2606.20164

Authors: Aueaphum Aueawatthanaphisut

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Keywords: Real-world clinical decision, Real-world clinical, Multimodal Health Intelligence, clinical, support requires reasoning

Comments: 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

Abstract:Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

16. [2606.20155] NAMESAKES: Probing Identity Memorization in Text-to-Image Models

Link: https://arxiv.org/abs/2606.20155

Authors: Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: raising privacy concerns, generate realistic likenesses, models generate realistic, raising privacy, privacy concerns

Abstract:Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.

17. [2606.20152] From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

Link: https://arxiv.org/abs/2606.20152

Authors: Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, substantially transformed Automated, Language Models, Large Language, transformed Automated Essay

Comments: This is a preprint of a manuscript currently under peer review

Abstract:Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual "essay scoring neurons" whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

18. [2606.20138] Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Link: https://arxiv.org/abs/2606.20138

Authors: Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: diverse academic disciplines, current static-prompt tutoring, static-prompt tutoring systems, tutoring systems struggle, LLMs can personalize

Abstract:LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines (0.694 vs. 0.647 and 0.64, p < 0.001). A/B testing (N = 656 conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns (p = 0.007). While a greedy router achieves an exercise conversion rate comparable to the baseline (19.1% vs. 19.6%), a stochastic router that samples strategies leads to a higher conversion rate (28.1%).

19. 【2606.20113】When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

链接：https://arxiv.org/abs/2606.20113

作者：Elroy Galbraith

类目：Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：Streaming Retrieval-Augmented Generation, reduces user-perceived latency, Retrieval-Augmented Generation, Streaming RAG, ongoing user input

备注：

点击查看摘要

Abstract:Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence $\delta$, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, $\delta$=3w/s, $\theta$=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, $\phi_{\mathrm{suf}}$ is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, $\epsilon^2$=0.04), directly informing when a learned speculative trigger is worth its cost.
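
To make the latency-hiding idea concrete, here is a minimal Python sketch. It is not the paper's bound H, whose exact form the abstract does not give. It assumes a simplified model in which a tool call issued once the query stabilizes overlaps with the time the user needs to finish the remaining words. The function name `hidden_latency_ms` and its parameters are hypothetical.

```python
def hidden_latency_ms(n_words, stable_at_word, tool_latency_ms, words_per_sec):
    """Estimate how much tool latency overlaps with the user's remaining input.

    Assumption (not from the paper): the tool call starts as soon as the
    speculative query stabilizes, and only the time the user still spends
    typing or speaking can hide latency.
    """
    remaining_words = max(0, n_words - stable_at_word)
    remaining_ms = 1000.0 * remaining_words / words_per_sec
    # Hidden latency can exceed neither the tool latency nor the remaining input time.
    return min(float(tool_latency_ms), remaining_ms)


# Early stabilization: the whole 600 ms call hides behind the remaining input.
early = hidden_latency_ms(n_words=10, stable_at_word=4,
                          tool_latency_ms=600, words_per_sec=3)
# Late stabilization: only one word of input time is left to hide behind.
late = hidden_latency_ms(n_words=10, stable_at_word=9,
                         tool_latency_ms=600, words_per_sec=3)
```

Under this toy model, a query that stabilizes early hides the full tool latency, while one that stabilizes on the second-to-last word hides only about a third of a second.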

20. 【2606.20097】HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

链接：https://arxiv.org/abs/2606.20097

作者：Zhentao Tan,Wei Chen,Jingyi Shen,Yao Liu,Xu Shen,Yue Wu,Jieping Ye

类目：Computation and Language (cs.CL)

关键词：spurring interest, quadratic complexity, poses a critical, critical bottleneck, attention

备注：

点击查看摘要

Abstract:The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

21. 【2606.20093】Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

链接：https://arxiv.org/abs/2606.20093

作者：William Guey,Pierrick Bougault

类目：Computation and Language (cs.CL)

关键词：Large language models, Large language, increasingly review, revise text, review and revise

备注： 7 pages, 3 tables. Code and data: [this https URL](https://github.com/williamguey/self-preference-revision)

点击查看摘要

Abstract:Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.

22. 【2606.20089】IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

链接：https://arxiv.org/abs/2606.20089

作者：Arash Ghafouri,Mahdi Firouzmandi,Hossein Saberi,Mohammad Reza Hasani Ahangar

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：high-quality pretraining corpora, pretrained language models, Persian PLM trained, monolingual Persian PLM, Persian pretrained language

备注：

点击查看摘要

Abstract:Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.

23. 【2606.20075】What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

链接：https://arxiv.org/abs/2606.20075

作者：Xinghao Chen,Chak Tou Leong,Wenjin Guo,Jian Wang,Wenjie Li,Xiaoyu Shen

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：continuous hidden states, discrete reasoning traces, verbose discrete reasoning, Latent, hidden states

备注：

点击查看摘要

Abstract:Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at this https URL.

24. 【2606.20072】Source-Grounded Data Generation for Text-to-JSON Learning

链接：https://arxiv.org/abs/2606.20072

作者：Sunghee Ahn,Guijin Son,Youngjae Yu

类目：Computation and Language (cs.CL)

关键词：legacy industries rely, industries rely heavily, store high-value information, clinical records, legacy industries

备注： Preprint

点击查看摘要

Abstract:From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.

25. 【2606.20065】Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

链接：https://arxiv.org/abs/2606.20065

作者：Pratyush Kumar(Ranqo)

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：search engine optimization, scrolling search results, Generative Engine Optimization, Answer Engine Optimization, engine optimization

备注： 14 pages, 4 tables; v1.0 preprint

点击查看摘要

Abstract:People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

ACM classes: H.3.3

26. 【2606.20023】When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

链接：https://arxiv.org/abs/2606.20023

作者：Kaiyue Yang,Yuyan Bu,Jingwei Yi,Yuchi Wang,Biyu Zhou,Juntao Dai,Songlin Hu,Yaodong Yang

类目：Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：LLM agents increasingly, privileges become safety-relevant, select tools autonomously, agents increasingly select, increasingly select tools

备注： code: [this https URL](https://github.com/AISafetyHub/agent-tool-selection-bias)

点击查看摘要

Abstract:As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.

27. 【2606.20002】Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

链接：https://arxiv.org/abs/2606.20002

作者：Yanxi Chen,Weijie Shi,Yuexiang Xie,Boyi Hu,Yaliang Li,Bolin Ding,Jingren Zhou

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：large language models, future tasks conditioned, training large language, updated context, language models

备注： Work in progress; we will continuously update the codebase and arXiv version

点击查看摘要

Abstract:This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at this https URL.

28. 【2606.19996】Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

链接：https://arxiv.org/abs/2606.19996

作者：Yongqi Shao,Hong Huo,Flavio Bertini,Danilo Montesi,Tao Fang

类目：Sound (cs.SD); Computation and Language (cs.CL)

关键词：non-invasive digital biomarker, cognitive impairment detection, low-cost and non-invasive, non-invasive digital, digital biomarker

备注： 15 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

Submission history: v1, Thu, 18 Jun 2026 09:32:24 UTC (4,429 KB)

29. 【2606.19946】GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

链接：https://arxiv.org/abs/2606.19946

作者：Yu Deng

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：modifying intermediate hidden, intermediate hidden states, controls model behavior, time without retraining, Activation steering controls

备注： 30 pages, 5 figures, 20 tables. Code and logs are available at: [this https URL](https://github.com/LuLu663939/gems-multi-semantic-steering)

点击查看摘要

Abstract:Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
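
As a rough illustration of two of the geometric constraints named in the abstract (orthogonalizing the semantic directions and preserving the hidden-state norm), here is a minimal pure-Python sketch. It is not the authors' implementation. Gram-Schmidt is swapped in as one simple way to orthogonalize, rescaling to the original norm is an assumed reading of "norm-preserving weighted superposition", and the names `orthogonalize` and `steer` are hypothetical. The attention-pathway injection is not modeled.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def orthogonalize(directions):
    """Gram-Schmidt: turn the directions into mutually orthogonal unit vectors."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]  # remove the component along b
        n = norm(r)
        if n < 1e-12:
            raise ValueError("directions are linearly dependent")
        basis.append([x / n for x in r])
    return basis


def steer(hidden, directions, weights):
    """Add weighted orthogonalized directions, then rescale to the original norm."""
    out = list(hidden)
    for w, b in zip(weights, orthogonalize(directions)):
        out = [x + w * y for x, y in zip(out, b)]
    out_norm = norm(out)
    if out_norm == 0.0:
        raise ValueError("steered state has zero norm and cannot be rescaled")
    scale = norm(hidden) / out_norm  # keep the activation magnitude unchanged
    return [x * scale for x in out]


hidden = [3.0, 4.0, 0.0]
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # deliberately non-orthogonal
steered = steer(hidden, directions, weights=[1.0, 1.0])
```

In this toy setup the steered vector points in a new direction but keeps the length of the original hidden state, which is the property the abstract associates with avoiding distributional deviation.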

30. 【2606.19911】Multi-Agent Transactive Memory

链接：https://arxiv.org/abs/2606.19911

作者：To Eun Kim,Xuhong He,Dishank Jain,Ambuj Agrawal,Negar Arabzadeh,Fernando Diaz

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：deployment of LLM, diverse tasks motivates, tasks motivates infrastructure, LLM agents, diverse capabilities

备注：

点击查看摘要

Abstract:The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.

31. 【2606.19910】Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

链接：https://arxiv.org/abs/2606.19910

作者：Syeda Faiza Ahmed Sara,Shammur Absar Chowdhury

类目：Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词：Training automated pronunciation, automated pronunciation assessment, Training automated, labeled learner errors, costly to collect

备注： Accepted to Interspeech 2026

点击查看摘要

Abstract:Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit--DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
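
The surprisal idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say what kind of token language model is used, so an add-one-smoothed bigram model stands in for it, the integer token IDs stand in for K-means codebook indices, and the names `train_bigram` and `mean_surprisal` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab_size):
    """Fit an add-one-smoothed bigram model on native token sequences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1

    def prob(prev, cur):
        # Laplace smoothing keeps unseen transitions at a small nonzero probability.
        return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def mean_surprisal(seq, prob):
    """Average surprisal in bits per transition; higher means less native-like."""
    if len(seq) < 2:
        raise ValueError("need at least two tokens")
    bits = [-math.log2(prob(p, c)) for p, c in zip(seq, seq[1:])]
    return sum(bits) / len(bits)


# Toy "native" data: the same discrete-unit pattern repeated many times.
native = [[0, 1, 2, 3]] * 50
prob = train_bigram(native, vocab_size=4)

native_like = mean_surprisal([0, 1, 2, 3], prob)
deviating = mean_surprisal([0, 3, 1, 0], prob)
```

A sequence that follows the transitions seen in native data gets low average surprisal, while one with unseen transitions scores higher, which is the signal the framework treats as phonotactic deviation.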

32. 【2606.19881】REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

链接：https://arxiv.org/abs/2606.19881

作者：Guneesh Vats,Anubha Agrawal,Shikha Singhal,Ajita Dash,Praison Selvaraj,Vidhan Jhawar,Ranga Prasad Chenna,Bharadwaj Y M G

类目：Computation and Language (cs.CL)

关键词：personally identifiable information, existing corpora cover, hoc generation conditions, detection remains limited, entity types

备注： 14 pages, 5 figures

点击查看摘要

Abstract:Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.

33. 【2606.19864】The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

链接：https://arxiv.org/abs/2606.19864

作者：Serge Sharoff

类目：Computation and Language (cs.CL)

关键词：Large Language Models, prominence of Large, public discourse presents, Language Models, Large Language

备注： Published in Handbook of Democracy in the Era of Artificial Intelligence, edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

点击查看摘要

Abstract:The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

34. 【2606.19857】Large Language Models Do Not Always Need Readable Language

链接：https://arxiv.org/abs/2606.19857

作者：Jiayi Zhu,Haoxuan Peng,Junxi Wang,Liang Ke,Chen Zhang,Linfeng Zhang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large language models, Large language, commonly prompted, prompted and interfaced, interfaced with human-readable

备注： 23 pages, 10 figures. Preprint

点击查看摘要

Abstract:Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

35. 【2606.19852】Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

Link: https://arxiv.org/abs/2606.19852

Authors: Aman Pathak(1), Cheng Peng(1), Mengxian Lyu(1), Ziyi Chen(1), Reema Solan(1), Sankalp Talankar(1), Yasir Khan(1), Hiren Mehta(2), Aokun Chen(3), Yi Guo(1), Yonghui Wu(1)

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: tumor registry population, cancer staging, tumor registry, registry population, Named Entity Recognition

Comments: 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

Abstract:Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.

36. 【2606.19847】AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Link: https://arxiv.org/abs/2606.19847

Authors: Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, windows limit long-term, limit long-term information, long-term information accumulation

Comments: 19 pages, 10 figures, 5 tables

Abstract:Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

37. 【2606.19831】Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Link: https://arxiv.org/abs/2606.19831

Authors: Hongliang Liu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Aligned language models, language models gate, Aligned language, models gate behaviors, sparse feed forward

Abstract:Aligned language models gate behaviors such as refusal and language routing through sparse feed forward neurons, yet no theory predicts when a single neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget normalized control window framework for single neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten of fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti predicts control: true controllers write off the readout axis and carry a near zero first order gradient. A forward only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed dose anecdote.

38. 【2606.19830】JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Link: https://arxiv.org/abs/2606.19830

Authors: Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Current AI-driven game, made substantial progress, remains largely unexplored, largely unexplored due, web-based game coding

Abstract:Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

39. 【2606.19819】CREDENCE: Claim Reduction for Decomposition Enhanced Credibility -- Semantic Metrics and Convergence Analysis

Link: https://arxiv.org/abs/2606.19819

Authors: Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Decomposing compound sentences, reliable automated fact-checking, Decomposing compound, sentences into atomic, compound sentences

Comments: 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation

Abstract:Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use a BGE-large cosine-similarity fidelity metric that resolves Jaccard's penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption, whereas LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.
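
Editor's illustration (not from the paper): the abstract's point that token-overlap metrics penalise paraphrases can be seen with a plain word-set Jaccard score. The sketch below assumes whitespace tokenisation and lowercase matching; the function name `jaccard` and the example sentences are hypothetical and are not taken from Credence or its benchmarks.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity |A ∩ B| / |A ∪ B| over lowercase word sets."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical example sentences (illustration only).
reference = "the company fired half of its staff"
paraphrase = "the firm dismissed 50% of employees"  # same meaning, different words
verbatim = "the company fired half of its staff"

print(jaccard(reference, paraphrase))  # only "the" and "of" overlap
print(jaccard(reference, verbatim))    # identical token sets
```

The paraphrase shares just two of eleven distinct tokens with the reference, so it scores low despite carrying the same meaning; an embedding-based cosine similarity, as the abstract describes, is intended to avoid that penalty.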

40. 【2606.19815】Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Link: https://arxiv.org/abs/2606.19815

Authors: Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

Categories: Computation and Language (cs.CL)

Keywords: BERT achieve strong, achieve strong text, strong text classification, lack transparency, high-stakes settings

Abstract:Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

41. 【2606.19808】Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Link: https://arxiv.org/abs/2606.19808

Authors: Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Test-time reasoning, repair failed attempts, serving-time control knob, uniformly valuable, waste compute

Abstract:Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce \sevra, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On \mathfive, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to \gsm, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.

42. 【2606.19788】CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Link: https://arxiv.org/abs/2606.19788

Authors: Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, evaluating combinatorial counting, large language, typed Cofola specification, evaluating combinatorial

Comments: Under review. Code: [this https URL](https://github.com/YuxuZhou-CN/combination-problem-generation)

Abstract:We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relatively positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.

43. 【2606.19782】AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Link: https://arxiv.org/abs/2606.19782

Authors: Aravind Narayanan, Shaina Raza

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: regulated settings demands, external model providers, send client data, answering in regulated, regulated settings

Abstract:Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.

44. 【2606.19750】Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Link: https://arxiv.org/abs/2606.19750

Authors: Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: improving reasoning capabilities, training efficiency depends, efficiency depends critically, large language models, Reinforcement learning

Comments: Webpage: [this https URL](https://darrienmckenzie.com/manifold-bandits/)

Abstract:Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

45. 【2606.19749】Benchmarking Agentic Review Systems

Link: https://arxiv.org/abs/2606.19749

Authors: Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: class of agentic, agentic review systems, peer review systems, review systems, agentic review

Comments: 11 pages, 7 tables, 4 figures

Abstract:A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
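
Editor's illustration (not from the paper): "pairwise accuracy" is commonly read as the share of item pairs whose ordering by predicted score agrees with their ordering by a reference signal, so that random scoring lands near 50%. The sketch below assumes that reading; the paper's exact protocol is not stated in the abstract, and the function name, the tie handling, and the toy data are all hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(review_scores: Sequence[float], quality: Sequence[float]) -> float:
    """Fraction of pairs with different quality whose review-score order matches.

    Assumptions made for this sketch: pairs with equal quality are skipped,
    and a tie in review score counts as a miss.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(quality)), 2):
        if quality[i] == quality[j]:
            continue  # no ground-truth ordering for this pair
        total += 1
        if (review_scores[i] - review_scores[j]) * (quality[i] - quality[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Toy data: four papers, quality 1 = accepted, 0 = rejected (hypothetical).
scores = [7.0, 5.0, 6.0, 3.0]
labels = [1, 0, 0, 1]
print(pairwise_accuracy(scores, labels))
```

In the toy data, the first accepted paper outranks both rejected ones while the second accepted paper is ranked below both, so two of the four comparable pairs agree.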

46. 【2606.19744】Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Link: https://arxiv.org/abs/2606.19744

Authors: Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Aligning language models, requires optimising multiple, optimising multiple behavioural, Aligning language, multiple behavioural objectives

Comments: Submitted to EMNLP 2026

Abstract:Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.

47. 【2606.19727】NRITYAM: Language Models Meet Art and Heritage of Dance

Link: https://arxiv.org/abs/2606.19727

Authors: Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: shaping modern workflows, Language models, modern workflows, essential tools, tools in shaping

Comments: 18 pages, 12 figures, in ECML_PKDD'26

Abstract:Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at this https URL.

48. 【2606.19719】Closing the Calibration Gap in Semantic Caching

Link: https://arxiv.org/abs/2606.19719

Authors: Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: cuts LLM inference, LLM inference costs, semantically similar queries, caching cuts LLM, cuts LLM

Comments: 23 pages, 2 figures. Source code: [this https URL](https://github.com/aditeyabaral/calibration-gap-semantic-caching) ; Models and Datasets: [this https URL](https://huggingface.co/redis)

Abstract:Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
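
Editor's illustration (not from the paper): the gap between ranking quality and usability at a fixed threshold can be shown with two hypothetical scorers that order queries identically but emit scores on different scales. The sketch below assumes that a cache hit means "similarity score at or above the threshold" and that precision is the share of hits that were true matches; the paper's exact P-CHR and CRR definitions are not given in the abstract, so the function and data here are assumptions.

```python
from typing import List, Tuple


def precision_and_hit_ratio(
    scores: List[float], is_match: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (precision, cache hit ratio) when serving every query scoring >= threshold."""
    served = [match for score, match in zip(scores, is_match) if score >= threshold]
    hit_ratio = len(served) / len(scores) if scores else 0.0
    # Convention chosen for this sketch: with no hits, nothing wrong was served.
    precision = sum(served) / len(served) if served else 1.0
    return precision, hit_ratio


# Hypothetical data: both models rank the five queries in the same order,
# so any ranking-only metric treats them as equal.
is_match = [True, True, False, True, False]
model_a = [0.95, 0.90, 0.85, 0.80, 0.60]
model_b = [0.76, 0.72, 0.68, 0.64, 0.48]  # same ordering, lower score scale

print(precision_and_hit_ratio(model_a, is_match, threshold=0.8))
print(precision_and_hit_ratio(model_b, is_match, threshold=0.8))
```

At the shared threshold of 0.8, the first scorer serves four of five queries from cache while the second serves none, even though their rankings are identical.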

49. 【2606.19710】FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

Link: https://arxiv.org/abs/2606.19710

Authors: Elijah Feldman, Dipak Meher, Carlotta Domeniconi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Court proceedings, knowledge graph construction, human smuggling networks, buried within unstructured, proceedings contain valuable

Comments: Code available at [this https URL](https://github.com/ElijahFeldman7/FineREX)

Abstract:Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of $512$ text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

50. 【2606.19706】NEST: Narrative Event Structures in Time for Long Video Understanding

Link: https://arxiv.org/abs/2606.19706

Authors: Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: handle extended token, extended token streams, long video sequences, increasingly long video, Long Video Understanding

Abstract:Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

51. 【2606.19700】TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Link: https://arxiv.org/abs/2606.19700

Authors: Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

Categories: Computation and Language (cs.CL)

Keywords: Researchers are interested, habitable for humans, interested in learning, eventually become habitable, Small Language Model

Comments: 16 pages, 1 figure, 4 tables

Abstract:Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

52. 【2606.19698】What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

Link: https://arxiv.org/abs/2606.19698

Authors: Jason Potteiger

Categories: Computation and Language (cs.CL)

Keywords: customer support data, sentiment analysis, companies read, customers sound, customers

Comments: 25 pages, 6 figures

Abstract:Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.

53. 【2606.19697】Efficiently Representing Algorithms With Chain-of-Thought Transformers

Link: https://arxiv.org/abs/2606.19697

Authors: Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Word RAM, perform arbitrary computation, Word RAM algorithms, RAM

Abstract:The increasing popularity of reasoning models -- language models that output a series of reasoning or thought tokens before producing an answer -- is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: Can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra's algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions -- in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

54. 【2606.19668】Code-Switching Reveals Language Anchoring in Multilingual LLMs

Link: https://arxiv.org/abs/2606.19668

Authors: Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

Categories: Computation and Language (cs.CL)

Keywords: Multilingual Large Language, Large Language Models, Multilingual Large, frequently degrades performance, degrades performance relative

Comments: 36 pages, 13 figures, 27 tables

Abstract:Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
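
The abstract does not give the formula for Anchor Bias, so the following is only a rough illustration of what a geometric anchoring score could look like. It assumes (this is not taken from the paper) that anchoring is measured as the difference between two cosine similarities; the function name `anchor_bias_sketch` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_bias_sketch(
    h_cs: Sequence[float],
    h_source: Sequence[float],
    h_target: Sequence[float],
) -> float:
    """Hypothetical anchoring score, NOT the paper's definition.

    Positive values mean the code-switched hidden state sits closer to the
    source-language counterpart; negative values mean it sits closer to the
    target-language counterpart.
    """
    return cosine(h_cs, h_source) - cosine(h_cs, h_target)


# Toy 2-d hidden states: the code-switched state points the same way as the source.
score = anchor_bias_sketch([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(score)
```

Under this assumed definition, a score near zero would indicate a hidden state that is equally close to both monolingual counterparts.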

55. 【2606.19667】CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Link: https://arxiv.org/abs/2606.19667

Authors: Kaizhen Tan, Rong Gu, Mingyuan Li

Categories: Computation and Language (cs.CL)

Keywords: improves factual grounding, raises prefill cost, improves factual, factual grounding, Retrieval-Augmented Generation

Abstract:Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
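
The idea of a prefix tree plus a greedy walk can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: evidence items are identified by string IDs, the tree stores a simple hit count per node, and the names `PrefixTrie`, `cache_aware_order`, and `serve` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


class PrefixTrie:
    """Trie over evidence-ID sequences that were recently sent to the engine."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTrie"] = {}
        self.hits = 0  # number of served sequences that passed through this node

    def insert(self, sequence: Iterable[str]) -> None:
        node = self
        for doc_id in sequence:
            node = node.children.setdefault(doc_id, PrefixTrie())
            node.hits += 1


def cache_aware_order(trie: PrefixTrie, retrieved: List[str]) -> List[str]:
    """Greedy walk: keep extending the prefix with the most-served child
    that is also part of the retrieved set (IDs assumed unique)."""
    remaining: Set[str] = set(retrieved)
    ordered: List[str] = []
    node = trie
    while remaining:
        candidates = [d for d in node.children if d in remaining]
        if not candidates:
            break
        # Tie-break on the ID so the result is deterministic.
        best = max(candidates, key=lambda d: (node.children[d].hits, d))
        ordered.append(best)
        remaining.discard(best)
        node = node.children[best]
    # Evidence without a reusable prefix keeps its retrieval order.
    ordered.extend(d for d in retrieved if d in remaining)
    return ordered


def serve(trie: PrefixTrie, retrieved: List[str]) -> List[str]:
    """Reorder the evidence, then record the served order for later requests."""
    order = cache_aware_order(trie, retrieved)
    trie.insert(order)
    return order


trie = PrefixTrie()
first = serve(trie, ["a", "b", "c"])   # nothing cached yet, retrieval order is kept
second = serve(trie, ["c", "a", "d"])  # "a" moves to the front to share a prefix
print(first, second)
```

The evidence set itself is never changed, only its order, which matches the abstract's constraint that the retrieved set and the serving engine stay untouched.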

56. 【2606.19660】A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Link: https://arxiv.org/abs/2606.19660

Authors: Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: OWASP Top, existing defenses operate, large language model, isolated pipeline stages, LLM Applications

Comments: Submitted to ICCK Transactions on Information Security and Cryptography

Abstract:Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

57. 【2606.19659】SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

Link: https://arxiv.org/abs/2606.19659

Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Categories: Computation and Language (cs.CL)

Keywords: mitigating exposure bias, OPD, trajectories induced, promising approach, approach for mitigating

Comments: 21 pages, 3 figures

Abstract:On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.

58. 【2606.19647】From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility

Link: https://arxiv.org/abs/2606.19647

Authors: Vinicius Covas

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)

Keywords: FIFA World Cup, Cape Verde goalkeeper, Cape Verde, FIFA World, World Cup

Comments: 11 pages, 4 figures, 3 tables; v0.1 pilot preprint. Dataset and evidence package available at [this https URL](https://doi.org/10.5281/zenodo.20722235)

Abstract:We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as "50k to 8M", as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.

59. 【2606.19640】Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

Link: https://arxiv.org/abs/2606.19640

Authors: Yunkai Xu, Saeed Abdullah

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: address global mental, emerged as promising, promising tools, tools to address, mental health

Comments: 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

Abstract:AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

60. 【2606.19638】MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

Link: https://arxiv.org/abs/2606.19638

Authors: David M. Smiley

Categories: Computation and Language (cs.CL)

Keywords: Hebrew Bible, Modern Hebrew encoder, parallel involves paraphrase, lexical substitution, Modern Hebrew

Abstract:Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at this https URL
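
Cosine-similarity regression can be illustrated with a small, dependency-free sketch. It assumes (the abstract does not say) that true parallels are given a target similarity of 1.0 and random negatives 0.0, and that the loss is the mean squared error between the cosine similarity of the two verse embeddings and that target. The function names and toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (embedding of verse A, embedding of verse B, target similarity)
Pair = Tuple[Sequence[float], Sequence[float], float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regression_loss(pairs: List[Pair]) -> float:
    """Mean squared error between predicted cosine similarity and the target.

    Assumed targets: 1.0 for labeled parallels, 0.0 for random negatives.
    """
    errors = [(cosine(u, v) - target) ** 2 for u, v, target in pairs]
    return sum(errors) / len(errors)


# Toy embeddings: one well-separated parallel and one well-separated negative.
well_separated = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # parallel, embeddings aligned
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # negative, embeddings orthogonal
]
print(cosine_regression_loss(well_separated))
```

Minimizing a loss of this shape pulls parallel pairs toward similarity 1 and pushes negatives toward 0, which is the clustering behavior the abstract describes.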

61. 【2606.19637】Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Link: https://arxiv.org/abs/2606.19637

Authors: Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: electronic health record, detect suicidal behaviors, NLP increasingly relies, health record, suicidal behaviors

Comments: To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Abstract:Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

62. 【2606.19626】Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Link: https://arxiv.org/abs/2606.19626

Authors: Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Byte-Pair Encoding tokenization, fragmenting physical quantities, lexically arbitrary subwords, Byte-Pair Encoding, structured technical entities

Abstract:Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple $(O, \mathrm{classify}, \{\mathrm{inst}_\tau\})$: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

63. 【2606.19625】Where Does Social Reasoning Come From? Capability Provenance in Language Models

Link: https://arxiv.org/abs/2606.19625

Authors: Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: social-reasoning versus STEM-reasoning, support social-reasoning versus, corpus support social-reasoning, pretraining corpus support, corpus regions support

Comments: Under review at COLM 2026 (Conference)

Abstract:We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.

64. 【2606.19591】A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

Link: https://arxiv.org/abs/2606.19591

Authors: Vu Nguyen Nguyen Xuan, Huy Ngo Quang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Speech Processing, International Workshop, Language and Speech, multi-document abstractive summarization, Vietnamese multi-document abstractive

Comments: originally written in 2022

Abstract:In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

65. 【2606.19559】Uncertainty Decomposition for Clarification Seeking in LLM Agents

Link: https://arxiv.org/abs/2606.19559

Authors: Gregory Matsnev

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent position papers, large language model, shared mental-model building, position papers argue, interactive large language

Comments: 26 pages, 8 figures. Source code: [this https URL](https://github.com/PE51K/udcs-in-llm-agents)

Abstract:Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

66. 【2606.19558】Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Link: https://arxiv.org/abs/2606.19558

Authors: Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Fidelity metrics, per-token KL divergence, low-cost proxies, rho, benchmark quality

Abstract:Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($\rho=-0.72$ on Qwen and $\rho=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($\rho=+0.00$ on Qwen and $\rho=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $\rho=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
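
For readers unfamiliar with the metrics named here, the sketch below computes a per-token KL divergence, $\mathrm{KL}(P \,\|\, Q) = \sum_i p_i \log(p_i / q_i)$, between a reference model's next-token distribution $P$ and a quantized model's distribution $Q$, along with top-1 agreement. It is an illustration only: averaging over tokens is one assumed aggregation (the abstract mentions several), and the logits are made-up toy values.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kld(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P || Q) in nats, with P from the reference and Q from the quantized model."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kld(ref: List[Sequence[float]], quant: List[Sequence[float]]) -> float:
    """Assumed aggregation: arithmetic mean of the per-token KLD."""
    per_token = [token_kld(r, q) for r, q in zip(ref, quant)]
    return sum(per_token) / len(per_token)


def top1_agreement(ref: List[Sequence[float]], quant: List[Sequence[float]]) -> float:
    """Fraction of positions where both models pick the same most likely token."""
    def argmax(xs: Sequence[float]) -> int:
        return max(range(len(xs)), key=xs.__getitem__)
    same = sum(argmax(r) == argmax(q) for r, q in zip(ref, quant))
    return same / len(ref)


# Toy logits for two token positions over a three-token vocabulary.
ref_logits = [[2.0, 0.0, -1.0], [0.5, 0.5, 3.0]]
quant_logits = [[1.5, 0.4, -1.0], [0.5, 3.2, 3.0]]
print(mean_kld(ref_logits, quant_logits), top1_agreement(ref_logits, quant_logits))
```

Both numbers quantify how far the quantized model drifts from the reference; neither says whether a changed prediction is better or worse for the task, which is the distinction the abstract draws between volume and direction of disagreement.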

67. 【2606.19552】LaViSA: A Language and Vision Structural Ambiguity Benchmark

Link: https://arxiv.org/abs/2606.19552

Authors: Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

Categories: Computation and Language (cs.CL)

Keywords: admits multiple valid, multiple valid interpretations, valid interpretations due, Structural ambiguity, single sentence admits

Abstract:Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

68. 【2606.19544】Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Link: https://arxiv.org/abs/2606.19544

Authors: Justin D. Norman, Michael U. Rivera, D. Alex Hughes

Categories: Computation and Language (cs.CL)

Keywords: overstates discriminative ability, systematically overstates discriminative, dominant evaluation paradigm, language models, discriminative ability

Abstract:LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33--41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test--retest reliability (0.95) coexists with severe position bias (0.10) in two production-deployed judges (instantiating a consistency--bias paradox), and verbosity bias is small (0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
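
The gap between exact-match agreement and Cohen's kappa comes from chance correction: $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected if both raters labeled independently according to their own label frequencies. The sketch below shows the computation on made-up verdicts; the labels and the handling of the degenerate case are assumptions for illustration, not data or conventions from the paper.

```python
from collections import Counter
from typing import Sequence


def exact_match(judge: Sequence[str], human: Sequence[str]) -> float:
    """Observed agreement p_o: share of items with identical labels."""
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Cohen's kappa: agreement corrected for chance."""
    n = len(judge)
    p_o = exact_match(judge, human)
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    # Expected agreement if the two raters labeled independently.
    p_e = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # returning 0.0 here is an arbitrary choice made for this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


# Made-up pairwise verdicts ("A" wins or "B" wins) with a skewed label distribution.
judge_labels = ["A"] * 8 + ["B"] * 2
human_labels = ["A"] * 7 + ["B", "B", "A"]
print(exact_match(judge_labels, human_labels), cohens_kappa(judge_labels, human_labels))
```

In this toy case the two raters agree on 8 of 10 items, but because both choose "A" 80% of the time, 68% agreement is expected by chance alone and kappa comes out at about 0.375. This is the kind of deflation the abstract reports when moving from exact match to Cohen's kappa.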

69. 【2606.19534】PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Link: https://arxiv.org/abs/2606.19534

Authors: Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: achieved remarkable progress, Multimodal large language, achieved remarkable, remarkable progress, multimodal diffusion language

Comments: Code available at [this https URL](https://github.com/MSALab-PKU/PerceptionDLM)

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.

70. 【2606.19501】DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Link: https://arxiv.org/abs/2606.19501

Authors: Aijie Shu,Bowei Chen,Wenbin Wu,Cathy Yi-Hsuan Chen,Fengxiang He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Risk Management (q-fin.RM)

Keywords: Decentralized finance exposes, networked credit risks, finance exposes supervisors, Decentralized finance, supervisors to fast-moving

Comments:

Click to view abstract

Abstract:Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at this https URL.

71. 【2606.19475】Diffusion Language Models: An Experimental Analysis

Link: https://arxiv.org/abs/2606.19475

Authors: Thomas Bertolani,Davide Bucciarelli,Leonardo Zini,Marcella Cornia,Lorenzo Baraldi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Diffusion Language Models, Large Language, enabling strong performance, revolutionized language modeling

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

72. 【2606.19468】Characterizing Narrative Content in Web-scale LLM Pretraining Data

Link: https://arxiv.org/abs/2606.19468

Authors: Teagan Johnson,Elliott Ash,Andrew Piper,Maria Antoniak

Categories: Computation and Language (cs.CL)

Keywords: corpora remains largely, remains largely unexplored, pretraining corpora remains, web-scale LLM pretraining, LLM pretraining corpora

Comments: 8 pages of main content, 28 total pages. 30 figures

Click to view abstract

Abstract:The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

73. 【2606.19404】Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

Link: https://arxiv.org/abs/2606.19404

Authors: Salim Khazem

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, attention-derived graph Laplacians, graph Laplacians carries, Laplacians carries strong, carries strong signal

Comments:

Click to view abstract

Abstract:Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by $+6.5$ AUROC points and over GoR-4 by $+2.4$ points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC $0.71$, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
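For readers unfamiliar with the thermodynamic quantities named above, the following sketch computes them from a list of Laplacian eigenvalues using the textbook canonical-ensemble definitions. It is an illustration only: the paper's exact normalisation, temperature grid, and form-factor convention are not given in the abstract, so the function names, the inverse temperature `beta`, and the 1/N² normalisation of the form factor are all assumptions.

```python
import cmath
import math


def free_energy_summary(eigenvalues, beta=1.0):
    """Canonical-ensemble quantities, treating eigenvalues as energy levels."""
    weights = [math.exp(-beta * lam) for lam in eigenvalues]
    z = sum(weights)                       # partition function Z
    probs = [w / z for w in weights]       # Boltzmann probabilities
    mean_e = sum(p * lam for p, lam in zip(probs, eigenvalues))
    mean_e2 = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    return {
        "partition_function": z,
        "free_energy": -math.log(z) / beta,
        "mean_energy": mean_e,
        "spectral_entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        "heat_capacity": beta ** 2 * (mean_e2 - mean_e ** 2),
    }


def spectral_form_factor(eigenvalues, t):
    """|sum_j exp(-i * lambda_j * t)|^2 / N^2 (normalisation is an assumption)."""
    n = len(eigenvalues)
    total = sum(cmath.exp(-1j * lam * t) for lam in eigenvalues)
    return abs(total) ** 2 / n ** 2


# Spectrum of the Laplacian of a 3-node path graph: eigenvalues 0, 1, 3.
path_graph_spectrum = [0.0, 1.0, 3.0]
summary = free_energy_summary(path_graph_spectrum, beta=0.7)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A useful sanity check on any such implementation is the identity F = U - S/β, which these definitions satisfy exactly.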

74. 【2606.19388】Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Link: https://arxiv.org/abs/2606.19388

Authors: Li Gu,Zihuan Jiang,Linqiang Guo,Zhixiang Chi,Ziqiang Wang,Huan Liu,Yuanhao Yu,Tse-Hsun Chen,Yang Wang

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: emit screen interactions, Recent advances, screen interactions, Claude Code, GUI

Comments:

Click to view abstract

Abstract:Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

75. 【2606.19379】How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Link: https://arxiv.org/abs/2606.19379

Authors: Stuart Whipp

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Transformer feed-forward networks, feed-forward networks, rarely been measured, trained FFN block, FFN

Comments: 14 pages, 5 figures

Click to view abstract

Abstract:Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (0.99) to strongly nonlinear (0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
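The idea of linear recoverability (fit the exact least-squares linear map, then report the held-out variance it explains) can be sketched in one dimension. This is a toy scalar version under stated assumptions: a real FFN block is a vector-to-vector map and would need a matrix least-squares solve, and the names `fit_affine`, `r2_heldout`, and `linear_recoverability` are hypothetical, not taken from the paper.

```python
import math


def fit_affine(xs, ys):
    """Closed-form least-squares fit of y ~ a * x + b (no optimiser involved)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def r2_heldout(a, b, xs, ys):
    """Fraction of held-out variance explained by the fitted affine map."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def linear_recoverability(block, train_x, test_x):
    """Fit on train inputs, score on disjoint held-out inputs."""
    a, b = fit_affine(train_x, [block(x) for x in train_x])
    return r2_heldout(a, b, test_x, [block(x) for x in test_x])


train_x = [i / 2 for i in range(-4, 5)]            # -2.0, -1.5, ..., 2.0
test_x = [(2 * i + 1) / 4 for i in range(-4, 4)]   # -1.75, -1.25, ..., 1.75

# Three toy scalar "blocks" with increasing nonlinearity.
print(linear_recoverability(lambda x: 2 * x + 1, train_x, test_x))
print(linear_recoverability(math.tanh, train_x, test_x))
print(linear_recoverability(lambda x: x * x, train_x, test_x))
```

An exactly affine block scores 1, while an even function such as x² on a symmetric input range is not recoverable by any linear map, so its held-out score is at or below zero.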

76. 【2606.19356】Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Link: https://arxiv.org/abs/2606.19356

Authors: Anantha Sharma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM systems produce, multi-agent LLM systems, produce bad answers, LLM systems, Argent Signaling Protocol

Comments: 17 pages

Click to view abstract

Abstract:When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).

77. 【2606.19354】Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Link: https://arxiv.org/abs/2606.19354

Authors: Ardit Krasniqi,Luan Vejsiu,Elira Dervishi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Test-time scaling, investing additional compute, large language models, inference time, powerful paradigm

Comments:

Click to view abstract

Abstract:Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.

78. 【2606.19353】Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

Link: https://arxiv.org/abs/2606.19353

Authors: Jinseok Chung,Minkyoung Song,Hyunji Jung,Namhoon Lee

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: In-Context Learning, remains a concern, understand the context, obscuring whether failures, reliability remains

Comments: Accepted to ACL 2026

Click to view abstract

Abstract:In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition, which separates aleatoric from epistemic sources, is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce the concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthiness-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.

79. 【2606.19352】Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

Link: https://arxiv.org/abs/2606.19352

Authors: Yiming Ni,Zhi-Qi Cheng,Jiayu Li,Wei Cheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: expressive visual languages, expressive visual, Sign languages, visual languages, DHH

Comments: Accepted to ACL 2026 Main. 27 pages, 5 figures

Click to view abstract

Abstract:Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (this https URL) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.

80. 【2606.19351】Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

Link: https://arxiv.org/abs/2606.19351

Authors: Xinyan Zhu,Yaoqi Liu,Yue Gao,Huadong Ma,Cheng Yang,Chuan Shi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: question answering, widely applied, applied in question, reasoning infers, decision support

Comments:

Click to view abstract

Abstract:Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state-of-the-art performance compared to 15 baselines.

81. 【2606.19350】Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Link: https://arxiv.org/abs/2606.19350

Authors: Amogh Sheth,Biruk Assefa,Yi Wen Huang,Andrew Lin,Yuhao Ge

Categories: Computation and Language (cs.CL)

Keywords: substantial inference cost, incur substantial inference, Large language models, language models, excel at multi-step

Comments: Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures

Click to view abstract

Abstract:Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.

82. 【2606.19349】Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

Link: https://arxiv.org/abs/2606.19349

Authors: Zhengheng Li,Panrui Li,Xuyang Liu,Puzhi Xia

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion Large Language, Large Language Models, remains largely unexplored, Diffusion Large, Large Language

Comments: 9 figures, 4 tables

Click to view abstract

Abstract:While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial "Recency Effect" in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.

83. 【2606.19348】DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Link: https://arxiv.org/abs/2606.19348

Authors: DeepSeek-AI,Anyi Xu,Bangcai Lin,Bing Xue,Bingxuan Wang,Bingzheng Xu,Bochao Wu,Bowei Zhang,Chaofan Lin,Chen Dong,Chenchen Ling,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chengyu Hou,Chenhao Xu,Chenze Shao,Chong Ruan,Conner Sun,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Donghao Li,Dongjie Ji,Erhang Li,Fang Wei,Fangyun Lin,Fangzhou Yuan,Feiyu Xia,Fucong Dai,Guangbo Hao,Guanting Chen,Guoai Cao,Guolai Meng,Guowei Li,Han Yu,Han Zhang,Hanwei Xu,Hao Li,Haofen Liang,Haoling Zhang,Haoming Luo,Haoran Wei,Haotian Yuan,Haowei Zhang,Haowen Luo,Haoyu Chen,Haozhe Ji,Hengqing Zhang,Honghui Ding,Hongxuan Tang,Huanqi Cao,Huazuo Gao,Hui Qu,Hui Zeng,J Yang,JQ Zhu,Jia Luo,Jia Song,Jia Yu,Jialiang Huang,Jialu Cai,Jian Liang,Jiangting Zhou,Jiasheng Ye,Jiashi Li,Jiaxin Xu,Jiewen Hu,Jieyu Yang,Jin Chen,Jin Yan,Jingchang Chen,Jingli Zhou,Jingting Xiang,Jingyang Yuan,Jingyuan Cheng,Jingzi Zhou,Jinhua Zhu,Jiping Yu,Joseph Sun,Jun Ran,Junguang Jiang,Junjie Qiu,Junlong Li,Junmin Zheng,Junxiao Song,Kai Dong,Kaige Gao,Kang Guan,Kexing Zhou,Kezhao Huang,Kuai Yu,Lean Wang,Lecong Zhang,Lei Wang,Leyi Xia,Li Zhang,Liang Zhao,Lihua Guo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Compressed Sparse Attention, Heavily Compressed Attention, combines Compressed Sparse, including two strong, present a preview

Comments:

Click to view abstract

Abstract:We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models -- DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) -- both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at this https URL.

84. 【2606.19347】How LLMs Fail and Generalize in RTL Coding for Hardware Design?

Link: https://arxiv.org/abs/2606.19347

Authors: Guan-Ting Liu,Chao-Han Huck Yang,Chenhui Deng,Zhongzhi Yu,Brucek Khailany,Yu-Chiang Frank Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Keywords: Translating sequential programming, sequential programming priors, parallel temporal logic, Translating sequential, large language models

Comments: Preview, under submission for EMNLP 2026

Click to view abstract

Abstract:Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

85. 【2606.19346】Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

Link: https://arxiv.org/abs/2606.19346

Authors: Ahmed Haj Ahmed,Ruochen Zhang,Alvin Grissom II

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluating zero-shot reading, zero-shot reading comprehension, Arabic and evaluating, comprehension on Semitic, non-Semitic controls

Comments:

Click to view abstract

Abstract:We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding -- the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

86. 【2606.19345】Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

Link: https://arxiv.org/abs/2606.19345

Authors: Zhyar Rzgar K. Rostam,Márta Péntek,János Tibor Czere,Zsombor Zrubka,László Gulácsi,Gábor Kertész

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: systematic literature reviews, increasingly resource consuming, scientific publications leads, literature reviews, resource consuming

Comments: 6 pages, 7 tables, 8 equations

Click to view abstract

Abstract:The rapid increase in scientific publications makes manual study screening in systematic literature reviews (SLRs) increasingly resource-consuming, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google's Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a 0.74 weighted F1-score and 0.74 accuracy, exceeding individually attained results. The ensembling of top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability results from the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.
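
The weighted-ensemble step mentioned in the abstract can be illustrated with a minimal sketch of weighted soft voting. This is not the authors' implementation: the per-model probabilities, the weights, and the 0.5 decision threshold below are all hypothetical values chosen for illustration, and the function name `ensemble_score` is an assumption.

```python
from typing import Dict


def ensemble_score(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-model probabilities (weighted soft voting)."""
    total = sum(weights[name] for name in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * p for name, p in probs.items()) / total


# Hypothetical probabilities that one abstract reports EQ-5D results,
# and hypothetical model weights (for example, derived from validation F1).
probs = {"gemini-2.5-pro": 0.80, "gemma-3-12b": 0.40, "gemma-3-27b": 0.55}
weights = {"gemini-2.5-pro": 0.5, "gemma-3-12b": 0.2, "gemma-3-27b": 0.3}

score = ensemble_score(probs, weights)
is_eq5d_study = score >= 0.5  # the 0.5 decision threshold is also an assumption
```

With these illustrative numbers, a single dissenting model (gemma-3-12b at 0.40) is outvoted by the two higher-weighted models, which is the kind of precision/recall balancing an ensemble is meant to provide.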

87. 【2606.19344】Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Link: https://arxiv.org/abs/2606.19344

Authors: Matteo Pelossi,Rita Sevastjanova,Thilo Spinner,Mennatallah El-Assady

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, nature of text, evaluate LLM bias, difficult to evaluate

Comments: 14 pages

Click to view abstract

Abstract:Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

88. 【2606.18649】Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

Link: https://arxiv.org/abs/2606.18649

Authors: Serena A. Hoffstedde,Machiko Hirota,Akshara Nadayanur Sathis Kanna,Rihito Kotani,Ujwal Kumar,Gabriele Trovato,Phan Xuan Tan

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large language models, LLM hiring decisions, Large language, focused on English-language, Western-format resumes

Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

89. 【2606.20137】PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Link: https://arxiv.org/abs/2606.20137

Authors: Masaya Kawamura,Yuma Shirahata,Kentaro Mitsui,Reo Shimizu

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: utterance-level naturalness MOS, typically predict utterance-level, predict utterance-level naturalness, Speech Quality Assessment, localized pitch-accent errors

Comments: Accepted to INTERSPEECH 2026

Click to view abstract

Abstract:Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at this https URL.

90. 【2606.19951】Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Link: https://arxiv.org/abs/2606.19951

Authors: Masato Takagi,Masaya Kawamura,Reo Shimizu,Yuma Shirahata

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: fidelity remains unclear, capture quality differences, remains unclear, proxy metrics, ability to capture

Comments: Accepted to INTERSPEECH 2026

Click to view abstract

Abstract:Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

Information Retrieval

1. 【2606.20554】Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

Link: https://arxiv.org/abs/2606.20554

Authors: Ruizhong Qiu,Yinglong Xia,Dongqi Fu,Hanqing Zeng,Ren Chen,Xiangjun Fan,Hong Li,Hong Yan,Hanghang Tong

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Generative recommendation, aiming to predict, emerging paradigm, shown promise, predict users'

Comments:

Click to view abstract

Abstract:Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

2. 【2606.20550】Easy Reads: A Python program for making Scientific Papers on arXiv more Reader Friendly and Accessible

Link: https://arxiv.org/abs/2606.20550

Authors: Vishal Verma

Categories: Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: tightly arranged figures, Easy Reads, line spacing, arranged figures, frequently dense

Comments: 9 pages. Open-source software project available at: [this https URL](https://github.com/Curious-flow/Easy-Reads)

Click to view abstract

Abstract:Scientific papers are frequently dense and characterized by features such as small fonts and line spacing, double columns of text, and tightly arranged figures. While these features make papers more compact, they can hinder readability, reduce accessibility, and strain the reader. arXiv is a premier open-access repository for scientific papers across different fields and is used extensively by researchers, including those in the physics and astrophysics communities. Easy Reads is an automated, end-to-end, open-source Python program that helps address this challenge by making papers from arXiv more reader-friendly and accessible. Easy Reads can automatically fetch a paper from arXiv via its URL and work with the source TeX file to allow custom formatting of the paper's features, primarily the font size and the number of columns used. The main goal of Easy Reads is to make scientific papers easier to read.

3. 【2606.20280】ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

Link: https://arxiv.org/abs/2606.20280

Authors: Yuhan Liu,Pei Fu,Hang Li,Yukun Qi,Chao Jiang,Jingwen Fu,Zhen Liu,Bin Qin,Zhenbo Luo,Jian Luan,Jingmin Xin

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Leveraging Multimodal Large, Multimodal Large Language, Large Language Models, Universal Multimodal Retrieval, Leveraging Multimodal

Comments: Accepted by ECCV 2026

Click to view abstract

Abstract:Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have overlooked grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positives and negatives. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.

4. 【2606.20235】ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

Link: https://arxiv.org/abs/2606.20235

Authors: Tingyue Pan,Mingyue Cheng,Daoyu Wang,Yitong Zhou,Jie Ouyang,Qi Liu,Enhong Chen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: intent-driven literature exploration, Academic paper search, Academic paper, paradigm for iterative, core step

Comments:

Click to view abstract

Abstract:Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.

5. 【2606.20113】When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.20113

Authors: Elroy Galbraith

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Streaming Retrieval-Augmented Generation, reduces user-perceived latency, Retrieval-Augmented Generation, Streaming RAG, ongoing user input

Comments:

Click to view abstract

6. 【2606.20065】Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

Link: https://arxiv.org/abs/2606.20065

Authors: Pratyush Kumar(Ranqo)

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: search engine optimization, scrolling search results, Generative Engine Optimization, Answer Engine Optimization, engine optimization

Comments: 14 pages, 4 tables; v1.0 preprint

Click to view abstract

ACM classes: H.3.3

7. 【2606.20047】PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

Link: https://arxiv.org/abs/2606.20047

Authors: Manu Ghulyani,Arunabh Singh,Karan Bharadwaj,Ankit Nath,Suranjan Goswami

Categories: Information Retrieval (cs.IR)

Keywords: tool-using LLM agents, Conversational and tool-using, tool-using LLM, LLM agents operate, LLM agents

Comments:

Click to view abstract

Abstract:Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval augmented generation fetches external documents into the prompt but does not arbitrate the agent's already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.

8. 【2606.19960】Stellar: Scalable Multimodal Document Retrieval for Natural Language Queries

Link: https://arxiv.org/abs/2606.19960

Authors: Yuxiang Guo,Zhonghao Hu,Yuren Mao,Yuhang Liu,Congcong Ge,Xiaolu Zhang,Jun Zhou,Yunjun Gao

Categories: Information Retrieval (cs.IR)

Keywords: Retrieval-Augmented Generation, natural language query, relevant multimodal document, plays an essential, Multimodal document retrieval

Comments:

Click to view abstract

Abstract:Multimodal document retrieval--selecting the most relevant multimodal document from a large corpus to answer a natural language query--plays an essential role in Retrieval-Augmented Generation (RAG) systems. State-of-the-art methods represent each document and query with multiple token-level embeddings and use late interaction to achieve high effectiveness. However, such multi-vector representations incur substantial memory overhead during retrieval, leading to poor scalability and hindering real-world deployment. In this paper, we present Stellar, a scalable multimodal document retrieval framework that stores token-level document embeddings on disk and loads only a small set of candidate embeddings into memory for late interaction. Stellar comprises two key components: (i) Lexical Representation-based Filtering (LRF), which fine-tunes a Multimodal Large Language Model (MLLM) as a sparse encoder to produce high-quality lexical representations, enabling efficient and effective document filtering to significantly reduce the candidate set; (ii) Efficient Disk-backed Late Interaction (DLI), which designs an on-disk token embedding storage layout guided by a balanced clustering algorithm, and dynamically loads only the necessary token embeddings into memory using a simple yet effective cost model. Extensive experiments on four real-world benchmarks and a newly presented large-scale dataset demonstrate that Stellar reduces memory overhead and query latency by 1-2 orders of magnitude compared to existing methods without compromising retrieval effectiveness.
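
The filter-then-rerank flow described in the abstract can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not Stellar's implementation: the sparse dot-product filter stands in for LRF, the `load_tokens` callable stands in for disk-backed loading, and MaxSim (the ColBERT-style sum of per-query-token maxima) is assumed as the late-interaction score. All names and data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def sparse_score(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Dot product of two sparse lexical representations (term -> weight)."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token takes its best-matching doc token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
        for q in query_tokens
    )


def retrieve(
    q_sparse: Dict[str, float],
    q_tokens: List[Vector],
    sparse_index: Dict[str, Dict[str, float]],
    load_tokens: Callable[[str], List[Vector]],
    n_candidates: int = 2,
) -> str:
    # Stage 1: cheap lexical filtering keeps only a small candidate set.
    ranked = sorted(
        sparse_index,
        key=lambda doc_id: sparse_score(q_sparse, sparse_index[doc_id]),
        reverse=True,
    )
    candidates = ranked[:n_candidates]
    # Stage 2: load token embeddings only for candidates, then rerank.
    scored = [(maxsim(q_tokens, load_tokens(doc_id)), doc_id) for doc_id in candidates]
    return max(scored)[1]


# Toy data; `token_store` plays the role of on-disk storage.
sparse_index = {"a": {"chart": 1.0, "revenue": 0.5}, "b": {"chart": 0.8}, "c": {"cat": 1.0}}
token_store = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.5, 0.5]], "c": [[0.0, 1.0]]}
loaded: List[str] = []


def load_tokens(doc_id: str) -> List[Vector]:
    loaded.append(doc_id)  # record which documents were actually read
    return token_store[doc_id]


best = retrieve({"chart": 1.0}, [[1.0, 0.0], [0.0, 1.0]], sparse_index, load_tokens)
```

In this toy run, document "c" is filtered out in stage 1 and its token embeddings are never loaded, which is the memory-saving idea the abstract describes.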

9. 【2606.19911】Multi-Agent Transactive Memory

Link: https://arxiv.org/abs/2606.19911

Authors: To Eun Kim,Xuhong He,Dishank Jain,Ambuj Agrawal,Negar Arabzadeh,Fernando Diaz

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: deployment of LLM, diverse tasks motivates, tasks motivates infrastructure, LLM agents, diverse capabilities

Comments:

Click to view abstract

10. 【2606.19898】Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Link: https://arxiv.org/abs/2606.19898

Authors: Qianqian Xiong,Mengxuan Zhang

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: Filtered ANN search, combines vector similarity, modern vector databases, Filtered ANN, filtered ANN methods

Comments: 12 pages

Click to view abstract

Abstract:Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall--QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
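
One way to read the routing step is sketched below. The abstract does not spell out how predicted recall and the benchmark table are combined, so the policy here is an assumption: among configurations whose method is predicted to reach a target recall on this query and whose measured recall also reaches it, pick the highest QPS, and otherwise fall back to the method with the highest predicted recall. The method names, parameter settings, and numbers are hypothetical.

```python
from typing import Dict, Tuple

Config = Tuple[str, str]  # (method, parameter setting)


def route(
    predicted_recall: Dict[str, float],
    benchmark: Dict[Config, Tuple[float, float]],  # config -> (recall, QPS)
    target_recall: float = 0.9,
) -> Config:
    eligible = [
        (qps, method, setting)
        for (method, setting), (recall, qps) in benchmark.items()
        if recall >= target_recall
        and predicted_recall.get(method, 0.0) >= target_recall
    ]
    if eligible:
        # Fastest configuration that is expected to meet the recall target.
        _, method, setting = max(eligible)
        return method, setting
    # Fallback: nothing meets the target, so favor recall over speed.
    best_method = max(predicted_recall, key=lambda m: predicted_recall[m])
    settings = {s: r for (m, s), (r, _) in benchmark.items() if m == best_method}
    return best_method, max(settings, key=lambda s: settings[s])


# Hypothetical offline benchmark table.
benchmark = {
    ("prefilter", "ef=64"): (0.95, 800.0),
    ("prefilter", "ef=16"): (0.85, 2000.0),
    ("postfilter", "ef=64"): (0.92, 1500.0),
    ("postfilter", "ef=16"): (0.70, 4000.0),
}

easy_query = route({"prefilter": 0.96, "postfilter": 0.93}, benchmark)
selective_query = route({"prefilter": 0.95, "postfilter": 0.60}, benchmark)
```

The two example queries land on different methods even though the benchmark table is the same, which mirrors the abstract's observation that the best method can vary per query.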

11. 【2606.19719】Closing the Calibration Gap in Semantic Caching

Link: https://arxiv.org/abs/2606.19719

Authors: Aditeya Baral,Radoslav Ralev,Iliya Sotirov Zhechev,Srijith Rajamohan,Jen Agarwal

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: cuts LLM inference, LLM inference costs, semantically similar queries, caching cuts LLM, cuts LLM

Comments: 23 pages, 2 figures. Source code: [this https URL](https://github.com/aditeyabaral/calibration-gap-semantic-caching) ; Models and Datasets: [this https URL](https://huggingface.co/redis)

Click to view abstract

12. 【2606.19692】When Global Gating Is Enough: Admission-Time Hubness Control in Anisotropic Vector Retrieval Systems

Link: https://arxiv.org/abs/2606.19692

Authors: Prashant Kumar Pathak,Tarun Kumar Sharma

Categories: Cryptography and Security (cs.CR); Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: influence unrelated requests, creates a poisoning, retrieval-augmented generation, unrelated requests, nearest neighbors

Comments:

Click to view abstract

Abstract:Vector hubness, where a few points become nearest neighbors of many queries, creates a poisoning risk in retrieval-augmented generation (RAG): one injected document can influence unrelated requests. Existing defenses use periodic reverse-kNN scans, leaving an exposure window and repeated corpus-wide work. We study admission-time control, scoring each candidate against sentinel queries and quarantining hub-like documents before insertion. Across two 100,000-document corpora, five encoders, and disjoint attacker and defender query sets, a global gate achieves recall 1.0 at the decisive embedding-space point (=0.92 across the effective range) and 0.91 +/- 0.07 on HotFlip attacks, with 1% false positives on general documents. A per-topic gate provides no reliable benefit, consistent with anisotropy coupling local and global visibility. Thresholds are maintained incrementally, with corpus-size-independent insertion cost and amortized deletion cost. On HNSW, admission adds about 3.1% to ingestion latency, scoring remains flat to 10^6 vectors, and 1.2% of decisions flip under approximate indexing, none involving attacks. Provenance complements the gate for natural or tight-domain hubs.
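
The admission-time idea (score a candidate against sentinel queries before insertion, quarantine it if it looks hub-like) can be sketched as below. The scoring rule is an assumption made for illustration, not the paper's definition: a candidate counts as "visible" to a sentinel query when its cosine similarity exceeds that query's current k-th neighbor similarity, and it is quarantined when the visible fraction exceeds a single global threshold. All names and values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hub_score(candidate: Vector, sentinels: List[Vector], kth_sim: List[float]) -> float:
    """Fraction of sentinel queries whose top-k the candidate would enter."""
    hits = sum(1 for q, kth in zip(sentinels, kth_sim) if cosine(candidate, q) > kth)
    return hits / len(sentinels)


def admit(
    candidate: Vector,
    sentinels: List[Vector],
    kth_sim: List[float],
    max_hub_score: float = 0.5,  # global gate threshold (assumed value)
) -> bool:
    """Return True to insert the document, False to quarantine it."""
    return hub_score(candidate, sentinels, kth_sim) <= max_hub_score


# Sentinel queries clustered in a narrow cone, mimicking an anisotropic space.
sentinels = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.3], [0.1, 1.0]]
kth_sim = [0.95, 0.95, 0.95, 0.95]  # current k-th neighbor similarity per query

hub_like = [1.0, 0.2]   # sits inside the cone, close to most sentinel queries
ordinary = [0.0, 1.0]   # close to only one sentinel query
```

Because the decision uses only the candidate and a fixed sentinel set, the check runs once per insertion rather than as a periodic corpus-wide scan, which is the contrast the abstract draws with reverse-kNN defenses.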

13. 【2606.19658】Denoising Implicit Feedback for Cold-start Recommendation

Link: https://arxiv.org/abs/2606.19658

Authors: Gaode Chen,Shicheng Wang,Shikun Li,Rui Huang,Xinghua Zhang,Yunze Luo,Shipeng Li,Shiming Ge,Ruina Sun,Yinjie Jiang,Jun Zhang

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: recommender systems due, denoising implicit feedback, Implicit feedback, presents noisy samples, position bias

Comments: Accepted by KDD 2026 ADS Track

Click to view abstract

Abstract:Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually presents noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To denoise implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, because user preferences for content remain stable, we can infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF's superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on Kuaishou, a billion-user-scale short-video application, and has significantly improved various commercial metrics within cold-start scenarios.

14. 【2606.19646】SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

Link: https://arxiv.org/abs/2606.19646

Authors: Ayush Dwivedi,Qixin Wang,Ashvi Soni,Ruoteng Wang,Han Li,Animesh Mahapatra,Neeraj Agrawal,Xintao Wu

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: chart question answering, lightweight language reasoning, Vision-language models, question answering, VLM

Comments: Demo paper submitted at CIKM 2026. 4 pages, 2 figures

Click to view abstract

Abstract:Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
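
The OCR, text-only answer, router, and optional VLM escalation steps described above can be sketched as a small cascade. This is a hedged illustration, not the SAFE-Cascade code: the four callables are stand-in stubs rather than real Azure, gpt-5-mini, or gemini clients, and the router is assumed to return an escalation probability that is compared with an adjustable threshold.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str
    escalated: bool
    escalation_probability: float


def answer_chart_question(
    image: bytes,
    question: str,
    ocr: Callable[[bytes], str],
    text_model: Callable[[str, str], str],
    router: Callable[[str, str, str], float],
    vlm: Callable[[bytes, str], str],
    threshold: float = 0.5,
) -> CascadeResult:
    chart_text = ocr(image)                        # 1. extract chart text
    text_answer = text_model(chart_text, question)  # 2. provisional answer
    p_escalate = router(chart_text, question, text_answer)  # 3. routing score
    if p_escalate >= threshold:                    # 4. escalate only when needed
        return CascadeResult(vlm(image, question), True, p_escalate)
    return CascadeResult(text_answer, False, p_escalate)


# Stub components, for illustration only.
def stub_ocr(image: bytes) -> str:
    return "Revenue 2023: 42"


def stub_text_model(chart_text: str, question: str) -> str:
    return "42"


def stub_router(chart_text: str, question: str, answer: str) -> float:
    # Toy rule: trust the text answer when it appears verbatim in the OCR text.
    return 0.2 if answer in chart_text else 0.9


def stub_vlm(image: bytes, question: str) -> str:
    return "blue"


kept = answer_chart_question(b"", "What was revenue in 2023?",
                             stub_ocr, stub_text_model, stub_router, stub_vlm)
forced = answer_chart_question(b"", "What color is the tallest bar?",
                               stub_ocr, stub_text_model, stub_router, stub_vlm,
                               threshold=0.1)
```

Lowering the threshold sends more questions to the VLM and raising it sends fewer, which is the accuracy-cost knob the demo exposes. The reported 26.9% reduction in VLM calls follows directly from the 73.1% invocation rate: 100% - 73.1% = 26.9%.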

15. 【2606.19635】Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models

Link: https://arxiv.org/abs/2606.19635

Authors: Xilun Chen, Shao-Chuan Wang, Baykal Cakici, Lukasz Heldt, Lichan Hong, Raghu Keshavan, Aniruddh Nath, Li Wei, Xinyang Xi

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: demonstrated promising capabilities, Large Recommendation Models, industry-scale recommendation tasks, Large Recommendation, demonstrated promising

Comments: 8 pages, 10 figures

Click to view abstract

Abstract:Large Recommendation Models (LRMs) have demonstrated promising capabilities in industry-scale recommendation tasks. However, holistically integrating traditional signals into these transformer-based architectures effectively and efficiently remains a major challenge. Conventional approaches that "textualize" these signals directly or create discrete item representations often lead to excessively long prompts, substantial memory footprints, and high computational overhead. To overcome these limitations, we propose "Token Factory", a framework designed to transform traditional signals into "soft tokens" that can be directly processed by LRMs. This approach enables efficient integration and compression of heterogeneous input features, preventing prompt length explosion while enhancing model performance. We detail the architecture of Token Factory and present experimental results validating its effectiveness in a production-scale recommendation environment.

16. 【2606.19627】VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

Link: https://arxiv.org/abs/2606.19627

Authors: Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: digital commerce landscape, shifting from static, search-driven catalogs, catalogs to dynamic, digital commerce

Comments:

Click to view abstract

Abstract:The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

17. 【2606.19458】MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems

Link: https://arxiv.org/abs/2606.19458

Authors: Oğuzhan Yenen

Categories: Information Retrieval (cs.IR)

Keywords: embedded vector-search kernel, Randomized Hadamard Transform, network connectivity, kernel for edge, server infrastructure

Comments: 27 pages, 11 figures. Code and artifacts: [this https URL](https://github.com/mona-hq/monavec) (PyPI: monavec; crates.io: monavec-core). Zenodo: doi: [https://doi.org/10.5281/zenodo.20559587](https://doi.org/10.5281/zenodo.20559587)

Click to view abstract

Abstract:We present MonaVec, a deterministic, embedded vector-search kernel for edge and offline AI -- settings where server infrastructure, network connectivity, and training data are all unavailable. Existing vector-search systems assume a persistent server, gigabytes of RAM, or a training pass over the corpus; MonaVec instead targets the deployment profile of SQLite: one file, one function call, runs anywhere. Its quantization core is training-free by default and data-oblivious: a Randomized Hadamard Transform (RHDH) conditions any input distribution toward N(0,1), so precomputed Lloyd-Max tables quantize to 4 bits (8x smaller) with no learned codebook and no data pass. The index persists as a single .mvec file whose embedded ChaCha20 rotation seed makes results reproducible across architectures and byte-identical within a build -- a determinism guarantee that parallel-build graph libraries cannot offer. On semantic embeddings (AG News, 45K x 1024-dim BGE-M3, cosine), MonaVec 4-bit BruteForce reaches 0.960 Recall@10 in 27 MB -- leading float32 FAISS-IVF and 8-bit usearch on recall -- while trading peak throughput for byte-identical determinism. A single-pass global standardization (fit()) extends the same data-oblivious pipeline to magnitude-sensitive L2 data, and optional IvfFlat and HNSW backends carry it to million-vector corpora. MonaVec is implemented in pure Rust with Python bindings and runtime SIMD dispatch (AVX-512/AVX2/NEON/scalar). It targets on-device RAG, offline agents, and embedded retrieval -- the niche SQLite occupies for relational data: one file, one call, runs anywhere.
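
The data-oblivious rotation mentioned in this abstract can be illustrated with a small randomized Hadamard transform. This is a sketch under stated assumptions rather than MonaVec's code: the function names are hypothetical, Python's `random.Random` stands in for the ChaCha20-seeded rotation, and the Lloyd-Max quantization step is omitted because its tables are not given here.

```python
import math
import random


def fwht(values):
    """Fast Walsh-Hadamard transform on a copy; length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def randomized_hadamard(vector, seed):
    """Apply seeded random sign flips, then an orthonormal Hadamard rotation."""
    rng = random.Random(seed)  # stand-in for the seeded generator in the abstract
    signs = [rng.choice((-1.0, 1.0)) for _ in vector]
    flipped = [s * x for s, x in zip(signs, vector)]
    scale = 1.0 / math.sqrt(len(vector))  # keeps the Euclidean norm unchanged
    return [scale * y for y in fwht(flipped)]


def compression_ratio(source_bits=32, code_bits=4):
    """Storage ratio when each float32 coordinate becomes a 4-bit code."""
    return source_bits / code_bits
```

Because the rotation is orthonormal and fully determined by the seed, the same input and seed always yield the same output, and a vector whose energy sits in one coordinate is spread evenly across all of them. `compression_ratio()` returns 8.0, consistent with the "8x smaller" figure for 4-bit codes versus float32.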

18. 【2606.19376】Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Link: https://arxiv.org/abs/2606.19376

Authors: Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: large language model, Service Level Agreements, rising infrastructure cost, language model, applications are rapidly

Comments: Preprint. Under review

Click to view abstract

Abstract:Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.

Computer Vision

1. 【2606.20563】JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Link: https://arxiv.org/abs/2606.20563

Authors: Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mesh that reveals, viewing angles, tough challenge, fascinating but tough, Creating

Comments: ECCV 2026. Project page: [this https URL](https://siang1105.github.io/JanusMesh.github.io/)

Click to view abstract

Abstract:Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency.

2. 【2606.20561】TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

Link: https://arxiv.org/abs/2606.20561

Authors: Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Question Answering, Question Answering, Long Video Question, requires identifying sparse, hours-long untrimmed videos

Comments:

Click to view abstract

Abstract:Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer--evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.

3. 【2606.20559】UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Link: https://arxiv.org/abs/2606.20559

Authors: Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: single modality, Egocentric video understanding, Egocentric video, single viewpoint, wearable cameras

Comments:

Click to view abstract

Abstract:Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.

4. 【2606.20556】Thinking in Boxes: 3D Editing in Real Images Made Easy

Link: https://arxiv.org/abs/2606.20556

Authors: Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu, D.A. Forsyth, Anand Bhattad

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interfaces provide weak, motions and camera, large object motions, provide weak, Text

Comments: Project Page: [this https URL](https://thinking-in-boxes.github.io/)

Click to view abstract

Abstract:Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing -- particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages -- on synthetic multi-object scenes and a small set of real-world videos from Objectron -- the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

5. 【2606.20547】The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Link: https://arxiv.org/abs/2606.20547

Authors: Przemyslaw Musialski

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO); Differential Geometry (math.DG)

Keywords: matrix Lie group, Lie group elements, bare matrix Lie, matrix Lie, Lie group

Comments: preprint, 19 pages, 3 figures

Click to view abstract

Abstract:We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
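
The score $s_{ij} = -\|\log(g_i^{-1} g_j)\|^2/\tau$ from this abstract can be worked through on the simplest matrix Lie group, planar rotations SO(2). This is an illustrative sketch, not the paper's code: the helper names are hypothetical, the plain Frobenius norm is assumed in place of the block-weighted norm, and the logarithm is taken on the principal branch, where $\log R(\theta) = \theta J$ with $\|\theta J\|_F^2 = 2\theta^2$.

```python
import math


def rot(theta):
    """2x2 rotation matrix R(theta), an element of SO(2)."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def inverse(r):
    """Inverse of a rotation matrix, which equals its transpose."""
    return ((r[0][0], r[1][0]), (r[0][1], r[1][1]))


def score(g_i, g_j, tau=1.0):
    """Negative squared Frobenius norm of log(g_i^{-1} g_j), divided by tau."""
    rel = matmul(inverse(g_i), g_j)           # relative pose g_i^{-1} g_j
    theta = math.atan2(rel[1][0], rel[0][0])  # principal rotation angle
    return -(2.0 * theta * theta) / tau
```

Multiplying both tokens on the left by the same group element `h` leaves the score unchanged, because `(h g_i)^{-1} (h g_j) = g_i^{-1} g_j`. This is the invariance under the diagonal action that the abstract describes as tautological.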

6. 【2606.20545】Current World Models Lack a Persistent State Core

Link: https://arxiv.org/abs/2606.20545

Authors: Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang, Yinda Chen, Jie Cao, Duyu Tang, Yi Zhang, Yong Dai, Xiaozhu Ju

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: artificial general intelligence, rendering convincing frames, physical world demands, internal world state, decoupled from observation

Comments: 39 pages, 16 figures

Click to view abstract

Abstract:World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

7. 【2606.20543】SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Link: https://arxiv.org/abs/2606.20543

Authors: Shilong Xiang, Zirui Zhang, Lijun Yu, Chengzhi Mao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mirroring language modeling, Spatially Speculative Decoding, mirroring language, language modeling, Autoregressive models excel

Comments:

Click to view abstract

Abstract:Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.

8. 【2606.20542】CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Link: https://arxiv.org/abs/2606.20542

Authors: Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Caltech Tennis Dataset, Caltech Tennis, Tennis Dataset, pose estimation, Caltech

Comments:

Click to view abstract

Abstract:The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, as well as qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.

9. 【2606.20536】The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Link: https://arxiv.org/abs/2606.20536

Authors: Nicolas Dufour, Alexei A. Efros, Patrick Pérez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Frechet Inception Distance, Inception Distance, Frechet Inception, FID, facto arbiter

Comments: Website: [this https URL](https://kyutai.org/fid-lottery)

Click to view abstract

Abstract:The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
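
The recommended protocol in this abstract can be sketched with two small helpers. These are assumptions for illustration only: the function names are hypothetical, the sample standard deviation is one possible choice of spread estimator, and the FID gap is read as a relative gap (difference divided by the mean of the two scores) compared against the ~1.3% CoV quoted above.

```python
import statistics


def coefficient_of_variation(fid_values):
    """Sample standard deviation of FID across training seeds, divided by the mean."""
    mean = statistics.fmean(fid_values)
    return statistics.stdev(fid_values) / mean


def gap_is_inconclusive(fid_a, fid_b, cov_threshold=0.013):
    """Flag a comparison whose relative FID gap falls below the seed-to-seed CoV."""
    relative_gap = abs(fid_a - fid_b) / ((fid_a + fid_b) / 2.0)
    return relative_gap < cov_threshold
```

Under this reading, two models scoring 2.00 and 2.02 differ by about 1% and would be treated as indistinguishable, while 2.00 versus 2.10 (about 4.9%) would count as a real gap. The error bar the abstract asks for could be reported as the mean FID together with `coefficient_of_variation` over several training seeds.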

10. 【2606.20531】VisDom: Sparse Novel View Synthesis with Visible Domain Constraint

Link: https://arxiv.org/abs/2606.20531

Authors: Mariia Gladkova*, Tarun Yenamandra*, Edmond Boyer, Robert Maier, Tony Tung, Daniel Cremers

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ambiguity of recovering, remains challenging due, Gaussian Splatting, view synthesis, sparse settings

Comments:

Click to view abstract

Abstract:Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least $K$ views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22 $\times$ lower training cost.
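
The visible-domain filter described in this abstract can be sketched as a check on candidate 3D points. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `carve` and the two boolean tables are hypothetical, and "observed by a view" is taken to mean that the point projects inside that view's image.

```python
def carve(in_view, in_silhouette, k):
    """Return indices of points kept by silhouette carving plus a K-view filter.

    in_view[v][p]        -- True if point p projects inside the image of view v
    in_silhouette[v][p]  -- True if point p projects inside the silhouette of view v
    k                    -- minimum number of views that must observe a point
    """
    num_views = len(in_view)
    num_points = len(in_view[0]) if num_views else 0
    kept = []
    for p in range(num_points):
        seen_by = [v for v in range(num_views) if in_view[v][p]]
        # Classical visual hull: no observing view may place the point outside its silhouette.
        hull_ok = all(in_silhouette[v][p] for v in seen_by)
        # Visible-domain constraint: the point must be observed by at least k views.
        if hull_ok and len(seen_by) >= k:
            kept.append(p)
    return kept
```

A point seen by only one view can never be carved away by the other views, so plain silhouette consistency keeps it. Requiring `k` observing views removes exactly those weakly constrained regions, which is the case the abstract describes as silhouette-consistent regions extending beyond the true geometry.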

11. 【2606.20527】StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Link: https://arxiv.org/abs/2606.20527

Authors: Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: societally consequential settings, remain poorly understood, judge people remain, people remain poorly, Multimodal large language

Comments: Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

Click to view abstract

12. 【2606.20523】SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Link: https://arxiv.org/abs/2606.20523

Authors: Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: synthetic aperture radar, Multimodal foundation models, Ground Range Detected, SAR, remain limited

Comments:

Click to view abstract

Abstract:Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20cm-2m native resolution), we standardize all SAR data to an 80cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub.

13. 【2606.20521】HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Link: https://arxiv.org/abs/2606.20521

Authors: Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied foundation models, large language models, tighter data bottleneck, Embodied foundation, egocentric human video

Comments: Github: [this https URL](https://github.com/DAGroup-PKU/HumanNet/)

Click to view abstract

Abstract:Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

14. 【2606.20515】S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Link: https://arxiv.org/abs/2606.20515

Authors: Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-world spatial intelligence, largely remain tied, isolated visual observations, intelligence requires reasoning, tool-augmented agents largely

Comments: Project Page: [this https URL](https://Ropedia.github.io/S-Agent)

Click to view abstract

Abstract:Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on the S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

15. 【2606.20506】FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Link: https://arxiv.org/abs/2606.20506

Authors: Jinghong Lan,Wei Cheng,Yunuo Chen,Ziqi Ye,Peng Xing,Yixiao Fang,Rui Wang,Yufeng Yang,Xuanyang Zhang,Xianfang Zeng,Difan Zou,Gang Yu,Chi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Style-content dual-reference generation

Comments: 35 pages, 26 figures. Project page: [this https URL](https://github.com/Blue2Giant/FreeStyle)

Abstract:Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage rejection. Experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.

16. 【2606.20491】Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

Link: https://arxiv.org/abs/2606.20491

Authors: Fatma Youssef Mohammed,Grzegorz Malczyk,Kostas Alexis

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficiently process scenes, Liquid Neural Networks, leverages Liquid Neural, process scenes, existing predictive models

Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract:Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.

17. 【2606.20488】How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

Link: https://arxiv.org/abs/2606.20488

Authors: Jingwen Zhou,Mingzhe Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: AI-generated images promise, images promise generator-agnostic, promise generator-agnostic deployment, single controlled protocol, classifier training

Comments:

Abstract:Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores -- an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) -- plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet → VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to > 0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.
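
To make the "score direction" finding concrete, here is a minimal sketch of what an inverted score means for AUROC. It is not the paper's evaluation code: the `auroc` helper and the toy score lists are hypothetical, and the convention that fake images should score higher than real ones is an assumption. An inverted score yields an AUROC below 0.5, and negating it yields exactly one minus that value, which is why a detector's direction must be fixed before scores are compared or fused.

```python
def auroc(scores_fake, scores_real):
    """Probability that a random fake image scores higher than a random
    real one (ties count as half). Assumes 'higher = more likely fake'."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


# Hypothetical scores, for illustration only: here the fakes mostly score
# LOWER than the reals, i.e. the score direction is inverted.
fake = [0.2, 0.3, 0.1, 0.4]
real = [0.5, 0.6, 0.35, 0.7]

inverted = auroc(fake, real)
# Flipping the sign of every score flips the direction of the detector.
corrected = auroc([-s for s in fake], [-s for s in real])

print(inverted, corrected)
```

With no ties, the two values always sum to 1, so an AUROC far below 0.5 indicates a usable detector that is being read in the wrong direction rather than a random one.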

18. 【2606.20477】Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

Link: https://arxiv.org/abs/2606.20477

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: train visually grounded, visually grounded vision-language, manual spatial annotations, grounded vision-language models, train visually

Comments: Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

19. 【2606.20455】PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

Link: https://arxiv.org/abs/2606.20455

Authors: Haoyuan Shen,Kuihao Wang,Ruisheng Wang,Yujun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing, task in photogrammetry, computer vision, fundamental task, footprint extraction

Comments: 14 pages, 9 figures

Abstract:Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m × 128 m, with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at this https URL.

20. 【2606.20449】InfantFace: Detecting infant faces in neonatal clinical environments

Link: https://arxiv.org/abs/2606.20449

Authors: Abdullah Bin-Obaid,Maria M. Cobo,Rebeccah Slater,Lionel Tarassenko,Mauricio Villarroel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: facial expression analysis, cardiorespiratory signal extraction, video-camera based non-contact, distress related facial, related facial expression

Comments: 32 pages, 7 figures, 4 tables; supplementary information included

Abstract:Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.

21. 【2606.20419】Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

Link: https://arxiv.org/abs/2606.20419

Authors: Karn Tiwari,Varnith Chordia,Prathosh A P

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visually unsupported descriptions, mentioning objects absent, Vision-language models, Product Steering, unsupported descriptions

Comments: Under Review

Abstract:Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of $4.0\%$, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
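
The symmetric/antisymmetric split mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the 2×2 weights `W_q` and `W_k` are hypothetical toy values, and the convention that the pre-softmax logit for tokens i and j is x_i^T M x_j with M = W_q^T W_k is an assumption. The sketch covers only the decomposition M = S + A; the singular-mode suppression and the query-only update described in the abstract are not reproduced here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def split_sym_antisym(m):
    """Return (S, A) with S = (M + M^T)/2 symmetric, A = (M - M^T)/2
    antisymmetric, so that M = S + A."""
    mt = transpose(m)
    n = len(m)
    sym = [[0.5 * (m[i][j] + mt[i][j]) for j in range(n)] for i in range(n)]
    anti = [[0.5 * (m[i][j] - mt[i][j]) for j in range(n)] for i in range(n)]
    return sym, anti


# Hypothetical per-head projection weights (toy values, assumption).
W_q = [[1.0, 0.0], [2.0, 1.0]]
W_k = [[1.0, 2.0], [0.0, 3.0]]

# Query-key product under the assumed convention q = W_q x, k = W_k x.
M = matmul(transpose(W_q), W_k)
S, A = split_sym_antisym(M)

print(M, S, A)
```

The symmetric part S contributes equally to the logit from i to j and from j to i, while the antisymmetric part A changes sign when the two tokens are swapped, which matches the abstract's distinction between mutual and directional attention patterns.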

22. 【2606.20416】On the Redundancy of Timestep Embeddings in Diffusion Models

Link: https://arxiv.org/abs/2606.20416

Authors: José A. Chávez

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: models rely heavily, Diffusion models rely, rely heavily, modulate the denoising, denoising process

Comments: 17 pages

Abstract:Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.

23. 【2606.20404】FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

Link: https://arxiv.org/abs/2606.20404

Authors: Daniel Gilo,Sven Elflein,Ido Sobol,Or Litany

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flow models routinely, models routinely fail, Conditional diffusion, define their task, diffusion and flow

Comments: Project page: [this https URL](https://flow-bender.github.io/)

Abstract:Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator--the depth predictor defining the constraint--is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: this https URL

24. 【2606.20390】Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification

Link: https://arxiv.org/abs/2606.20390

Authors: Muhammad Azeem,Tanveer Hussain,Amr Ahmed,Ardhendu Behera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated skin cancer, strong intra-class variability, dermoscopic images remains, images remains challenging, remains challenging due

Comments: Accepted at MICCAI 2026

Abstract:Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, and a final graph-level embedding for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.

25. 【2606.20312】Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

Link: https://arxiv.org/abs/2606.20312

Authors: Ning Dong,Yingna Su,Xin Dong,Ziyun Jiao,Xinnian Guo,Zhuangzhuang Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video anomaly detectors, tracked skeleton windows, Pose-flow video anomaly, provide likelihood-based rankings, video anomaly

Comments: 15 pages, 5 figures, 7 tables. Code available at [this https URL](https://github.com/iNing10/RPC)

Abstract:Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
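
The scoring rule described in the abstract (standardized flow score plus a confidence-gated, standardized nearest-prototype deviation) can be sketched as follows. This is only an illustration under stated assumptions, not the released RPC code: the function names, the use of mean keypoint confidence as the gate, Euclidean distance to prototypes, and the convention that higher scores are more anomalous are all hypothetical choices.

```python
import math
from statistics import mean, pstdev


def zscore(x, reference):
    """Standardize x against reference values (e.g. scores on normal data)."""
    sd = pstdev(reference)
    return (x - mean(reference)) / sd if sd > 0 else 0.0


def nearest_prototype_distance(latent, prototypes):
    """Euclidean distance from a latent vector to its closest prototype."""
    return min(math.dist(latent, p) for p in prototypes)


def rpc_score(flow_score, latent, keypoint_conf, prototypes, ref_flow, ref_dist):
    """Calibrated score = z(flow score) + gate * z(prototype deviation).

    Assumptions: higher flow_score means more anomalous, and the gate is
    simply the mean keypoint confidence in [0, 1].
    """
    gate = mean(keypoint_conf)
    deviation = nearest_prototype_distance(latent, prototypes)
    return zscore(flow_score, ref_flow) + gate * zscore(deviation, ref_dist)


# Hypothetical toy values for illustration.
prototypes = [(0.0, 0.0), (1.0, 1.0)]   # normal-behavior modes in latent space
ref_flow = [1.0, 2.0, 3.0]              # flow scores observed on normal data
ref_dist = [0.1, 0.2, 0.3]              # prototype distances on normal data

trusted = rpc_score(2.0, (5.0, 5.0), [1.0, 1.0], prototypes, ref_flow, ref_dist)
untrusted = rpc_score(2.0, (5.0, 5.0), [0.0, 0.0], prototypes, ref_flow, ref_dist)
print(trusted, untrusted)
```

When keypoint confidence is zero the gate closes and the score falls back to the standardized flow score, which mirrors the abstract's statement that confidence is used only to gate the added geometric evidence.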

26. 【2606.20310】Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Link: https://arxiv.org/abs/2606.20310

Authors: Haoxuan Wu,Lai Man Po,Mengyang Liu,Kun Li,Hongzheng Yang,Wei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VAE decoding costs, incurs massive VAE, massive VAE decoding, pixel-based reward models, Evaluating video generation

Comments:

Abstract:Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-$N$ sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.

27. 【2606.20303】GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI

Link: https://arxiv.org/abs/2606.20303

Authors: Julia Alekseenko,Pietro Mascagni,AI4SafeChole Consortium,Nicolas Padoy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables collaborative model, collaborative model training, Federated Learning, sharing sensitive data, video AI enables

Comments:

Abstract:Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.

28. 【2606.20302】CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection

Link: https://arxiv.org/abs/2606.20302

Authors: Giovanni Affatato,Sara Mandelli,Edoardo Daniele Cannas,Paolo Bestagini,Stefano Tubaro

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-profile individual, democracies and societies, POI, modern deepfake detectors, POI video deepfake

Comments:

Abstract:Deepfakes targeting a high-profile individual, known as Person-of-Interest (POI), are a threat to modern democracies and societies. Current POI deepfake detection methods still struggle to combine robustness to post-processing, efficiency and interpretability, focal aspects of modern deepfake detectors. In this paper we propose CUPID, a POI video deepfake detector that combines UV texture maps, a facial appearance representation derived from 3D face reconstructions, with the representation learning capabilities of the Masked Autoencoder (MAE). Our method does not require any deepfake videos in its training phase. Moreover, it does not even require to include a specific POI in the training set: the combination of UV texture maps extracted from real video frames and the MAE context-guided reconstruction yields a latent space that captures rich and discriminative facial features also for identities unseen during training. In the testing phase, the embeddings extracted from a query video depicting the POI can be matched against pristine reference videos to assess the video authenticity. Furthermore, operating in the UV space naturally provides an additional layer of interpretability. Specifically, we can extract decoded residual maps that highlight which facial regions of a test video deviate most from the identity representation of the corresponding POI. Experiments on four deepfake datasets show that CUPID outperforms current state of the art on most datasets and achieves the best overall robustness against strong downscaling and compression, providing also substantially faster inference. Our experimental code will be released at this https URL.

29. 【2606.20300】CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Link: https://arxiv.org/abs/2606.20300

Authors: Junhao Cai,Deyu Zeng,Junhao Pang,Junyu Chen,Qiwei Liang,Xiaopin Zhong,Zongze Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-shot anomaly detection, remains challenging due, detection remains challenging, anomaly detection remains, Few-shot anomaly

Comments: Accepted to ECCV 2026!

Abstract:Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.

30. 【2606.20291】Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

Link: https://arxiv.org/abs/2606.20291

Authors: Luke J. Zachmann,David D. Diaz,Vincent A. Landau,Chelsey Walden-Schreiner,Tony Chang,Nathan E. Rutenbeck,Katharyn A. Duffy,Kiarie Ndegwa,Andreas Gros,Scott Conway,Guy Bayes

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deliver actionable science, Remote sensing, wildfire risk management, sensing is increasingly, increasingly relied

Comments:

Abstract:Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.

31. 【2606.20282】U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection

Link: https://arxiv.org/abs/2606.20282

Authors: Junhui Li, Jialu Li, Youshan Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering significant advantages, modeling long sequences, salient object detection, Mamba-based models, offering significant

Comments: 6 pages, 2 figures

Abstract:Mamba-based models have emerged as a promising alternative for salient object detection (SOD), offering significant advantages in modeling long sequences. However, existing models often fail to explore contextual information and the depth of the entire architecture. This paper introduces U$^2$Mamba, a powerful and innovative U-structured network for salient object detection. We propose multiscale Mamba U-blocks (MMUBs) that enhance the model depth to improve local feature extraction capabilities. Our newly developed nested U-structure, incorporating MMUBs, enables the network to integrate various receptive fields from shallow and deep layers, thereby collecting richer contextual information and longer-range data without being constrained by resolution. Instead of using the traditional deep supervision scheme and top-level supervised training, we propose a hierarchical training supervision method where the loss is computed at each level during the training process. Extensive experiments demonstrate that U$^2$Mamba achieves highly competitive performance against state-of-the-art methods. The source code is available at this https URL.

32. 【2606.20272】Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

Link: https://arxiv.org/abs/2606.20272

Authors: Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision models, driving factor, potential use case, case scenarios, scenarios of cognitive

Comments: Accepted and best paper award at MHI-Kolloquium 2024

Abstract:AI vision models are a driving factor for the potential use-case scenarios of cognitive robotics in industry and household applications. A large array of methods, from semantic environment analysis to 6D and grasping pose estimation, has been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods w.r.t. training data and AI architectures that can work in synergy to tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that challenge them. Furthermore, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking the two in training data generation.

33. 【2606.20250】Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

Link: https://arxiv.org/abs/2606.20250

Authors: Duc T. Nguyen, Hoang-Long Nguyen, Thanh-Ha DO, Huy-Hieu Pham

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing weakly supervised, fully supervised retraining, Existing weakly, offline pseudo-mask refinement, supervised semantic segmentation

Comments: Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

Abstract:Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: this https URL

34. 【2606.20244】SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Link: https://arxiv.org/abs/2606.20244

Authors: Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-language models, evidence intensive tasks, decisive visual evidence, easy to overlook, leading to failures

Comments:

Abstract:Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: this https URL
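
The answer-span prediction entropy mentioned in this abstract can be illustrated as the mean Shannon entropy of the next-token distributions over the answer tokens. The following is a minimal sketch under that assumption, using toy probability vectors. The function names `token_entropy` and `answer_span_entropy` are hypothetical and do not come from the paper, and SPOT-E's low-entropy anchors and entropy-shaping objective are not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_span_entropy(span_probs):
    """Mean token entropy over the answer span (illustrative definition)."""
    if not span_probs:
        raise ValueError("answer span is empty")
    return sum(token_entropy(p) for p in span_probs) / len(span_probs)


# Toy distributions over a 4-token vocabulary (made-up numbers).
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

print(answer_span_entropy(confident))
print(answer_span_entropy(uncertain))
```

As the abstract notes, a low value alone does not show whether the confidence is grounded in visual evidence, which is why the paper adds further constraints on top of plain entropy minimization.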

35. 【2606.20241】BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

Link: https://arxiv.org/abs/2606.20241

Authors: Thomas Klassert, Adrian Ulges, Biying Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative artificial intelligence, Generative artificial, creative content, artificial intelligence, potential to improve

Comments: Accepted at the IEEE Winter Conference on Applications of Computer Vision, WACV 2026

Abstract:Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the "Battle-Arena for Fair Image Synthesis" (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics in partial correlation with subjective user ratings. Thus, our research emphasizes the need for including human preferences to develop fairer and more inclusive text-to-image models.

36. 【2606.20233】Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Link: https://arxiv.org/abs/2606.20233

Authors: Tianyi Xiang, Mingming He, Li Ma, Jing Liao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cinematic compositing aims, integrate green-screen characters, photometric realism, Cinematic compositing, aims to integrate

Comments:

Abstract:Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.

37. 【2606.20223】DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

Link: https://arxiv.org/abs/2606.20223

Authors: Hugo Magaldi, Theau d'Audiffret, Etienne Francois Akomo-Okoue, Bala Amarasekaran, Naomi Anderson, Claire Auger, Noemie Cappelle, Daniel Cornelis, Raphael Cornette, Tobias Deschner, Gabriel Dubus, Davy Fonteyn, Rosa M. Garriga, Jennifer Hatlauf, Innocent Kasekendi, Raymond Katumba, Aram Kazandjian, Alfred Ngomanda, Stephan Ntie, Simone Pika, Xavier Rufray, Harold Rugonge, John Justice Tibesigwa, Peter van Lunteren, Hadrien Vanthomme, Joeri A. Zwerts, Sabrina Krief

Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: African tropical forests, tropical forests increasingly, forests increasingly extends, African forest camera-trap, African tropical

Comments: Accepted at ICPR 2026 - Computer Vision for Biodiversity Monitoring and Conservation Workshop

Abstract:Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.
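
The three validation metrics reported in this abstract (accuracy, macro-F1, balanced accuracy) differ in how they weight rare classes. The sketch below computes all three from label sequences, assuming macro-F1 and balanced accuracy are averaged over the classes present in the ground truth. The function name `classification_metrics` and the toy labels are illustrative only, and the paper's evaluation code may use a different averaging convention.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro-F1, balanced accuracy) for two label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(y_true))  # classes that occur in the ground truth
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # missed an instance of the true class
            fp[p] += 1  # wrongly predicted class p
    accuracy = sum(tp.values()) / len(y_true)
    f1_scores, recalls = [], []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
        recalls.append(tp[c] / (tp[c] + fn[c]))
    macro_f1 = sum(f1_scores) / len(classes)
    balanced_accuracy = sum(recalls) / len(classes)  # mean per-class recall
    return accuracy, macro_f1, balanced_accuracy


# Toy example: the rare class "chimp" is always missed.
y_true = ["duiker", "duiker", "duiker", "duiker", "chimp", "blank"]
y_pred = ["duiker", "duiker", "duiker", "duiker", "duiker", "blank"]
acc, macro_f1, bal_acc = classification_metrics(y_true, y_pred)
print(acc, macro_f1, bal_acc)
```

In the toy example, accuracy stays high because the frequent class dominates, while macro-F1 and balanced accuracy drop because the missed rare class counts as much as any other class.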

38. 【2606.20199】Evaluation of Image Matching for Art Skills Assessment

Link: https://arxiv.org/abs/2606.20199

Authors: Asaad Alghamdi, Michael Poor, Trung-Nghia Le, Tam V. Nguyen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires dedicated training, training and practice, individuals possess, possess a natural, natural talent

Comments: MAPR 2024

Abstract:While some individuals possess a natural talent for drawing, mastering this skill requires dedicated training and practice. Determining one's skill in the art of drawing requires proper comprehensive assessment. In this paper, we propose a method to measure drawing skill by matching the hand-drawn image with the original template. Existing techniques often involve complex processes. However, advancements in computer vision allow us to train computers to perform these comparisons at a human-like level, thereby resolving the tedious and overwhelming traditional process. Using computer vision applications, determining image similarity involves identifying the level of similarity between an image and a reference image. We have implemented and analyzed the SIFT feature and Siamese network to measure image similarity. Our results indicate that it is feasible to assess art skill levels. Through feature analysis, we found that SIFT-based key point matching provides a more effective means of detecting drawing skills.

39. 【2606.20196】Distill Once, Adapt Life-Long: Exploring Dataset Distillation for Continual Test-Time Adaptation

Link: https://arxiv.org/abs/2606.20196

Authors: Hyun-Kurl Jang, Jihun Kim, Hyeokjun Kweon, Kuk-Jin Yoon

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continual Test-Time Adaptation, maintain model performance, Continual Test-Time, evolving target domains, aims to maintain

Comments: ECCV 2026

Abstract:Continual Test-Time Adaptation (CTTA) aims to maintain model performance under evolving target domains by adapting online without labeled data. However, practical deployments often cannot retain the source dataset due to privacy or licensing constraints, and purely source-free CTTA methods tend to become unstable under long-term distribution shift, suffering from compounding self-training errors and catastrophic forgetting. We introduce DO-ALL (Distill Once, Adapt Life-Long), a plug-and-play framework that revisits source information in a compact and privacy-conscious form via Dataset Distillation (DD). Before deployment, DO-ALL performs DD to produce a small set of synthetic distilled anchors that summarize the source distribution. During adaptation, each target sample is matched with its most semantically aligned anchor, which provides a stable reference for various CTTA via source replay, representation alignment, and manifold-smoothing regularization. DO-ALL can be seamlessly integrated into existing CTTA algorithms, consistently improving long-term robustness across CIFAR100-C, ImageNet-C, and the CCC benchmark. This demonstrates the potential of leveraging DD to enable stable and continuous adaptation without retaining raw source data. The code is available at this https URL.

40. 【2606.20189】HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

Link: https://arxiv.org/abs/2606.20189

Authors: Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Leveraging Vision Foundation, Vision Foundation Models, Leveraging Vision, Vision Foundation, annotated data needed

Comments: Accepted to ECCV 2026. Maciej and Jesper contributed equally

Abstract:Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: this https URL.

41. 【2606.20177】Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

Link: https://arxiv.org/abs/2606.20177

Authors: Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Remote Sensing, Large Language

Comments: ECCV 2026 Accepted

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

42. 【2606.20161】ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

Link: https://arxiv.org/abs/2606.20161

Authors: Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He, Guanyu Yang, Yutong Xie

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video polyp segmentation, including weak annotations, supervised video polyp, temporally consistent masks, densely labeled frames

Comments:

Abstract:Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at this https URL.

43. 【2606.20155】NAMESAKES: Probing Identity Memorization in Text-to-Image Models

Link: https://arxiv.org/abs/2606.20155

Authors: Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: raising privacy concerns, generate realistic likenesses, models generate realistic, raising privacy, privacy concerns

Comments:

44. 【2606.20143】HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

Link: https://arxiv.org/abs/2606.20143

Authors: Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: effective radiotherapy planning, accurate tumor delineation, global health burden, Head and neck, significant global health

Comments: 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: [this https URL](https://hecktor.grand-challenge.org/)

Abstract:Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
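
Two of the headline metrics in this abstract, the Dice similarity coefficient and the concordance index, have standard textbook definitions that can be sketched in a few lines. The code below is an illustration under those generic definitions (plain per-mask Dice and Harrell's C-index). The function names and toy inputs are hypothetical, and the challenge's official evaluation may aggregate or handle ties differently.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has the shorter follow-up time
    and an observed event (events[i] is truthy). Risk ties count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy inputs (made-up numbers).
dice = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
cindex = concordance_index(
    times=[2, 4, 6, 8], events=[1, 1, 0, 1], risks=[0.9, 0.4, 0.6, 0.1]
)
print(dice, cindex)
```

A C-index of 0.5 corresponds to random ranking, which puts the reported 0.66 for survival prediction in context.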

45. 【2606.20140】SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Link: https://arxiv.org/abs/2606.20140

Authors: Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent online video, achieved impressive results, Recent online, methods have achieved, video instance segmentation

Comments:

Abstract:Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a VIS training setup is expensive in terms of both compute and the dense annotations required. To address these major flaws, we argue that effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), and an improvement of over 1% in AP over the state of the art in a limited-annotation scenario.

46. 【2606.20131】TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

Link: https://arxiv.org/abs/2606.20131

Authors: Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: signed distance fields, producing compact, input geometry conditions, triangle topology directly, generative approach

Comments:

Abstract:We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
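
The Chamfer Distance cited in this abstract measures how closely two point sets (for example, points sampled from the output and input surfaces) match. The sketch below uses one common convention, the sum over both directions of the mean squared nearest-neighbor distance. This convention is an assumption, since papers differ on squared versus unsquared distances and on sum versus mean, and the function name `chamfer_distance` is illustrative.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Uses mean squared nearest-neighbor distance in each direction and sums the
    two directions. Brute force, O(len(a) * len(b)), fine for small examples.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy example: the second set moves one point by 0.5 along z.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
print(chamfer_distance(set_a, set_b))
```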

47. 【2606.20130】SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

Link: https://arxiv.org/abs/2606.20130

Authors: Xuesong Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Semantic Segmentation Challenge, Fine-Grained Semantic Segmentation, Segmentation Challenge, Semantic Segmentation, Fine-Grained Semantic

Comments: 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge

Abstract:We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

48. 【2606.20115】When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage

Link: https://arxiv.org/abs/2606.20115

Authors: Nafis Fuad Shahid

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conformal risk control, Conformal risk, segmentation quality, quality by calibrating, CRC

Comments: 9 pages, 3 figures, 2 tables. Submitted to the DeCaF Workshop at MICCAI 2026

Abstract:Conformal risk control (CRC) provides distribution-free guarantees on segmentation quality by calibrating a prediction-set threshold on held-out data. In federated deployments, the standard approach pools calibration scores across sites into a single threshold. We provide the first quantification, on real multi-institutional brain tumor data (FeTS-2022, 1,251 subjects, 20 institutions), showing that this naive pooled CRC protects the average hospital but violates coverage at 40% of individual institutions, with the worst site exceeding the target false-negative rate by 7.8 percentage points. The naive alternative, per-site local CRC, largely restores coverage but inflates prediction sets by 83x, rendering them clinically useless. We propose a shrinkage-based federated CRC protocol: each site transmits only its empirical risk curve (G scalars) to a server, which computes a shrinkage-regularized threshold per site. A single hyperparameter n0 smoothly trades worst-case coverage for prediction-set efficiency; leave-one-site-out sensitivity analysis identifies n0=19, achieving 2.7/20 violations at 2.0x stretch. We further show that direct Lagrangian optimization of coverage budgets fails, concentrating risk on vulnerable hospitals, and that the finite-sample correction term is essential: removing it triples violations. The marginal CRC guarantee is preserved by construction under the stated site-mixture assumption; per-site coverage is validated across four targets with three seeds. No patient-level images, masks, or per-volume scores leave any site.
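
The trade-off between pooled, local, and shrinkage-based thresholds described in this abstract can be illustrated with the standard CRC rule, which picks the smallest threshold λ whose finite-sample-corrected empirical risk, n/(n+1) · R̂(λ) + B/(n+1), is at most the target α. The shrinkage step below is an assumption made for illustration: it blends the site curve with the pooled curve using weight n_site / (n_site + n0) and applies the correction with the site's own sample size. The paper's exact estimator is not given in the abstract, and all names and numbers are hypothetical.

```python
def crc_threshold(risk_curve, lambdas, n, alpha, bound=1.0):
    """Smallest lambda whose corrected empirical risk is at most alpha.

    risk_curve[k] is the empirical risk at lambdas[k]. lambdas must be
    ascending and the risk non-increasing in lambda, as CRC assumes.
    """
    for lam, risk in zip(lambdas, risk_curve):
        if (n / (n + 1)) * risk + bound / (n + 1) <= alpha:
            return lam
    return lambdas[-1]  # fall back to the most conservative threshold


def shrink_curve(site_curve, pooled_curve, n_site, n0):
    """Blend a site's risk curve with the pooled curve (assumed shrinkage form)."""
    w = n_site / (n_site + n0)  # small sites lean more on the pooled curve
    return [w * s + (1 - w) * p for s, p in zip(site_curve, pooled_curve)]


# Made-up risk curves on a shared grid of thresholds.
lambdas = [0.1, 0.2, 0.3, 0.4, 0.5]
pooled = [0.30, 0.15, 0.08, 0.04, 0.01]  # estimated from 400 pooled samples
site = [0.40, 0.25, 0.16, 0.06, 0.02]    # a small, harder site with 20 samples
alpha = 0.10

pooled_lam = crc_threshold(pooled, lambdas, n=400, alpha=alpha)
local_lam = crc_threshold(site, lambdas, n=20, alpha=alpha)
shrunk_lam = crc_threshold(shrink_curve(site, pooled, 20, 20), lambdas, n=20, alpha=alpha)
print(pooled_lam, shrunk_lam, local_lam)
```

In this toy setting the shrunk threshold lands between the pooled one, which is too permissive for the hard site, and the local one, which is the most conservative.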

49. 【2606.20112】Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

Link: https://arxiv.org/abs/2606.20112

Authors: Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: fine details remains, details remains challenging, remains challenging due, substantial computational demands, Generating high-resolution

Comments: Accepted at ICLR 2026. Code available at [this https URL](https://github.com/Fredy-Zhang/PRDiT)

Abstract:Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

50. 【2606.20110】FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

Link: https://arxiv.org/abs/2606.20110

Authors: Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scalable scene generation, promise scalable scene, Synthetic data, promise scalable, Synthetic

Comments: Accepted to ECCV 2026

Abstract:Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive-augmented data significantly improves AD models' performance, especially at night and in rain, demonstrating stronger robustness when trained with our scenario-targeted data.

51. 【2606.20108】EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

Link: https://arxiv.org/abs/2606.20108

Authors: Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Image quality control, control is vital, wide range, range of downstream, Deep learning-based image

Comments: Accepted in MIDL 2026. Code: [this https URL](https://github.com/penway/EFIQA)

Abstract:Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

52. 【2606.20103】Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

Link: https://arxiv.org/abs/2606.20103

Authors: Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robust multi-modal perception, Accurate LiDAR-camera calibration, Accurate LiDAR-camera, multi-modal perception, essential for robust

Comments: Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

Abstract:Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

53. 【2606.20100】WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

Link: https://arxiv.org/abs/2606.20100

Authors: Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni, Hongrui Li, Jingjing Li, Jing Lyu, Chen Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesizing highly realistic, highly realistic images, demonstrated remarkable capabilities, demonstrated remarkable, synthesizing highly

Abstract:Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.

54. 【2606.20095】Stitching and dimensionality effects on large artificially generated volume datasets

Link: https://arxiv.org/abs/2606.20095

Authors: Lucas von Chamier, Jan Philipp Albrecht, Dagmar Kainmüller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hardware memory limitations, Generating large images, patching input data, accommodate hardware memory, assembling output patches

Abstract:Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.
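
The patch-and-stitch process described above can be sketched on a 1D signal. This is a minimal illustration, assuming overlapping patches that are blended by plain averaging. The abstract does not say which three stitching approaches were compared, so the averaging rule and the function names here are assumptions.

```python
def split_patches(signal, patch, stride):
    """Cut a signal into overlapping patches as (start, values) pairs.

    Assumes len(signal) >= patch.
    """
    starts = list(range(0, len(signal) - patch + 1, stride))
    if starts[-1] + patch < len(signal):
        # Add a final patch flush with the end so every sample is covered.
        starts.append(len(signal) - patch)
    return [(s, signal[s:s + patch]) for s in starts]


def stitch_average(patches, length):
    """Reassemble patches, averaging wherever patches overlap."""
    total = [0.0] * length
    count = [0] * length
    for start, values in patches:
        for offset, value in enumerate(values):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


signal = [float(i) for i in range(10)]
patches = split_patches(signal, patch=4, stride=3)
# A real pipeline would run the generative model on each patch here.
restored = stitch_average(patches, len(signal))
```

With an identity "model" the stitched output equals the input. Seams appear only when neighboring patch outputs disagree in the overlap, which is where the choice of blending rule starts to matter.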

55. 【2606.20094】MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

Link: https://arxiv.org/abs/2606.20094

Authors: Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: fun augmented reality, online makeup shopping, models enable fun, enable fun augmented, Makeup transfer

Abstract:Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by 60% and reduces relative skin tone difference by 50% over Stable-Makeup, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

56. 【2606.20092】EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Link: https://arxiv.org/abs/2606.20092

Authors: Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-horizon robotic manipulation, policies often fail, remains a critical, long-horizon robotic, fail when task-relevant

Abstract:Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
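
The idea of storing only sparse, task-critical frames can be sketched as a small bounded buffer. This is a minimal illustration, assuming a predicted keyframe probability is already available per frame. The threshold, the capacity, and the evict-the-lowest-probability rule are hypothetical choices, not details taken from the paper.

```python
def update_memory(memory, frame_id, keyframe_prob, threshold=0.5, capacity=3):
    """Keep a frame only if its predicted keyframe probability is high enough.

    When the buffer overflows, drop the stored frame with the lowest
    probability (an assumed eviction rule).
    """
    if keyframe_prob >= threshold:
        memory.append((frame_id, keyframe_prob))
        if len(memory) > capacity:
            memory.remove(min(memory, key=lambda item: item[1]))
    return memory


# Hypothetical per-frame keyframe probabilities from the policy.
probs = [0.1, 0.9, 0.6, 0.2, 0.8, 0.7, 0.95]
memory = []
for frame_id, prob in enumerate(probs):
    update_memory(memory, frame_id, prob)

kept_ids = [frame_id for frame_id, _ in memory]
```

Most frames never enter the buffer, in contrast to the unselective buffers the abstract criticizes for accumulating redundant visual history.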

57. 【2606.20083】Holo-World: Unified Camera, Object and Weather Control for Video World Model

Link: https://arxiv.org/abs/2606.20083

Authors: Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: weather, moving toward preserving, preserving an observed, motion while allowing, allowing its environmental

Comments: Project Page: [this https URL](https://xiangchenyin.github.io/Holo-World) Code: [this https URL](https://github.com/XiangchenYin/Holo-World)

Abstract:Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World.
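
One way to read "guides scene and weather residuals separately" is as a two-scale variant of classifier-free guidance. The sketch below assumes three noise predictions (unconditional, scene-only, scene plus weather) and two guidance scales. This decomposition, the function name, and the toy vectors are assumptions for illustration, since the abstract does not give the exact formula.

```python
def decomposed_cfg(eps_uncond, eps_scene, eps_full, w_scene, w_weather):
    """Combine noise predictions with separate scene and weather scales.

    scene residual   = eps_scene - eps_uncond
    weather residual = eps_full  - eps_scene
    """
    return [
        u + w_scene * (s - u) + w_weather * (f - s)
        for u, s, f in zip(eps_uncond, eps_scene, eps_full)
    ]


# Toy 2-element "noise predictions" (hypothetical values).
eps_uncond = [0.0, 1.0]
eps_scene = [1.0, 1.0]
eps_full = [1.5, 0.0]

plain = decomposed_cfg(eps_uncond, eps_scene, eps_full, 1.0, 1.0)
boosted = decomposed_cfg(eps_uncond, eps_scene, eps_full, 1.0, 2.0)
```

With both scales at 1 the result collapses to the fully conditioned prediction. Raising only `w_weather` amplifies the weather residual while leaving the scene term untouched, which matches the stated goal of strengthening weather effects without over-amplifying the full condition.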

58. 【2606.20077】The Hidden Evolution of Disguised Visual Context inside the VLM

Link: https://arxiv.org/abs/2606.20077

Authors: Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, enter Large Language, tokens enter Large, Language Models, Large Language

Abstract:Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: visual tokens are either treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. A controlled comparison, and an understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM, remain underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

59. 【2606.20076】Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Link: https://arxiv.org/abs/2606.20076

Authors: Dong Hoon Lee, Seunghoon Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: fixed compression ratio, visual synthesis, tokenizer fixed compression, Latent Diffusion Models, dominant in visual

Abstract:Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256$\times$256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at this https URL.
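
The contrast between data-dependent and data-independent merging can be sketched with a fixed assignment table. This is a minimal illustration, assuming a merge pattern that maps each token position to a group and is shared by every image. In the paper the pattern is learned, whereas here it is simply hard-coded, and all names are hypothetical.

```python
def merge_tokens(tokens, assignment, num_groups):
    """Average tokens that share a group in a fixed, global assignment.

    `assignment[i]` is the group of token position i and does not depend
    on token values. Assumes every group receives at least one token.
    """
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for token, group in zip(tokens, assignment):
        counts[group] += 1
        for d in range(dim):
            sums[group][d] += token[d]
    return [[value / counts[g] for value in sums[g]] for g in range(num_groups)]


# Four tokens reduced to two; the same pattern applies to any image.
global_assignment = [0, 0, 1, 1]
tokens = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0], [20.0, 0.0]]
merged = merge_tokens(tokens, global_assignment, num_groups=2)
```

Because the pattern depends only on position, a generator knows it before any token exists. That is the property the abstract says data-dependent merging lacks at generation time.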

60. 【2606.20045】See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

Link: https://arxiv.org/abs/2606.20045

Authors: Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: long-range target discovery, final target approach, UAV Vision-Language Navigation, evaluated jointly, typically formulated

Comments: 12 pages, 7 figures

Abstract:UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at this https URL.

61. 【2606.20044】FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Link: https://arxiv.org/abs/2606.20044

Authors: Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: emphasize low-frequency cues, existing methods tend, existing methods, low-frequency cues, significant progress

Comments: Accepted in ICML 2026

Abstract:Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low, mid, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
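
Splitting a feature into low, mid, and high-frequency parts can be sketched with a plain discrete Fourier transform. This is a minimal illustration on a 1D signal with fixed band edges. The SDM described above partitions adaptively, so the hard cut-offs and all function names here are assumptions.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(coefs):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(coefs)
    return [
        sum(coefs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_bands(x, low_max, mid_max):
    """Split x into low/mid/high components by frequency index."""
    n = len(x)
    bands = {"low": [0j] * n, "mid": [0j] * n, "high": [0j] * n}
    for k, coef in enumerate(dft(x)):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        name = "low" if freq <= low_max else "mid" if freq <= mid_max else "high"
        bands[name][k] = coef
    return {name: idft(coefs) for name, coefs in bands.items()}


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 2.0, 1.0]
bands = split_bands(signal, low_max=1, mid_max=2)
recon = [sum(parts) for parts in zip(bands["low"], bands["mid"], bands["high"])]
```

The three components sum back to the original signal, so the split loses nothing. Per-band energies (sums of squares) could then be compared across modalities in the spirit of the energy alignment mentioned in the abstract.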

62. 【2606.20035】PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

Link: https://arxiv.org/abs/2606.20035

Authors: Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: dense prediction networks, model higher-order feature, additive feature transformations, prediction networks rely, dense prediction

Comments: Accepted to ICANN 2026

Abstract:Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic-exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
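
A product unit computes a weighted product of its inputs via exp(sum of w_i * log x_i), which breaks for non-positive inputs and can overflow. The sketch below shows one way to combine a smooth positivity mapping with log-domain clipping, assuming softplus as the mapping and a symmetric clip range. Both choices, and the `eps` and `clip` values, are illustrative assumptions rather than the paper's exact formulation.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_product_unit(inputs, weights, eps=1e-6, clip=10.0):
    """Product unit evaluated in the log domain with clipping.

    Computes exp(clip(sum_i w_i * log(softplus(x_i) + eps))).
    """
    log_sum = sum(w * math.log(softplus(x) + eps) for x, w in zip(inputs, weights))
    log_sum = max(-clip, min(clip, log_sum))  # keep exp() in a safe range
    return math.exp(log_sum)


ordinary = stable_product_unit([2.0, 3.0], [1.0, 1.0])
extreme = stable_product_unit([-50.0], [3.0])
```

For moderate inputs the unit behaves like a plain product of the softplus-mapped values. For the extreme input the log-domain sum is clipped at `-clip`, so the output stays at `exp(-10)` instead of underflowing.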

63. 【2606.20032】ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

Link: https://arxiv.org/abs/2606.20032

Authors: Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unlike traditional remote, identifies land cover, arbitrary text prompts, traditional remote sensing, Open-Vocabulary Change Detection

Abstract:Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at this https URL.

64. 【2606.20027】QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Link: https://arxiv.org/abs/2606.20027

Authors: Luca Zedda, Davide Antonio Mura, Cecilia Di Ruberto, Maurizio Atzori, Muhammed Furkan Dasdelen, Carsten Marr, Andrea Loddo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Attention-based Multiple Instance, Multiple Instance Learning, Attention-based Multiple, Instance Learning aggregators, Learning aggregators

Abstract:Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: this https URL
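
Two of the listed components, query/key normalization and output gating, can be sketched in a single-head attention pooling step. This is a generic illustration, assuming RMS normalization of the query and keys and a sigmoid gate on the pooled vector. It is not the QG-MIL architecture itself, and all names and values are hypothetical.

```python
import math


def rms_norm(v, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def softmax(scores):
    """Softmax with max subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gated_attention_pool(query, instances, gate_logits, scale=1.0):
    """Pool instances with normalized-QK attention, then gate the output.

    Normalization is applied to the scores only; pooling uses raw instances.
    """
    q = rms_norm(query)
    keys = [rms_norm(x) for x in instances]
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * x[d] for w, x in zip(weights, instances)) for d in range(dim)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return weights, [g * p for g, p in zip(gates, pooled)]


# The second instance has 100x the magnitude of the first.
instances = [[1.0, 0.0], [100.0, 0.0], [0.0, 1.0]]
weights, output = gated_attention_pool([1.0, 1.0], instances, [0.0, 0.0])
```

Without the normalization the raw dot products would be 1, 100, and 1, and the softmax would put essentially all attention on the second instance. With it, the weights stay close to uniform, which illustrates the attention-concentration problem the abstract sets out to reduce.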

65. 【2606.19998】Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Link: https://arxiv.org/abs/2606.19998

Authors: Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remain black boxes, failure detection essential, irreversible harm, making generalizable, detection essential

Abstract:Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.
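
As a rough illustration of an information-theoretic signal for "whether actions remain diverse", the sketch below computes the Shannon entropy of a window of discretized actions. Treating entropy over action labels as the diversity signal is an assumption made here for illustration, and the abstract does not define the three Tri-Info signals in detail.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in bits) of a sequence of discrete action labels."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical action windows from two rollouts.
stuck = ["left", "left", "left", "left"]
varied = ["left", "right", "up", "down"]

stuck_bits = action_entropy(stuck)
varied_bits = action_entropy(varied)
```

A rollout that keeps repeating one action has zero entropy, while four equally used actions give two bits. A detector could track such a quantity over a sliding window, alongside separate signals for temporal consistency and coupling to state transitions.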

66. 【2606.19985】Vision-Reasoning-Guided Occlusion Removal from Light Fields

Link: https://arxiv.org/abs/2606.19985

Authors: Mohamed Youssef, Oliver Bimber

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vegetation severely limits, severely limits visibility, Occlusion-robust scene recovery, dense foreground vegetation, foreground vegetation severely

Abstract:Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

67. 【2606.19970】CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Link: https://arxiv.org/abs/2606.19970

Authors: Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flow-matching generators define, Latent, probability path, autoencoder latent space, define the prior

Comments: Preprint, Under Review

Abstract:Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.

68. 【2606.19966】Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

Link: https://arxiv.org/abs/2606.19966

Authors: Yucheng Xing, Ling Huang, Pei Liu, Jingying Ma, Jiaqing Xu, Kai He, Mengling Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: computational cancer prognosis, Whole-slide images, cancer prognosis, computational cancer, Whole-slide

Abstract:Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
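
As a rough illustration of the Dirichlet-based Subjective Logic step mentioned in the abstract, the sketch below converts non-negative per-class evidence into belief masses and an uncertainty mass. It uses the formulation common in the evidential deep learning literature (alpha = evidence + 1), which is an assumption here; the abstract does not state SAEFS's exact parameterization. The function name `dirichlet_opinion` is hypothetical, and the cautious conjunction rule used for fusion is not shown.

```python
import numpy as np

def dirichlet_opinion(evidence):
    """Map per-class evidence to (belief masses, uncertainty mass).

    Assumed formulation: alpha_k = e_k + 1, S = sum(alpha),
    b_k = e_k / S, u = K / S, so that sum(b) + u == 1.
    """
    evidence = np.asarray(evidence, dtype=float)
    num_classes = evidence.size
    strength = evidence.sum() + num_classes  # Dirichlet strength S
    belief = evidence / strength
    uncertainty = num_classes / strength
    return belief, uncertainty

# With no evidence at all, the whole mass is assigned to uncertainty.
belief, uncertainty = dirichlet_opinion([0.0, 0.0, 0.0, 0.0])
```

More evidence shrinks the uncertainty mass, which is what lets a fusion rule down-weight a source that has seen little support for any class.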

69. 【2606.19965】ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

Link: https://arxiv.org/abs/2606.19965

Authors: Yihao Wang, Zijian He, Jie Ren, Keze Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, Multimodal large, large language models, large language, increasingly expected

Comments: 29 pages, 11 figures

Abstract:Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

70. 【2606.19961】Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

Link: https://arxiv.org/abs/2606.19961

Authors: Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: degrading downstream perception, downstream perception tasks, discard fine spatial, fine spatial details, Latent diffusion models

Abstract:Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, 32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.

71. 【2606.19958】SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

Link: https://arxiv.org/abs/2606.19958

Authors: Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional animation production, production relies heavily, animation production relies, Traditional animation, iterative refinement

Abstract:Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.

72. 【2606.19950】Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

Link: https://arxiv.org/abs/2606.19950

Authors: Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, show great potential

Comments: Accepted by MICCAI 2025

Abstract:Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs' reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
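
For readers unfamiliar with the metric reported above, here is a minimal sketch of the standard binned Expected Calibration Error: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The equal-width binning, the default of 10 bins, and the function and argument names are assumptions for illustration; the abstract does not say which ECE variant the paper uses.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; a confidence of 0 joins bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share of samples
    return ece

# Four answers given at 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the single occupied bin contributes a gap of 0.15.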

73. 【2606.19944】Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Link: https://arxiv.org/abs/2606.19944

Authors: Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, textual query rarely, query rarely carries, explicit geometric anchor

Comments: ECCV

Abstract:Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an "ink-budget" regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

74. 【2606.19939】DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

Link: https://arxiv.org/abs/2606.19939

Authors: Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mathematical Expression Generation, complex two-dimensional layouts, Handwritten Mathematical Expression, long-range structural dependencies, Expression Generation

Abstract:Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.

75. 【2606.19938】Triangular Consistency as a Universal Constraint for Learning Optical Flow

Link: https://arxiv.org/abs/2606.19938

Authors: Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: supervision type, network architecture, propose triangular consistency, agnostic to network, image-pair and multi-frame

Comments: Accepted by ECCV 2026

Abstract:We propose triangular consistency as a first-principled constraint for optical flow, which is agnostic to network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. The constraint is simple but powerful: compose two flows to induce a third flow, and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which becomes data augmentation. This triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a "universal" plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
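
The idea of composing two flows into a third can be sketched as follows. A pixel at position x in frame A moves to x + F_ab(x) in frame B, where it picks up the further displacement F_bc, so the induced A-to-C flow is F_ab(x) + F_bc(x + F_ab(x)). The code below is an illustrative NumPy version under simplifying assumptions that are not taken from the paper: flows are (H, W, 2) arrays with channels (dx, dy), sampling is nearest-neighbour with border clamping, and the residual is a plain mean endpoint distance. The function names are hypothetical.

```python
import numpy as np

def compose_flows(flow_ab, flow_bc):
    """Induce an A->C flow from A->B and B->C flows of shape (H, W, 2)."""
    h, w, _ = flow_ab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Landing position of each A pixel in B, rounded and clamped to the grid.
    xb = np.clip(np.rint(xs + flow_ab[..., 0]).astype(int), 0, w - 1)
    yb = np.clip(np.rint(ys + flow_ab[..., 1]).astype(int), 0, h - 1)
    return flow_ab + flow_bc[yb, xb]

def triangular_residual(flow_ab, flow_bc, flow_ac):
    """Mean endpoint distance between the composed flow and the direct A->C flow."""
    diff = compose_flows(flow_ab, flow_bc) - flow_ac
    return float(np.linalg.norm(diff, axis=-1).mean())

# Constant shifts: 1 px right, then 2 px down, agree with a direct (1, 2) flow.
f_ab = np.tile(np.array([1.0, 0.0]), (4, 5, 1))
f_bc = np.tile(np.array([0.0, 2.0]), (4, 5, 1))
f_ac = np.tile(np.array([1.0, 2.0]), (4, 5, 1))
residual = triangular_residual(f_ab, f_bc, f_ac)
```

Setting frame C equal to frame A turns the same residual into a cycle-consistency check, matching case (i) in the abstract. A training loss would need differentiable (for example bilinear) sampling rather than the rounding used here.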

76. 【2606.19934】Speeding up the annotation process in semantic segmentation industrial applications

Link: https://arxiv.org/abs/2606.19934

Authors: Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Current machine learning, Current machine, commonly require large, models commonly require, commonly require

Abstract:Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to a higher chance of human error. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Some previous studies have quantified labeling time, and others have explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under MIT License with permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.

77. 【2606.19932】Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models

Link: https://arxiv.org/abs/2606.19932

Authors: Jindi Lv, Aoyu Li, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Qing Ye, Yueqi Duan, Wentao Feng, Jiancheng Lv

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: long visual sequences, modeling long visual, demonstrates strong efficiency, Mamba demonstrates strong, visual sequences

Comments: Accepted by ICML 2026

Abstract:Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.

78. 【2606.19927】CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Link: https://arxiv.org/abs/2606.19927

Authors: Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: learning-based methods typically, methods typically rely, inflexible reasoning-length control, reasoning-length control strategies, reinforcement learning-based methods

Abstract:In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at this https URL.
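
The competence estimate described above (an exponential moving average of pass rates that routes training into progressive stages) can be sketched in a few lines. Everything specific here is an assumption for illustration: the class name `CompetenceTracker`, the decay of 0.9, the two thresholds, and the resulting three stages are hypothetical and are not values from the paper.

```python
class CompetenceTracker:
    """EMA of batch pass rates, mapped to a training stage index."""

    def __init__(self, decay=0.9, thresholds=(0.3, 0.6)):
        self.decay = decay            # hypothetical smoothing factor
        self.thresholds = thresholds  # hypothetical stage boundaries
        self.estimate = None

    def update(self, pass_rate):
        # The first observation initializes the estimate; later ones are smoothed.
        if self.estimate is None:
            self.estimate = pass_rate
        else:
            self.estimate = self.decay * self.estimate + (1 - self.decay) * pass_rate
        return self.estimate

    def stage(self):
        # Stage index = number of thresholds the smoothed estimate has reached.
        if self.estimate is None:
            return 0
        return sum(self.estimate >= t for t in self.thresholds)

tracker = CompetenceTracker(decay=0.5)
for rate in (0.2, 0.8, 1.0):
    tracker.update(rate)
current_stage = tracker.stage()
```

A reward function could then read the stage index to prefer longer reasoning early on and more concise reasoning once the estimate is high, which is the shift the abstract describes.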

79. 【2606.19915】SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Link: https://arxiv.org/abs/2606.19915

Authors: Jiayu Tang, Yuchen Zhou, Chao Gou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, large language model, multimodal large, large language, crucial for understanding

Comments: Accepted by IJCAI 2026

Abstract:Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

80. 【2606.19908】Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

Link: https://arxiv.org/abs/2606.19908

Authors: Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J.M. Jaspers, Francisco Caetano, Mauricio A. Alvarez, Fons van der Sommen, Sam Van der Jeught

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Process Prior, Endoscopic video analysis, motion artifacts, computer-assisted interventions, analysis is essential

Abstract:Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but video sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions can distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation. Effective restoration therefore requires methods that exploit temporal continuity rather than treating frames in isolation. We introduce a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration that replaces the standard factorized latent prior with a temporal Gaussian process prior, enabling interpolation of missing frames with uncertainty-aware reconstruction. The framework combines endoscopy-specific encoders, including a convolutional EndoVAE backbone and pretrained Vision Transformer encoders from GastroNet-5M, with two scalable GP approximations: Hierarchical Prior Approximation (HPA) and Sparse Precision Approximation (SPA). Specular reflections are handled using a DUCKNet-based masking pipeline that excludes corrupted pixels from the reconstruction objective. On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduced image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE was reduced by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average increase of 27.3% in training time per epoch. Finally, the GP posterior provides per-frame uncertainty estimates that reflect temporal support and offer a confidence signal for restored frames.
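
To make the idea of a temporal GP prior over latents concrete, the sketch below interpolates latent codes at missing frame times using exact GP regression with an RBF kernel, and returns a per-frame posterior variance. This is a textbook illustration only: it does not implement the paper's HPA or SPA approximations, and the kernel choice, length scale, noise level, and function names are all assumptions.

```python
import numpy as np

def rbf_kernel(t1, t2, length_scale=2.0):
    """Squared-exponential kernel between two vectors of frame times."""
    d = np.subtract.outer(np.asarray(t1, float), np.asarray(t2, float))
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_interpolate(t_obs, z_obs, t_new, length_scale=2.0, noise=1e-3):
    """Posterior mean and variance of latents at t_new given (t_obs, z_obs).

    z_obs has shape (n_obs, latent_dim); each latent dimension is treated
    as an independent GP sharing the same kernel.
    """
    z_obs = np.asarray(z_obs, float)
    k_oo = rbf_kernel(t_obs, t_obs, length_scale) + noise * np.eye(len(t_obs))
    k_no = rbf_kernel(t_new, t_obs, length_scale)
    mean = k_no @ np.linalg.solve(k_oo, z_obs)
    cov = rbf_kernel(t_new, t_new, length_scale) - k_no @ np.linalg.solve(k_oo, k_no.T)
    return mean, np.diag(cov)

# Frame 3 is missing but has close neighbours; frame 20 is far from all data.
t_obs = [0.0, 1.0, 2.0, 4.0, 5.0]
z_obs = np.sin(np.array(t_obs))[:, None]
mean, var = gp_interpolate(t_obs, z_obs, [3.0, 20.0])
```

The variance for the well-supported frame is much smaller than for the distant one, which mirrors the abstract's remark that the posterior uncertainty reflects temporal support.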

81. 【2606.19901】Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

Link: https://arxiv.org/abs/2606.19901

Authors: Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: stable linear recurrence, Linear recurrent unit, demonstrated promising accuracy, long-range dependency tasks, linear recurrence

Comments: Accepted to CVPR 2026 Findings

Abstract:The linear recurrent unit (LRU), designed with a principled formulation for stable linear recurrence, has demonstrated promising accuracy and robustness on long-range dependency tasks. However, its static parameterization and single-scan method limit its applicability to 2D vision tasks. In this study, we propose an LRU-based restoration network with a semantic modulating unit (SMU) to achieve a harmonious balance between performance and efficiency in single-image super-resolution. The SMU plays three key roles: LRU modulation, spatial categorization, and feature enhancement through learned prototypes. Extensive experiments demonstrate that our method quantitatively and qualitatively surpasses recent state-of-the-art methods. Notably, our approach achieves superior performance with computational complexity on par with existing methods. The source code and models are available at this https URL

82. 【2606.19889】SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

Link: https://arxiv.org/abs/2606.19889

Authors: Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scaling robot policy, substantial safety risks, vivo exploration poses, exploration poses substantial, poses substantial safety

Abstract:Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.

83. 【2606.19882】Multimodal Concept Bottleneck Models

Link: https://arxiv.org/abs/2606.19882

Authors: Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Concept Bottleneck Models, Concept Bottleneck Model, deep learning networks, Concept Bottleneck, Multimodal Concept Bottleneck

Comments: Presented at NeurIPS 2025 Mechanistic Interpretability Workshop

Abstract:Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited by their inability to generalize beyond a fixed set of predefined classes and by the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs to CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

84. 【2606.19874】MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

Link: https://arxiv.org/abs/2606.19874

Authors: Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Simultaneous Localization, Simultaneous Localization, Visual Simultaneous, Gaussian Splatting, high-fidelity scene reconstruction

Comments: ICRA 2026

Abstract: 3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica, compared with MonoGS.

85. 【2606.19867】PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

Link: https://arxiv.org/abs/2606.19867

Authors: Dong Yeong Kim, Jaewon Choi, Youmin Shin, Jungyu Lee, Myeongseop Kim, Jinwook Choi, Joo Whan Kim, Young-Gon Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Computed Tomography, poses radiation risks, pediatric craniofacial abnormalities, diagnosing pediatric craniofacial, craniofacial abnormalities

Comments: 11 pages, 5 figures

Abstract:Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.

86. 【2606.19849】ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

Link: https://arxiv.org/abs/2606.19849

Authors: Yang Tan, Junlong Tong, Linan Yue, Hao Wu, Pengfei Fang, Xiaoyu Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low query latency, continuously process incoming, query-time responsiveness critical, process incoming video, maintaining low query

Comments: 19 pages, 7 figures, 13 tables

Abstract:Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.

87. 【2606.19838】OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

Link: https://arxiv.org/abs/2606.19838

Authors: Jiwoong Yang, Haejun Chung, Ikbeom Jang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: chest radiography, clinical practice, mammography and chest, standard component, component of clinical

Comments: Accepted at MICCAI 2026

Abstract:Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at this https URL.

88. 【2606.19836】World Engine: Towards the Era of Post-Training for Autonomous Driving

Link: https://arxiv.org/abs/2606.19836

Authors: Tianyu Li, Li Chen, Caojun Wang, Haochen Liu, Kashyap Chitta, Zhenjie Yang, Yuhang Lu, Naisheng Ye, Yihang Qiu, Yufei Wang, Luoxi Zou, Jiaxin Peng, Jin Pan, Zhaoyu Su, Andrei Bursuc, Shengbo Eben Li, Andreas Geiger, Peng Su, Hongyang Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe consequences, vehicles must operate, operate safely, real world, World Engine

Comments: Technical Report. Project Page: https://opendrivelab.com/WorldEngine/

Abstract: Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical "long-tail" events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.

89. 【2606.19835】Neural Events: Discrete Asynchronous Autoencoders for Event-Based Vision

Link: https://arxiv.org/abs/2606.19835

Authors: Roberto Pellerito, Daniel Gehrig, Shintaro Shiba, Davide Scaramuzza

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Event cameras capture, cameras capture dynamic, capture dynamic scenes, exceptional temporal fidelity, microsecond resolution

Abstract: Event cameras capture dynamic scenes with exceptional temporal fidelity by representing them as a continuous stream of microsecond-resolution events. Each individual event, however, carries only minimal semantic value, merely signaling a localized brightness change. To derive meaningful signals, downstream algorithms need to quickly integrate cues from a potentially massive torrent of low-information events. Current architectures, however, are easily overwhelmed, struggling to balance capturing fine-grained temporal dynamics against maintaining a manageable data throughput. This paper proposes a framework to re-tokenize event streams into a small set of highly informative neural events, each representing a local spatio-temporal context window with a discrete learnable code. Every time this code flips, a neural event is triggered, yielding a highly compressed data stream. We demonstrate that, across object detection and classification, networks trained on neural events match or surpass the performance of state-of-the-art approaches while reducing the event rate by a factor of 2.0.

90. 【2606.19828】3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

Link: https://arxiv.org/abs/2606.19828

Authors: Jintang Xue, Xinyu Wang, Yixing Wu, Jingwen Chen, C.-C. Jay Kuo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, large language models, multimodal large, large language, frozen point encoder

Abstract: 3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token part_k; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All of this is achieved with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and with no segmentation decoder or bounding-box head.

91. 【2606.19824】CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images

Link: https://arxiv.org/abs/2606.19824

Authors: Junho Moon, Haejun Chung, Ikbeom Jang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains challenging due, severe class imbalance, tortuous anatomical structures, cerebral vasculature, frequent discontinuities

Comments: Accepted at MICCAI 2026

Abstract:Accurate segmentation of thin, tortuous anatomical structures, such as retinal vessels, cerebral vasculature, and facial wrinkles, remains challenging due to low contrast, frequent discontinuities, and severe class imbalance. Although recent convolutional and Transformer-based models have improved performance, they often yield fragmented predictions and fail to recover fine branches. We propose CSWinUNETR, a general-purpose backbone for 2D and 3D thin-structure segmentation. It employs cross-shaped stripe self-attention to model long-range principal-axis context and incorporates cyclic shifts to enhance information exchange across stripes. To better preserve fine-grained details, we further introduce a detail-enhanced multi-scale self-attention module that aggregates contextual features from multi-resolution representations. In addition, we propose sparse-control dynamic snake convolution, which reconstructs reliable dense curvilinear kernels from sparsely predicted control points to better follow tortuous geometry. Extensive experiments on four benchmarks across ophthalmology, neurovascular imaging, and dermatology demonstrate that CSWinUNETR consistently outperforms state-of-the-art methods without task-specific post-processing or topology-aware losses. The code is available at this https URL.

92. 【2606.19817】Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance

Link: https://arxiv.org/abs/2606.19817

Authors: Myeongseok Nam, Donghoon Yeo, Seungwook Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: supplement limited real, computer vision models, limited real datasets, training computer vision, image generative models

Comments: 9 pages, 4 figures

Abstract:With the recent advent of image generative models, synthetic data are increasingly being used to supplement limited real datasets for training computer vision models. However, not all synthetic datasets improve performance equally, and their effectiveness can only be assessed by training a downstream model, which is computationally expensive and time-consuming. This problem is pronounced in the task of object detection, where the required annotations are much more dense due to bounding boxes. In this paper, we propose a pre-computable metric family, dubbed Conditional-Composition Domain Match (CCDM), which serves as a proxy for the relative utility of candidate synthetic training sets for downstream detection. Experiments on the VisDrone-DET dataset show that the CCDM metric families achieve a Spearman correlation of 1.0 with the downstream performance of YOLOv8, clearly outperforming existing metrics for synthetic image evaluation.
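A Spearman correlation of 1.0, as reported above, means a metric orders the candidate synthetic sets exactly as downstream detector performance does. The sketch below computes the tie-free Spearman coefficient in plain Python to make that concrete. The function names are hypothetical, and the scores and mAP values are invented for illustration; none of them come from the paper.

```python
def ranks(values):
    """Return 1-based ranks of values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Valid only when neither sequence contains ties and n >= 2.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented numbers: a proxy score and a detector mAP for four synthetic sets.
proxy_scores = [0.1, 0.4, 0.3, 0.9]
detector_map = [20.0, 31.5, 30.0, 45.2]
rho = spearman_no_ties(proxy_scores, detector_map)
```

Because the coefficient depends only on ranks, a value of 1.0 says nothing about how large the performance gaps are, only that the ordering agrees.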

93. 【2606.19805】ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

Link: https://arxiv.org/abs/2606.19805

Authors: Zijie Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: creators reuse cinematic, reuse cinematic moves, freshly generated, creators reuse, reuse cinematic

Comments: Accepted by SCA 2026 (poster)

Abstract: Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales -- a sweep across a galaxy versus a nudge across a desk -- and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as $\|T\|/Z$, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number $\Pi = \|\Delta T\| / \bar{Z}$, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it -- not the raw trajectory -- is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads $\Pi$ off any reference video and re-realizes it against the target scene's own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that -- unlike the similarity-aligned TransErr -- exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
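The Parallax Number in the abstract above is the ratio of the per-frame translation magnitude to the mean scene depth. A minimal sketch of preserving that ratio when moving a translation step from a reference scene to a target scene follows. It assumes NumPy, and the function names and the choice of a single mean depth per frame are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def parallax_number(delta_t, mean_depth):
    """Translation magnitude divided by mean scene depth (dimensionless)."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return float(np.linalg.norm(delta_t)) / float(mean_depth)


def rescale_translation(delta_t_ref, mean_depth_ref, mean_depth_tgt):
    """Return a target-scene translation with the same parallax number.

    The direction of motion is kept; only the magnitude is rescaled.
    Rotation is not touched here, matching the abstract's description.
    """
    delta_t_ref = np.asarray(delta_t_ref, dtype=float)
    norm = np.linalg.norm(delta_t_ref)
    if norm == 0.0:
        # No translation in the reference step, so nothing to transfer.
        return np.zeros_like(delta_t_ref)
    pi_ref = parallax_number(delta_t_ref, mean_depth_ref)
    return (delta_t_ref / norm) * pi_ref * mean_depth_tgt


# Hypothetical values: a 5-unit move at depth 10 transferred to a scene at depth 2.
step = rescale_translation([3.0, 4.0, 0.0], mean_depth_ref=10.0, mean_depth_tgt=2.0)
```

In this example the reference step has a parallax number of 0.5, so the transferred step has magnitude 1.0 in the shallower target scene rather than the original 5.0.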

94. 【2606.19804】HypOProto: Hyperbolic Ordinal Prototypes for Left Ventricular Filling Pressure Classification

Link: https://arxiv.org/abs/2606.19804

Authors: Victoria Wu, Nima Hashemi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ventricular Filling Pressure, Left Ventricular Filling, Filling Pressure, Left Ventricular, Ventricular Filling

Abstract: Echocardiography (echo) is a widely used imaging modality for assessing cardiac function, with Left Ventricular Filling Pressure (LVFP) serving as a critical physiological marker for conditions such as heart failure. Standard LVFP classification into normal vs. elevated categories relies on the Doppler-derived $E/e'$ ratio, which is operator-dependent and often unavailable in resource-limited settings, motivating methods that infer LVFP directly from B-mode echo. Existing deep learning approaches achieve high performance but remain largely black-box, limiting clinical interpretability. We propose HypOProto, a hyperbolic, ordinal prototype-based framework for interpretable LVFP classification using a frozen, explainable foundation model backbone. HypOProto arranges prototypes along the physiological $E/e'$ scale, placing borderline cases near the hyperboloid root where small angular differences separate similar cases, while normal and elevated cases occupy outward positions reflecting increasing diagnostic certainty. This hyperbolic geometry encodes clinically meaningful ordinal relationships and improves interpretability. We also introduce a novel Hyperbolic Prototype Angular Separation (HyperPAS) loss, enforcing inter-class prototype separation in hyperbolic space. HypOProto achieves SOTA performance while maintaining transparency, and highlights clinically relevant regions in visualizations. This work represents the first prototype-based framework for LVFP classification in echo. Our code can be found at this https URL.

95. 【2606.19802】Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

Link: https://arxiv.org/abs/2606.19802

Authors: Nicolas Zilberstein, Morteza Mardani, Santiago Segarra

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce blurry reconstructions, minimize error produce, error produce blurry, quality yield sharp, Image restoration faces

Abstract: Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter t acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying t exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA ($128\times 128$) and AFHQ ($256\times 256$) across several linear and nonlinear inverse tasks validate our findings.

96. 【2606.19776】Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Link: https://arxiv.org/abs/2606.19776

Authors: Jianing Li, Zhou Fang, Yijiang Liu, Li Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, driving advances, made significant, significant progress, advances in applications

Abstract:Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

97. 【2606.19736】VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

Link: https://arxiv.org/abs/2606.19736

Authors: Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains highly challenging, world remains highly, targets undergo continuous, undergo continuous geometric, UAV reconnaissance

Comments: Accepted by ICME 2026

Abstract:Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.

98. 【2606.19735】GLARE: A Natural Language Interface for Querying Global Explanations

Link: https://arxiv.org/abs/2606.19735

Authors: Bhavan Vasu, Rajesh Mangannavar

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: hinders practical exploration, understanding vision models, decision contexts, practical exploration, global explanations

Comments: 16 pages, 2 figures

Abstract:While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
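The abstract above describes translating a question into a structured SQL query that aggregates local explanation data. The sketch below shows what such an aggregation might look like with Python's built-in `sqlite3` module. The table `local_explanations`, its columns, the helper names, and the demo rows are all hypothetical; the paper's actual schema and queries are not given in the abstract, and the LLM translation step is not modeled here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of per-image concept attributions (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE local_explanations ("
        "image_id INTEGER, predicted_class TEXT, concept TEXT, attribution REAL)"
    )
    conn.executemany(
        "INSERT INTO local_explanations VALUES (?, ?, ?, ?)",
        [
            (1, "zebra", "stripes", 0.5),
            (2, "zebra", "stripes", 1.0),
            (1, "zebra", "grass", 0.25),
            (2, "zebra", "grass", 0.25),
            (3, "horse", "grass", 0.5),
        ],
    )
    return conn


def top_concepts(conn, predicted_class, limit=3):
    """Aggregate local explanations into a class-level (global) summary.

    Answers a question such as "which concepts matter most for class X?"
    by averaging attributions per concept over all images of that class.
    """
    return conn.execute(
        "SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n "
        "FROM local_explanations WHERE predicted_class = ? "
        "GROUP BY concept ORDER BY mean_attr DESC LIMIT ?",
        (predicted_class, limit),
    ).fetchall()


conn = build_demo_db()
summary = top_concepts(conn, "zebra")
```

Using bound parameters (`?`) rather than string formatting keeps the query safe even when the class name originates from free-form user text.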

99. 【2606.19733】QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

Link: https://arxiv.org/abs/2606.19733

Authors: Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Efficiently retrieving specific, Efficiently retrieving, natural language prompts, language prompts remains, retrieving specific

Comments: 8 pages, 4 figures, 6 tables. Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

Abstract:Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

100. 【2606.19718】One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

Link: https://arxiv.org/abs/2606.19718

Authors: Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human, human image, pose, paper addresses, addresses the challenge

Comments: 30 pages, 10 figures

Abstract: This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints or synthesize human images with a generalizable human NeRF, which uses human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may fail to accurately recover occluded/invisible human parts when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., 3D normal map and color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction based customized refinement to enhance fine details when tested on novel this http URL results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at this https URL.

101. 【2606.19712】Efficient Neural Network Model Selection for Few-Class Application Datasets

Link: https://arxiv.org/abs/2606.19712

Authors: Bryan Bo Cao, Abhinav Sharma, Lawrence O'Gorman, Michael Coss, Shubham Jain

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-performance neural networks, benchmarking high-performance neural, effort has focused, focused on developing, developing and benchmarking

Comments: 36 pages, 9 tables, 13 figures

Abstract:While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are typically evaluated on datasets with thousands of classes, yet many real-world applications involve fewer than ten. To address this understudied but common setting, we develop a measure of classification difficulty based on data-side properties and show how it enables more efficient model selection for few-class datasets, where traditional approaches are less effective. We term this phenomenon "few-class distinctiveness". Our metric allows comparison of models and datasets 6 to 29$\times$ faster than repeated training and testing. Leveraging this insight, we extend scaled model families below the smallest published models, achieving greater efficiency at similar accuracy, for example models up to 42% smaller than YOLOv5-nano for a mobile robot task. Targeting resource-constrained applications, we demonstrate few-class model selection across mobile robot, drone, and IoT scenarios, highlighting practical gains in efficiency without sacrificing performance.

102. 【2606.19706】NEST: Narrative Event Structures in Time for Long Video Understanding

Link: https://arxiv.org/abs/2606.19706

Authors: Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: handle extended token, extended token streams, long video sequences, increasingly long video, Long Video Understanding

103. 【2606.19684】Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

Link: https://arxiv.org/abs/2606.19684

Authors: Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modified text description, Composed image retrieval, image retrieval retrieves, Composed image, composed query

Comments: SOICT 2025

Abstract:Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.

104. 【2606.19682】Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

Link: https://arxiv.org/abs/2606.19682

Authors: Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh, Thanh-Tien Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Chi Minh City, City AI Challenge, Chi Minh, Minh City, paper presents Vortex

Comments: SOICT 2025

Abstract:This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an 'Excellent' overall performance with 'Outstanding' results in the question-answering (QA) task. This demonstrates the complementary strengths of CLIP and SigLIP2 and confirms the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
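
The Reciprocal Rank Fusion step mentioned in this abstract can be illustrated with a minimal sketch. It is not the Vortex implementation: the function name, the frame identifiers, and the constant k=60 (a commonly used default for RRF) are assumptions made for the example. Each item receives the sum of 1 / (k + rank) over the ranked lists it appears in, so items ranked well by both retrievers rise to the top.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into one list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (assumed value; 60 is a common default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the items it contains.
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical top-3 results from two embedding models for the same query.
clip_ranking = ["frame_12", "frame_07", "frame_31"]
siglip_ranking = ["frame_07", "frame_31", "frame_12"]

fused = reciprocal_rank_fusion([clip_ranking, siglip_ranking])
print(fused)
```

Because RRF uses only ranks, the two retrievers' similarity scores do not need to be calibrated against each other, which is one reason it is a convenient way to combine heterogeneous embedding models.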

105. 【2606.19676】Morpher: Toward Robust Simultaneous Motion-Location Editing

Link: https://arxiv.org/abs/2606.19676

Authors: Haengbok Chung

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved remarkable success, Diffusion models, motion-location editing, motion, editing

Abstract:Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. Our framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) Then, we introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics: one measures background consistency before and after motion editing, and the other measures the fidelity of motion editing via the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.

106. 【2606.19662】Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

Link: https://arxiv.org/abs/2606.19662

Authors: Bingshuo Qian, Xiang Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-representation diffusion models, denoising complementary views, performance depends critically, Multi-representation diffusion, improve visual synthesis

Comments: 25 pages, 9 figures, 4 tables

Abstract:Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at this https URL

107. 【2606.19651】BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

Link: https://arxiv.org/abs/2606.19651

Authors: Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: privacy-preserving data sharing, simulate disease trajectories, augment under-represented cohorts, neurology and neuro-oncology, brain MRI latent

Abstract:Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

108. 【2606.19646】SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

Link: https://arxiv.org/abs/2606.19646

Authors: Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: chart question answering, lightweight language reasoning, Vision-language models, question answering, VLM

Comments: Demo paper submitted at CIKM 2026. 4 pages, 2 figures

109. 【2606.19641】Scaling Self-Play for End-to-End Driving

Link: https://arxiv.org/abs/2606.19641

Authors: Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: offline human-demonstration datasets, provide limited state, limited state coverage, long-tail agent interactions, closed-loop feedback

Abstract:End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

110. 【2606.19617】GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

Link: https://arxiv.org/abs/2606.19617

Authors: Max Shad, Naeem Khoshnevis

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: Local Spectral Representation, Global-Bandwidth Local Spectral, fixed-grid local spectral, Local Spectral, Spectral Representation

Abstract:We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.

111. 【2606.19584】Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

Link: https://arxiv.org/abs/2606.19584

Authors: Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: static feature extractors, large downstream models, Vision foundation models, placing the burden, typically trained

Abstract:Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks -- offering a direct path toward adaptive, instruction-driven visual intelligence.

112. 【2606.19565】Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.19565

Authors: Navin Ranjan, Andreas Savakis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mixed-precision PTQ framework, PTQ framework, mixed-precision PTQ, VLA functional boundaries, key VLA functional

Abstract:We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
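
To make the idea of sensitivity-guided bit allocation under a size budget concrete, here is a small greedy sketch. It is not the Mix-QVLA algorithm: the layer names, parameter counts, sensitivity values, and candidate bit-widths are all hypothetical, only a model-size budget is handled (no BitOps budget and no time-aware scores), and the greedy rule is a generic heuristic chosen for illustration.

```python
def allocate_bits(layers, budget_bits, choices=(8, 4, 2)):
    """Greedy mixed-precision allocation (illustrative heuristic only).

    layers: list of (name, num_params, sensitivity) tuples.
    budget_bits: maximum total weight storage, in bits.
    choices: candidate bit-widths; every layer starts at the widest one.
    """
    choices = sorted(choices, reverse=True)
    level = {name: 0 for name, _, _ in layers}   # index into choices
    size = {name: n for name, n, _ in layers}
    sens = {name: s for name, _, s in layers}
    total = sum(size[name] * choices[0] for name in level)

    def saved_bits(name):
        # Storage saved by moving this layer to the next narrower width.
        return size[name] * (choices[level[name]] - choices[level[name] + 1])

    while total > budget_bits:
        candidates = [n for n in level if level[n] + 1 < len(choices)]
        if not candidates:
            raise ValueError("budget cannot be met with the given bit-widths")
        # Narrow the layer with the lowest sensitivity per bit saved.
        victim = min(candidates, key=lambda n: sens[n] / saved_bits(n))
        total -= saved_bits(victim)
        level[victim] += 1
    return {name: choices[level[name]] for name in level}


# Hypothetical layer statistics: (name, parameter count, sensitivity score).
layers = [("vision", 100, 0.9), ("language", 300, 0.2), ("action_head", 50, 1.5)]
plan = allocate_bits(layers, budget_bits=2400)
print(plan)
```

In this toy setting the large, low-sensitivity layer is narrowed first, while the small, highly sensitive action head keeps full width, which is the qualitative behavior one would expect from any sensitivity-aware allocator.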

113. 【2606.19534】PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Link: https://arxiv.org/abs/2606.19534

Authors: Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: achieved remarkable progress, Multimodal large language, achieved remarkable, remarkable progress, multimodal diffusion language

Comments: Code available at [this https URL](https://github.com/MSALab-PKU/PerceptionDLM)

114. 【2606.19531】ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Link: https://arxiv.org/abs/2606.19531

Authors: Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: commonly rely, bridge visual world, World Action, World Action Models, world action model

Comments: Project Page: [this https URL](https://zhangwenyao1.github.io/ImageWAM/)

Abstract:World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does a world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

115. 【2606.19495】LooseControlVideo: Directorial Video Control using Spatial Blocking

Link: https://arxiv.org/abs/2606.19495

Authors: Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spatial orchestration, generation remains, significant challenge, remains a significant, temporal dynamics

Comments: Project page at [this https URL](https://shariqfarooq123.github.io/LooseControlVideo/)

Abstract:Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.

116. 【2606.19483】LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation

Link: https://arxiv.org/abs/2606.19483

Authors: Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem, Randall Balestriero

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Transformer, Vision Foundation, Foundation Models, semantic segmentation

Abstract:Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap: the student struggles to imitate the teacher's complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement over the baseline. On ImageNet-1K, LEAP achieves +3.84% and +7.75% improvement for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early-stopping for teacher inference during the initial stages of training. Code is available at this https URL

117. 【2606.19460】Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

Link: https://arxiv.org/abs/2606.19460

Authors: Fabio De Sousa Ribeiro, Emma A.M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: billion-parameter scale, generative foundation model, generative foundation, foundation model, chest radiographs

Comments: Project page: [this https URL](https://RadiT-project.github.io)

Abstract:We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

118. 【2606.19451】3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

Link: https://arxiv.org/abs/2606.19451

Authors: Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: decomposes scene-level RGB-D, scene-level RGB-D, RGB-D or voxel, representation learning model, object-centric representation learning

Comments: ICML 2026. Project webpage: [this https URL](https://eubooks3003.github.io/3d-dlp)

Abstract:We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at this https URL.

119. 【2606.19383】3D Scene Graphs: Open Challenges and Future Directions

Link: https://arxiv.org/abs/2606.19383

Authors: Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: combining geometric grounding, combining geometric, geometric grounding, grounding with semantic, semantic and relational

Comments: Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Abstract:3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at this https URL.

120. 【2606.19371】ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

Link: https://arxiv.org/abs/2606.19371

Authors: Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Alzheimer disease, Positron Emission Tomography, elderly population, Magnetic Resonance Imaging, fatal disorder

Abstract:Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
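
The staged, uncertainty-gated acquisition described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the ProMUSE implementation: it uses the standard subjective-logic mapping from non-negative evidence to belief masses and uncertainty (Dirichlet parameters alpha_k = e_k + 1), the reduced Dempster combination rule commonly used in evidential multi-view fusion, a fixed threshold instead of a learned one, and hypothetical evidence values in place of network outputs.

```python
def opinion_from_evidence(evidence):
    """Map non-negative class evidence to (beliefs, uncertainty).

    With alpha_k = e_k + 1 and S = sum(alpha), belief b_k = e_k / S and
    uncertainty u = K / S, so sum(b) + u == 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    return beliefs, num_classes / strength


def combine(op1, op2):
    """Reduced Dempster rule for two opinions over the same classes."""
    (b1, u1), (b2, u2) = op1, op2
    n = len(b1)
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    beliefs = [scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) for k in range(n)]
    return beliefs, scale * u1 * u2


def staged_predict(clinical_evidence, get_imaging_evidence, threshold=0.4):
    """Use cheap clinical evidence first; request imaging only if uncertain.

    threshold is a fixed constant here (an assumption for the sketch).
    Returns (predicted_class, uncertainty, used_imaging).
    """
    beliefs, uncertainty = opinion_from_evidence(clinical_evidence)
    used_imaging = False
    if uncertainty > threshold:
        imaging = opinion_from_evidence(get_imaging_evidence())
        beliefs, uncertainty = combine((beliefs, uncertainty), imaging)
        used_imaging = True
    return beliefs.index(max(beliefs)), uncertainty, used_imaging


# Hypothetical two-class case: weak clinical evidence, strong imaging evidence.
result = staged_predict([1.0, 1.0], lambda: [8.0, 0.0])
print(result)
```

When the clinical evidence alone is already strong, the imaging callback is never invoked, which is where the savings in MRI/PET usage would come from in a pipeline of this shape.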

121. 【2606.17054】Human Universal Grasping

Link: https://arxiv.org/abs/2606.17054

Authors: Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: level of generality, human grasps, grasp objects effortlessly, objects effortlessly, human

Comments: 28 pages, 20 figures, 7 tables

Click to view abstract

Abstract:Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on the project website.

122. 【2606.19767】Contour-Constrained Deformable Registration with Parameter Characterization for Head and Neck Surgical Guidance

Link: https://arxiv.org/abs/2606.19767

Authors: Qingyun Yang, Jon S. Heiselman, Ayberk Acar, Morgan J. Ringel, Michael I. Miga, Matthieu Chabanas, Michael C. Topf, Jie Ying Wu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: neck squamous cell, squamous cell carcinoma, highest recurrence rates, annual new cases, cases globally

Comments:

Click to view abstract

Abstract:With 890,000 annual new cases globally, head and neck squamous cell carcinoma has one of the highest recurrence rates among solid malignancies. Although frozen section analysis is the standard of care for intraoperative margin assessment, accurately relocating detected positive margins on the resection bed remains challenging due to imprecise alignment between resected specimens and their resection bed, compounded by post-resection mucosal tissue shrinkage. We present a biomechanics-driven deformable registration framework that corrects post-resection tissue deformation to provide intraoperative guidance. Our approach registers 3D specimen meshes to intraoperative resection bed point clouds using a deformable registration approach based on regularized Kelvinlet basis functions. The registration matches surface point clouds, fiducial landmarks, and boundary contour constraints that directly penalize perpendicular distance-to-agreement between specimen and resection bed boundaries. Across nine specimens from skin, buccal mucosa, and tongue sites, the overall mean target registration error was $11.11 \pm 4.07$ mm using rigid registration, which decreased to $8.20 \pm 2.68$ mm (26.19\% reduction) using deformable registration without contour constraint. The proposed contour-constrained deformable registration further reduced the error to $5.62 \pm 2.28$ mm, a 49.41\% reduction relative to rigid registration. We observed the largest reduction in the most clinically challenging tongue specimens. We also performed a systematic two-stage parameter search to characterize the relative importance of surface alignment, fiducial correspondences, contour constraint, and strain energy regularization. This search revealed that contour weighting dominates registration accuracy for tissue types with large lateral deformation, while the algorithm operates over a broad range of parameter combinations.
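
The two percentage reductions quoted in the abstract follow directly from the reported mean target registration errors. A short check, where the helper name `percent_reduction` is hypothetical:

```python
def percent_reduction(baseline_mm, improved_mm):
    """Relative error reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_mm - improved_mm) / baseline_mm


rigid = 11.11              # mean TRE with rigid registration (mm)
deformable = 8.20          # deformable registration, no contour constraint (mm)
contour_constrained = 5.62 # contour-constrained deformable registration (mm)

print(round(percent_reduction(rigid, deformable), 2))           # 26.19
print(round(percent_reduction(rigid, contour_constrained), 2))  # 49.41
```

Both values match the abstract, so each reduction is measured against the rigid baseline.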

123. 【2606.19574】FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

Link: https://arxiv.org/abs/2606.19574

Authors: Chengwei Zhou, Ovishake Sen, Xuming Chen, Rishith Paramasivam, Shaahin Angizi, Swarup Bhunia, Baibhab Chatterjee, Gourav Datta

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deploying vision transformers, Deploying vision, transmit high-dimensional image, vision transformers, on-device compute

Comments:

Click to view abstract

Abstract:Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.
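
The idea of a frequency-domain token can be sketched with a plain 2D DCT that keeps only the low-frequency coefficients of a patch. This is a single-scale simplification in pure Python and does not reproduce the paper's multi-scale tokenizer or its LUT-based hardware. The function names, the 8x8 patch size, and the `keep` parameter are assumptions chosen for illustration.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(
            v * math.cos(math.pi * (i + 0.5) * k / n)
            for i, v in enumerate(values)
        )
        out.append(scale * total)
    return out


def dct_2d(patch):
    """Separable 2D DCT-II: transform rows first, then columns."""
    rows = [dct_1d(row) for row in patch]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def frequency_token(patch, keep):
    """Keep the keep x keep lowest-frequency coefficients as one token."""
    coeffs = dct_2d(patch)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


# A flat 8x8 patch: all energy lands in the DC coefficient.
patch = [[1.0] * 8 for _ in range(8)]
token = frequency_token(patch, keep=2)
print(len(token), token[0])
```

Keeping a 2x2 block out of 64 coefficients gives a 16x reduction for this patch. The 128x figure in the abstract comes from the authors' own design and is not derived here.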

124. 【2606.19372】Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

Link: https://arxiv.org/abs/2606.19372

Authors: Jonathan Thomas, Harsh Thaker

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: present Full-Self Diagnostics, recovering latent physiological, Full-Self Diagnostics, latent physiological states, unified mathematical framework

Comments: 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework

Click to view abstract

Abstract:We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
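
For readers unfamiliar with the metric, MARD is the mean absolute relative difference between predicted and reference glucose values, expressed in percent. A minimal sketch, assuming the abstract uses MARD in this usual sense. The function name and the sample readings are hypothetical and are not data from the paper.

```python
def mard_percent(predicted, reference):
    """Mean absolute relative difference, in percent.

    Each error is normalized by the reference (ground-truth) value,
    so reference readings must be non-zero.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical readings in mg/dL.
print(mard_percent([110.0, 90.0], [100.0, 100.0]))
```

Because each error is divided by the reference value, the same absolute error weighs more at low glucose levels than at high ones.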

MSC classes: 35R30, 49N45, 94A17, 68T07

ACM classes: I.2.6; I.2.10; J.3
