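This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

837 papers were updated today, including:

Natural language processing: 131 papers
Information retrieval: 20 papers
Computer vision: 148 papers

Natural Language Processing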

1. 【2606.03990】Neuron Populations Exhibit Divergent Selectivity with Scale

Link: https://arxiv.org/abs/2606.03990

Authors: Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural networks evolve, networks evolve predictably, Rosetta Neurons, neural networks, networks evolve

Comments: Project page and code: [this https URL](https://avdravid.github.io/rosetta-neuron-scaling/)

Abstract:We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
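To make the "growing in absolute number but occupying a shrinking fraction" claim concrete, here is a minimal sketch. It assumes, for simplicity, that the Rosetta Neuron count follows `c * n ** alpha` in the total neuron count `n` with `0 < alpha < 1`. The values of `c`, `alpha`, and the sizes are placeholders for illustration, not numbers from the paper.

```python
def rosetta_count(n_neurons: float, c: float = 1.0, alpha: float = 0.5) -> float:
    """Hypothetical sublinear power law: count = c * n_neurons ** alpha."""
    return c * n_neurons ** alpha


def rosetta_fraction(n_neurons: float, c: float = 1.0, alpha: float = 0.5) -> float:
    """Share of all neurons that are Rosetta Neurons under the same assumption."""
    return rosetta_count(n_neurons, c, alpha) / n_neurons


# Placeholder network sizes (total neuron counts), chosen only for illustration.
sizes = [1e4, 1e6, 1e8]
counts = [rosetta_count(n) for n in sizes]
fractions = [rosetta_fraction(n) for n in sizes]

for n, count, frac in zip(sizes, counts, fractions):
    print(f"n={n:.0e}  count={count:.0f}  fraction={frac:.4%}")
```

Under any exponent below 1 the count rises with `n` while the fraction, which scales as `n ** (alpha - 1)`, falls.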

2. 【2606.03982】Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

Link: https://arxiv.org/abs/2606.03982

Authors: Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

Categories: Computation and Language (cs.CL)

Keywords: symbolic unit scale, require language models, require language, unit scale

Abstract:Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
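The contrast between an exact shared-scale comparison and a shortcut can be sketched as follows, using the abstract's own example of 110 cm versus 1.2 m. The function names and the numeral-only heuristic are illustrative assumptions. They are not the paper's surrogate models.

```python
# Conversion factors to metres for a few length units.
UNIT_TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def exact_larger(a, b):
    """Return the larger (value, unit) pair after converting both to metres."""
    a_metres = a[0] * UNIT_TO_METRES[a[1]]
    b_metres = b[0] * UNIT_TO_METRES[b[1]]
    return a if a_metres > b_metres else b


def numeral_only_larger(a, b):
    """Naive shortcut (hypothetical): ignore the units and compare bare numerals."""
    return a if a[0] > b[0] else b


pair = ((110, "cm"), (1.2, "m"))
print(exact_larger(*pair))         # shared-scale comparison
print(numeral_only_larger(*pair))  # numeral-only shortcut
```

The two functions disagree on this pair, which sits near the comparison boundary where the abstract reports that accuracy degrades.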

3. 【2606.03980】Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Link: https://arxiv.org/abs/2606.03980

Authors: Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: critical feedback signals, LLM post-training, signals for LLM, provide critical feedback, notably in reinforced

Abstract:Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at this https URL.

4. 【2606.03969】Quantifying Faithful Confidence Expression in Large Reasoning Models

Link: https://arxiv.org/abs/2606.03969

Authors: Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Reliable uncertainty communication, persistent failure mode, Reliable uncertainty, trustworthiness of LLMs, failure mode

Comments: Code: [this https URL](https://github.com/yale-nlp/faithful_lrm)

Abstract:Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

5. 【2606.03968】QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Link: https://arxiv.org/abs/2606.03968

Authors: Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: extending reinforcement learning, existing methods optimize, methods optimize rubrics, distribution as fixed, promising route

Abstract:Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.

6. 【2606.03967】AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

Link: https://arxiv.org/abs/2606.03967

Authors: Quentin Fuxa, Dominik Macháček

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: simultaneous speech translation, speech translation system, English to German, simultaneous speech, speech translation

Comments: Accepted to IWSLT 2026

Abstract:We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
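The abstract does not spell out the decision rule, so the sketch below assumes the AlignAtt rule as it is usually described in earlier work: a draft token is emitted only if the source position it attends to most is not among the last `f` positions of the currently available source prefix. The function name, the `attention_rows` layout, and the parameter `f` are illustrative assumptions. In the described system the attention would come from the selected alignment heads.

```python
def alignatt_emit_count(attention_rows, num_source, f):
    """Return how many draft tokens may be emitted under an AlignAtt-style rule.

    attention_rows[i][j] -- attention weight of draft token i on source position j
    num_source           -- number of source positions received so far
    f                    -- size of the "too recent" window at the end of the source
    """
    emitted = 0
    for row in attention_rows:
        # Source position that this draft token attends to most strongly.
        most_attended = max(range(num_source), key=lambda j: row[j])
        # Stop as soon as a token relies on the most recent source positions:
        # more input should arrive before it is committed.
        if most_attended >= num_source - f:
            break
        emitted += 1
    return emitted


rows = [
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.7],  # attends mostly to the last source position
    [0.9, 0.05, 0.05, 0.0],
]
print(alignatt_emit_count(rows, num_source=4, f=1))
```

A larger `f` makes the policy more conservative, trading latency for stability of the emitted prefix.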

7. 【2606.03965】Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Link: https://arxiv.org/abs/2606.03965

Authors: Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, improve final-answer accuracy, Large language, language models improve, models improve final-answer

Abstract:Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at this https URL.

8. 【2606.03957】Efficient ASR Training with Conversations that Never Happened

Link: https://arxiv.org/abs/2606.03957

Authors: Máté Gedeon, Péter Mihajlik

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: ASR for lower-resource, domain-matched multi-speaker training, Conversational ASR, niche domains, domains is limited

Abstract:Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

9. 【2606.03948】A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

Link: https://arxiv.org/abs/2606.03948

Authors: Aziz Sharipov Ortega, Dominik Macháček

Categories: Computation and Language (cs.CL)

Keywords: Speech Translation Shared, Simultaneous Speech Translation, Translation Shared task, implement simultaneous translation, simultaneous translation capability

Comments: IWSLT 2026

Abstract:We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to the IWSLT 2026 Simultaneous Speech Translation Shared Task for Czech to English and English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines both in low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality: support for 25 source and 25 target languages.

10. 【2606.03928】Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Link: https://arxiv.org/abs/2606.03928

Authors: Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: long outputs create, chains of thought, compute bottleneck, extended chains, long outputs

Comments: Codes: [this https URL](https://github.com/terarachang/VaSE)

Abstract:Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than the SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
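The two ingredients named in the abstract, protecting large-magnitude value states and making eviction stochastic, can be sketched as below. This is not the VaSE implementation. The function name, the `protect_frac` and `temperature` parameters, and the use of Gumbel top-k sampling are all assumptions made for illustration.

```python
import math
import random


def select_kept_entries(scores, value_norms, budget, protect_frac=0.1,
                        temperature=1.0, seed=0):
    """Return the sorted indices of KV cache entries to keep.

    scores      -- importance score per cached token (higher = more important)
    value_norms -- magnitude (e.g. L2 norm) of each value state
    budget      -- number of entries that may stay in the cache
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))
    rng = random.Random(seed)

    # 1) Always keep the entries with the largest value-state magnitudes.
    n_protect = min(budget, max(1, int(protect_frac * n)))
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    keep = set(by_norm[:n_protect])

    # 2) Fill the remaining slots stochastically (Gumbel top-k): add Gumbel
    #    noise to the scaled scores and keep the highest perturbed entries.
    def perturbed(i):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        return scores[i] / temperature - math.log(-math.log(u))

    candidates = [i for i in range(n) if i not in keep]
    noisy = {i: perturbed(i) for i in candidates}
    candidates.sort(key=noisy.get, reverse=True)
    keep.update(candidates[: budget - n_protect])
    return sorted(keep)


# Hypothetical cache of 8 entries; entry 2 has an unusually large value norm.
example_scores = [0.1] * 8
example_norms = [1, 1, 50, 1, 1, 1, 1, 1]
print(select_kept_entries(example_scores, example_norms, budget=4))
```

Different seeds keep different low-norm entries, but the large-norm entry always survives, which mirrors the abstract's point that evicting such states is what triggers failure.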

11. 【2606.03924】Knowledge Editing in Masked Diffusion Language Models

Link: https://arxiv.org/abs/2606.03924

Authors: Haewon Park, Yohan Jo

Categories: Computation and Language (cs.CL)

Keywords: correct factual knowledge, Knowledge editing aims, factual knowledge, aims to update, update or correct

Abstract:Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

12. 【2606.03892】Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

Link: https://arxiv.org/abs/2606.03892

Authors: Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: verbose tool-calling patterns, incentivize verbose tool-calling, realistic stateful execution, synthetic training queries, stateful execution environments

Abstract:Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that generates validated multi-turn tool-call trajectories against these servers via dependency-graph-guided conversation simulation grounded in live-sampled server state, so every generated query references entities that actually exist; and (3) a multi-component programmatic reward - graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching bonus - requiring no external judge model. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO using identical reward hyperparameters and ~13K training examples; only learning rate is tuned per model family from a three-point sweep. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that a compact programmatic reward yields consistent gains on multi-step tool orchestration across two model families.

13. 【2606.03889】RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

Link: https://arxiv.org/abs/2606.03889

Authors: Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

Categories: Computation and Language (cs.CL)

Keywords: miss key realism, key realism properties, miss key, key realism, realism properties

Comments: 19 pages, 5 figures, 8 tables

Abstract:Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: this https URL.
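For readers unfamiliar with the Jensen-Shannon divergence quoted in the abstract, here is a minimal sketch of the standard definition for two discrete distributions. The abstract does not state the logarithm base, so base 2 (which bounds the value between 0 and 1) is an assumption, and the example distributions are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical task-category proportions for a source pool and a sampled release.
source = [0.50, 0.30, 0.20]
sampled = [0.45, 0.35, 0.20]
print(js_divergence(source, sampled))
```

Identical distributions give 0 and fully disjoint ones give 1 under base 2, so a small value indicates that sampling changed the category mix only slightly.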

14. 【2606.03871】Visual Instruction Tuning Aligns Modalities through Abstraction

Link: https://arxiv.org/abs/2606.03871

Authors: Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, pre-trained Large Language, Language Model, Large Language, information alongside text

Abstract:Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

15. 【2606.03867】A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

Link: https://arxiv.org/abs/2606.03867

Authors: Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: distilling essential information, plays a critical, critical role, role in distilling, distilling essential

Comments: Accepted by Neural Computing and Applications

Abstract:Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

16. 【2606.03866】Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Link: https://arxiv.org/abs/2606.03866

Authors: Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Scaling recommender systems, Scaling recommender, language models, large language

Comments: 8 pages, 2 figures

Abstract:Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

17. 【2606.03846】Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

Link: https://arxiv.org/abs/2606.03846

Authors: Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng, Yutaka Matsuo, Yusuke Iwasawa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, Large language, demonstrate remarkable performance, demonstrate remarkable, diverse tasks

Comments: Findings of ACL 2026

Abstract:Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.
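The three steps in the abstract (cluster the samples, turn clusters into answer options, read off option probabilities) can be sketched roughly as follows. The paper clusters by semantics and queries the LLM for option probabilities. Here, clustering by normalized string match and a softmax over caller-supplied option logits are stand-in assumptions, and all function names are hypothetical.

```python
import math
from collections import OrderedDict


def cluster_answers(samples):
    """Group sampled answers; stand-in for semantic clustering (exact match only)."""
    clusters = OrderedDict()
    for sample in samples:
        clusters.setdefault(sample.strip().lower(), []).append(sample)
    return list(clusters.values())


def build_mcq(question, clusters):
    """Format one representative per cluster as a lettered answer option."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    options = [f"{letters[i]}. {cluster[0]}" for i, cluster in enumerate(clusters)]
    return question + "\n" + "\n".join(options) + "\nAnswer:"


def option_confidences(option_logits):
    """Softmax over the logits a model assigns to the option letters."""
    peak = max(option_logits)
    exps = [math.exp(x - peak) for x in option_logits]
    total = sum(exps)
    return [e / total for e in exps]


samples = ["Paris", "paris ", "Lyon"]
clusters = cluster_answers(samples)
prompt = build_mcq("What is the capital of France?", clusters)
# Placeholder logits; in practice these would come from the LLM given `prompt`.
confidences = option_confidences([2.0, 0.0])
print(prompt)
print(confidences)
```

The confidence attached to the model's chosen option then serves as the uncertainty estimate for its answer.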

18. 【2606.03825】Dynamic Short Convolutions Improve Transformers

Link: https://arxiv.org/abs/2606.03825

Authors: Oliver Sieberling, Bharat Runwal, Rameswar Panda, Yoon Kim

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: dynamic short convolutions, short convolutions, static short convolutions, dynamic convolutions, residual connections

Abstract:Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations improves performance on challenging associative recall tasks compared with static convolutional variants. Across language-modeling experiments ranging from 150M to 2B parameters, dynamic convolutions consistently outperform standard Transformers and Transformers augmented with static short convolutions. Fitting scaling laws indicates a 1.33× compute advantage over compute-matched Transformers when dynamic convolutions are applied to the key, query, and value vectors, and a 1.60× advantage when adding dynamic convolutions after every linear layer. Dynamic convolutions also offer improvements on linear RNNs (Mamba-2/Gated DeltaNet) and mixture-of-experts architectures. We make these gains practical with custom Triton kernels that enable efficient training with a manageable end-to-end slowdown. These results suggest that dynamic short convolutions are a scalable, hardware-efficient, and expressive primitive for advancing Transformer-based language models.
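The difference between static and input-dependent filters can be shown on a one-channel toy sequence. This is only a sketch: the abstract does not say how the filters are generated, so the affine form `a_k * x[t] + b_k` and the `proj` parameter are hypothetical choices for illustration.

```python
def static_short_conv(x, w):
    """Causal short convolution with a fixed filter: y[t] = sum_k w[k] * x[t - k]."""
    return [
        sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        for t in range(len(x))
    ]


def dynamic_short_conv(x, proj):
    """Causal short convolution whose filter at step t depends on x[t].

    proj[k] = (a_k, b_k) are hypothetical per-tap parameters; the filter used
    at step t is w_t[k] = a_k * x[t] + b_k.
    """
    y = []
    for t in range(len(x)):
        w_t = [a * x[t] + b for a, b in proj]
        y.append(sum(w_t[k] * x[t - k] for k in range(len(w_t)) if t - k >= 0))
    return y


x = [1.0, 2.0, 3.0]
print(static_short_conv(x, [0.5, 0.5]))
print(dynamic_short_conv(x, [(1.0, 0.0), (0.0, 0.0)]))
```

Setting every `a_k` to zero recovers the static case, so the dynamic variant strictly generalizes it while keeping the same short, local receptive field.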

19. 【2606.03817】Rethinking the Idiomaticity Decomposability Hypothesis: Evidence from Distributional Learning

Link: https://arxiv.org/abs/2606.03817

Authors: Maggie Mi,Golzar Atefi,Atsuki Yamaguchi,Felix Gers,Aline Villavicencio,Nafise Sadat Moosavi

Categories: Computation and Language (cs.CL)

Keywords: constituent meanings contribute, analysed in terms, constituent meanings, syntactic flexibility, decomposability

Comments: ACL 2026 Main - long paper (9 pages + Appendices)

Abstract:Idioms can be analysed in terms of their decomposability, the extent to which constituent meanings contribute to the figurative whole. Decomposability is thought to predict syntactic flexibility. Usage-based accounts instead attribute idiom behaviour to distributional experience, such as speaker familiarity and predictability. We examine these views using contextualised language models as controlled distributional learners. We propose a model-internal measure of decomposability and relate it to human ratings, syntactic flexibility, and predictability while tracking idiom learning during pretraining. Model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility. Pretraining analyses show that stabilisation of idiom representations in models is not explained by frequency alone. Instead, surprisal, decomposability, and frequency all contribute, with decomposability showing the strongest training-dependent effect.

20. 【2606.03810】Consistency Training Can Entrench Misalignment

Link: https://arxiv.org/abs/2606.03810

Authors: David Demitri Africa,Arathi Mani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: produce similar outputs, sampling procedures, produce similar, similar outputs, outputs across related

Comments: Accepted to ICML 2026

Abstract:Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 ``model organisms'': open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.

21. 【2606.03793】Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

Link: https://arxiv.org/abs/2606.03793

Authors: Hashmat Shadab Malik,Muzammal Naseer,Salman Khan

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate visual perception, Multimodal Large Language, Large Language Models, Models integrate visual, Multimodal Large

Comments:

Abstract:Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire multilingual capability through instruction tuning. Gradient-based attacks reveal a transferable multilingual vulnerability: adversarial images optimized in one language continue to induce failure in others, demonstrating strong cross-lingual transferability. Multilingual safety further varies with how effectively a model retrieves or interprets harmful instructions. When harmful intent is issued through text, languages with stronger linguistic grounding more often elicit misuse-enabling responses, while weaker languages produce fewer unsafe outputs. When embedded in the image as typographic content, English scripts are reliably recognised and followed, whereas non-English scripts are rarely parsed by the vision encoder. Lower-resource languages may therefore appear safer, but this is an artefact of comprehension and visual-grounding failures rather than genuine alignment, a phenomenon we term safety-by-failure. In contrast, MLLMs that build multilingual capability throughout their training stages rather than only at instruction tuning, such as Qwen3-VL, exhibit genuine cross-lingual safety, maintaining active refusal across languages rather than masking comprehension failure. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, may produce surface-level understanding that creates illusory safety in low-resource languages; deeper integration across training stages leads to genuine multilingual safety alignment.

22. 【2606.03785】Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Link: https://arxiv.org/abs/2606.03785

Authors: Lisa Bouger,Théo Lasnier,Philippe Looubet Moundi,Yannick Teglia,Djamé Seddah

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, growing security concern, generate adversary-chosen content, attacks in Large

Comments: 22 pages, 28 figures

Abstract:Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, which quantifies the distance between the model changes induced by different training runs. Our results open a new direction for LLM safety: defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.

23. 【2606.03782】Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

Link: https://arxiv.org/abs/2606.03782

Authors: Renhao Pei,Yihong Liu,Sampo Pyysalo,Hinrich Schütze,Shaoxiong Ji

Categories: Computation and Language (cs.CL)

Keywords: Large language, Large language models, incorporating linguistic resources, offer a promising, promising approach

Comments:

Abstract:Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic analysis and grammatical reasoning. We propose a pipeline for automatically generating step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks. We evaluate these traces in three settings: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT), on Xibe and Chintang as test cases. Our results show that linguistic reasoning traces are most effective as inference-time guidance: in ICL, reliable sentence-specific traces substantially improve translation performance across most models, languages, and metrics. In contrast, using the linguistic reasoning traces as training data yields smaller and less consistent gains, as models learn the trace format but often generate erroneous content. These findings suggest that LLMs can leverage grammatical information for low-resource MT when given reliable linguistic analyses, while learning to generate such analyses remains a major bottleneck.

24. 【2606.03780】Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

Link: https://arxiv.org/abs/2606.03780

Authors: Yuetian Lu,Ali Modarressi,Yihong Liu,Hinrich Schütze

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: interventions localize information, localize information flow, dense transformer language, feed-forward modules, transformer language models

Comments: Preprint

Abstract:Causal tracing of factual recall has been studied predominantly in dense transformer language models, where interventions localize information flow to layers or feed-forward modules. Sparse mixture-of-experts (MoE) language models introduce a sharper question: when a factual prediction is mediated by a routed MoE block, which routed expert contributions matter? We formulate expert-aware causal tracing for sparse MoE language models. Using CounterFact facts, we first corrupt the model's factual preference by adding noise to subject-token embeddings, and then test whether clean MoE-block outputs or clean expert-level updates restore the true-vs-foil logit contrast. For Qwen3-30B-A3B-Base, a layer sweep selects and validates layer 44, and expert-level tracing identifies L44E069 as an expert repeatedly selected in the clean run whose held-out patch outperforms other active same-layer expert patches. For Mixtral-8x7B-v0.1, layer-level tracing validates a mid-layer signal, but the signal is not localized to the selected singleton expert; a coalition check instead recovers it with routed multi-expert updates. These results suggest that MoE factual tracing can be made expert-aware, while also showing that expert-level localization is model- and protocol-dependent rather than universal.
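The evaluation loop described here (corrupt, patch, check whether the true-vs-foil logit contrast comes back) can be illustrated with a small scoring helper. The normalized restoration fraction below is a common convention in causal tracing and is assumed here; the abstract does not state the exact metric, and all logits and expert names are made up.

```python
def logit_contrast(logits, true_id, foil_id):
    """True-vs-foil logit contrast for one prompt."""
    return logits[true_id] - logits[foil_id]

def restoration(clean, corrupted, patched):
    """Fraction of the clean-vs-corrupted contrast gap recovered by a patch.
    Assumed normalization: 0 means no recovery, 1 means full recovery."""
    gap = clean - corrupted
    return 0.0 if gap == 0 else (patched - corrupted) / gap

# Hypothetical logits over a 3-token vocabulary: index 0 = true answer, 1 = foil.
clean_logits = [5.0, 1.0, 0.0]
corrupted_logits = [2.0, 2.5, 0.0]  # after adding noise to subject-token embeddings
patched_logits_by_expert = {        # corrupted run with one clean expert update restored
    "expert_a": [4.5, 1.5, 0.0],
    "expert_b": [2.2, 2.4, 0.0],
}

clean = logit_contrast(clean_logits, 0, 1)
corrupted = logit_contrast(corrupted_logits, 0, 1)
scores = {
    name: restoration(clean, corrupted, logit_contrast(logits, 0, 1))
    for name, logits in patched_logits_by_expert.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking experts by this score is one way to ask which routed expert contributions matter. A coalition check would patch several experts at once and score the joint patch the same way.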

25. 【2606.03773】KletterMix: Climbing Toward High-Quality German Pretraining Data

Link: https://arxiv.org/abs/2606.03773

Authors: Maurice Kraus,Ruben Härle,Sebastian Sztwiertnia,Abbas Goher Khan,Mehdi Ali,Michael Fromm,Kristian Kersting

Categories: Computation and Language (cs.CL)

Keywords: German-language resources remain, weakly documented, high-quality German corpus, central ingredient, resources remain

Comments:

Abstract:High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.

26. 【2606.03768】HybridThinker: Efficient Chain-of-Thought Reasoning via Compressed Memory and Transient Thought Steps

Link: https://arxiv.org/abs/2606.03768

Authors: Xin Liu,Runsong Zhao,Xinyu Liu,Junhao Ruan,Pengcheng Huang,Shichao Dong,Chunyang Xiao,Chenglong Wang,Changliang Li,Jingbo Zhu,Tong Xiao

Categories: Computation and Language (cs.CL)

Keywords: traces improve LLM, improve LLM reasoning, incur substantial computational, improve LLM, thought steps

Comments: 23 pages, 9 figures

Abstract:Extended chain-of-thought (CoT) traces improve LLM reasoning but incur substantial computational and memory costs. While existing CoT compression methods mitigate this by condensing thought steps into compact representations via memory tokens and retaining only these representations at inference time, the loss of fine-grained information makes subsequent steps more error-prone. To alleviate this, we propose HybridThinker, which, in addition to preserving these representations, temporarily retains thought steps to provide fine-grained details. However, we observe that naively keeping thought steps accessible to subsequent steps during training lets the model bypass memory tokens by retrieving information directly from these steps, leaving the model's ability to compress and retrieve information through memory tokens insufficiently trained. We therefore introduce a hybrid training scheme, in which only some thought steps are directly accessible through attention to subsequent steps, while the other thought steps are masked, forcing the model to use memory tokens for compression and retrieval. Across 4 reasoning benchmarks, HybridThinker matches the uncompressed baseline, advancing the state of the art in CoT compression by 5.8 points on average accuracy with similar inference time. Ablation studies confirm that both temporary thought-step retention and the hybrid training scheme contribute to these gains.

27. 【2606.03761】Framing Migration News with LLMs: Structured CoT as a Support for Human Interpretation

Link: https://arxiv.org/abs/2606.03761

Authors: David Alonso del Barrio,Jing Wen,Daniel Gatica-Perez

Categories: Computation and Language (cs.CL)

Keywords: academic research groups, resource constraints typical, research groups, Frame analysis, resource constraints

Comments:

Abstract:Frame analysis of migration news is a socially consequential task: media scholars and researchers who study how migration is narrated need tools that are not only accurate, but transparent, auditable, and accessible within the resource constraints typical of academic research groups. Existing LLM-based approaches rely on proprietary APIs and large models that raise concerns about data privacy, reproducibility and equitable access among media researchers. This work studies how a locally deployable open-source LLM can support interpretable frame analysis as an assistive tool. We introduce a Structured Chain-of-Thought (SCoT) prompting approach using Llama3-8B, enabling step-by-step justifications grounded in predefined framing categories. This structured design allows users to audit model outputs and examine alternative interpretations in a task that is inherently subjective. We evaluate our approach on a dataset of migration-related news and show that SCoT improves classification performance over zero-shot and few-shot baselines while remaining feasible on a single GPU. Then, we conduct a human-centered evaluation in which annotators assess the coherence and influence of "the model's reasoning". Results indicate that SCoT explanations are generally perceived as logical (mean score 4.1/5, though with notable variation across texts) and can prompt reflection on initial interpretations, even when disagreement persists. Our findings highlight both the potential and risks of LLM-assisted frame analysis. While structured reasoning can increase the traceability of model outputs and support critical interpretation, it can also influence human judgment in subtle ways. By enabling local deployment and emphasizing human-in-the-loop interaction, this work contributes to discussions on responsible and accessible computational tools for the study of socially impactful media narratives.

28. 【2606.03739】Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

Link: https://arxiv.org/abs/2606.03739

Authors: Justice Owusu Agyemang,Jerry John Kponyo,Kwame Opuni-Boachie Obour Agyekum,Francisca Adoma Acheampong,Kwame Agyeman-Prempeh Agyekum,James Dzisi Gadze

Categories: Computation and Language (cs.CL); Information Theory (cs.IT)

Keywords: LLM pipelines waste, pipelines waste substantial, LLM pipelines, waste substantial token, substantial token budgets

Comments:

Abstract:LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha \tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below threshold, with a fidelity gate halting compression when energy-weighted similarity drops below $\theta$. We prove token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40-60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10-25 percentage points. Context deduplication adds 50-70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88-96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.

29. 【2606.03728】Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Link: https://arxiv.org/abs/2606.03728

Authors: Mohamed Hesham Elganayni,Selim Saleh

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation systems, legal question answering, question answering typically, answering typically retrieve, Retrieval-augmented generation

Comments: 11 pages, 4 tables, 1 figure. Published at ASAIL 2026 (8th Workshop on Automated Semantic Analysis of Information in Legal Text), co-located with ICAIL 2026, Singapore

Abstract:Retrieval-augmented generation systems for legal question answering typically retrieve passages based on semantic similarity and provide them to a language model, which then generates cited answers. Prior work assumes that highly ranked passages are most likely to be usefully cited by the model. Perturbation-based attribution methods, such as C-LIME, have been used exclusively for post-hoc explanation. However, on the AQuAECHR benchmark, semantic similarity does not correlate with passage attribution. Within a retriever's candidate pool, similarity-based ranking performs worse than random selection at surfacing gold citation paragraphs. To address this limitation, a lightweight cross-encoder is trained on continuous perturbation-based attribution scores to re-rank passages prior to generation. This approach is evaluated on the AQuAECHR benchmark, using two language models and five-fold cross-validation. The re-ranker substantially improves citation faithfulness and alignment with gold expert answers. Notably, two re-rankers trained independently on different models converge beyond their raw attribution agreement. This finding indicates that the cross-encoder reduces model-specific noise and produces a shared relevance signal that partially transfers across models, although same-model re-ranking remains more effective. These results demonstrate that perturbation-based attribution provides a practical, model-agnostic training signal for citation-aware retrieval.

30. 【2606.03695】Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

Link: https://arxiv.org/abs/2606.03695

Authors: Clara Haya Suslik,Or Shafran,Mor Geva

Categories: Computation and Language (cs.CL)

Keywords: erase specific knowledge, real-world applications, safety and compliance, increasingly deployed, deployed in real-world

Comments:

Abstract:As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through adversarial prompting or relearning. In this work, we hypothesize this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-n-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50%, limiting it to 35% on Llama compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.

31. 【2606.03693】Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Link: https://arxiv.org/abs/2606.03693

Authors: Pieter Christy Yan Yudhistira,Dzaki Rafif Malik,Novanto Yudistira

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: language largely unexplored, English radiology visual, largely unexplored, Bahasa Indonesia, Indonesian

Comments: accepted to MMFM-BIOMED Workshop @ CVPR 2026

Abstract:Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at this https URL.
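As a rough illustration of the two measurements mentioned above, the sketch below computes a language robustness gap and counts yes/no flips on toy predictions. The abstract does not define the gap precisely, so taking it as the difference in exact-match accuracy is an assumption, and the data and function names are invented.

```python
ID_TO_EN = {"ya": "yes", "tidak": "no"}  # Indonesian yes/no answers mapped to English labels

def accuracy(preds, gold):
    """Exact-match accuracy."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def robustness_gap(preds_en, preds_id, gold):
    """Assumed definition: English accuracy minus Indonesian accuracy on parallel questions."""
    preds_id_norm = [ID_TO_EN.get(p, p) for p in preds_id]
    return accuracy(preds_en, gold) - accuracy(preds_id_norm, gold)

def yes_no_flips(preds_en, preds_id):
    """Count parallel questions where the model's yes/no answer changes with the language."""
    flips = 0
    for en, idn in zip(preds_en, preds_id):
        idn_norm = ID_TO_EN.get(idn)
        if en in ("yes", "no") and idn_norm is not None and en != idn_norm:
            flips += 1
    return flips

gold = ["yes", "no", "yes", "no"]
preds_en = ["yes", "no", "yes", "yes"]
preds_id = ["ya", "ya", "tidak", "ya"]

gap = robustness_gap(preds_en, preds_id, gold)
flips = yes_no_flips(preds_en, preds_id)
```

On this toy data the English accuracy is 0.75 and the Indonesian accuracy is 0.25, so the gap is 0.5, with two answers flipping between languages.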

32. 【2606.03692】SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

Link: https://arxiv.org/abs/2606.03692

Authors: Yuan Xiong,Ziqi Miao,Qian Chen,Lijun Li,Yequan Wang,Shizhu He,Jun Zhao,Kang Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: systematic skill construction, solve complex tasks, flexibly invoke skills, flexibly invoke, solve complex

Comments:

Abstract:Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.

33. 【2606.03650】CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

Link: https://arxiv.org/abs/2606.03650

Authors: Alexander Apartsin,Yehudit Aperstein

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: task-specific labeled data, scores reflect memorization, labeled data exists, standard public benchmarks, ranking language models

Comments: 19 pages, 6 images

Abstract:Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public benchmarks cannot be trusted, their items having likely leaked into pretraining, so scores reflect memorization rather than fitness. We present CoEval, an open-source, reusable framework that closes this gap end to end: from only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels, contamination-free because items are generated anew on each run, and a cross-family judge ensemble ranks candidate models with no human raters. Validated where ground truth exists, CoEval recovers the true model ranking and tracks ground-truth correctness at $\rho=0.86$. The label-free judging needs no human calibration because judge-panel composition (vendor diversity), not size, drives reliability: a small, well-chosen cross-family panel is most reliable, while a single judge can be anti-correlated with ground truth (judge-choice regret 0.35) and the ensemble never is. Generated items show zero verbatim 13-gram overlap with five major public benchmarks; the panel cancels verbosity bias and precludes same-family self-preference. A four-task study produced 7,978 evaluations for USD 5.89. The same declarative pipeline applies to any domain and is cheap enough to re-run on every model release: a label-free, contamination-free leaderboard any team can regenerate for its own application.
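The rank-agreement check mentioned in this abstract can be illustrated with a small Spearman correlation helper applied to a judge ensemble. The scores below are invented, averaging judge scores is an assumed aggregation rule, and the example only shows how a single judge can disagree with ground truth while the panel mean agrees; it does not reproduce CoEval's pipeline.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

# Hypothetical per-model scores from three judges and a ground-truth accuracy per model.
judge_scores = {
    "judge_a": [0.9, 0.6, 0.3],
    "judge_b": [0.8, 0.7, 0.2],
    "judge_c": [0.2, 0.5, 0.9],  # a single judge that disagrees with ground truth
}
ground_truth = [0.85, 0.60, 0.30]

n_models = len(ground_truth)
ensemble = [
    sum(scores[m] for scores in judge_scores.values()) / len(judge_scores)
    for m in range(n_models)
]
rho_ensemble = spearman_rho(ensemble, ground_truth)
rho_single = spearman_rho(judge_scores["judge_c"], ground_truth)
```

Here the ensemble mean preserves the ground-truth ordering while `judge_c` alone reverses it.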

34. 【2606.03648】Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

Link: https://arxiv.org/abs/2606.03648

Authors: Krishnapriya Vishnubhotla,Hillary Dawkins,Isar Nejadgholi,Svetlana Kiritchenko

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Adapting foundation large, foundation large language, Adapting foundation, large language models, foundation large

Comments: 8 pages plus appendices

Abstract:Adapting foundation large language models to a user's task or preferred style through fine-tuning can compromise the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts, and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.

35. 【2606.03628】Building Reliable Long-Form Generation via Hallucination Rejection Sampling

Link: https://arxiv.org/abs/2606.03628

Authors: Lin Li,Georgia Channing,Suhaas M Bhat,Gabriel Davis Jones,Yarin Gal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, achieved remarkable progress, Large language, undermines their reliability, achieved remarkable

Comments: accepted by ICML 2026

Abstract:Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: this https URL.
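The segment-wise rejection loop described above might look roughly like the sketch below. The generator and detector are hypothetical callables standing in for an LLM and a hallucination detector, and the retry budget and stop-on-failure policy are assumptions, since the abstract does not specify them.

```python
def rejection_sample_segments(generate_segment, is_hallucinated, max_segments, max_retries=5):
    """Build a long-form answer one segment at a time.
    A candidate segment is kept only if the detector does not flag it;
    each new candidate is conditioned on the accepted segments only."""
    accepted = []
    for _ in range(max_segments):
        for _ in range(max_retries):
            candidate = generate_segment(accepted)
            if candidate is None:  # generator has nothing more to say
                return accepted
            if not is_hallucinated(accepted, candidate):
                accepted.append(candidate)
                break
        else:
            # Assumed policy: stop rather than keep a flagged segment.
            break
    return accepted

# Toy stand-ins: a scripted generator and a detector that flags a marker string.
scripted = iter(["A is X.", "[unsupported] B.", "B is Y.", "C is Z."])

def toy_generate(context):
    return next(scripted, None)

def toy_detector(context, segment):
    return "[unsupported]" in segment

result = rejection_sample_segments(toy_generate, toy_detector, max_segments=3)
```

Because rejected segments never enter `accepted`, later segments are not conditioned on them, which is the mechanism the abstract credits with limiting hallucination snowballing.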

36. 【2606.03624】Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

Link: https://arxiv.org/abs/2606.03624

Authors: Zhengyi Zhao,Shubo Zhang,Huimin Wang,Zezhong Wang,Yutian Zhao,Yefeng Zheng,Binyang Li,Yulan He,Kam-Fai Wong,Xian Wu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: demonstrated impressive capabilities, competing constraints simultaneously, satisfy individual constraints, balance competing constraints, Large Reasoning Models

Comments: a pre-MIT Press publication version

Abstract:Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining reasoning abilities of large reasoning models.

37. 【2606.03604】Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

Link: https://arxiv.org/abs/2606.03604

Authors: Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Luyao Ye, Huimin Wang, Hanqi Yan, Binyang Li, Kam-Fai Wong, Yulan He

Categories: Computation and Language (cs.CL)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, tend to describe

Comments:

Abstract:When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.

38. 【2606.03603】World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Link: https://arxiv.org/abs/2606.03603

Authors: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: provide complementary capabilities, multimodal large language, large language models, static visual observations, predicting future outcomes

Comments:

Abstract:World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at this https URL.

39. 【2606.03602】CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

Link: https://arxiv.org/abs/2606.03602

Authors: Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: finite sample sizes, observational data remains, data remains challenging, remains challenging due, purely statistical methods

Comments:

Abstract:Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble utilizes a consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which is then utilized to govern a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee the final causal graph is validly acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at this https URL.
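
The first stage, consensus filtering across an ensemble, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not give the voting rule, so the code treats each algorithm's output as a set of directed edges, counts an edge as consensus when the fraction of algorithms proposing it reaches a hypothetical `min_agreement` threshold, and marks the remaining proposed edges as contested (the ones that would go on to arbitration).

```python
from collections import Counter
from typing import Iterable, Sequence, Set, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)


def split_by_consensus(
    edge_sets: Sequence[Iterable[Edge]],
    min_agreement: float = 1.0,  # hypothetical threshold: 1.0 means unanimous
) -> Tuple[Set[Edge], Set[Edge]]:
    """Split proposed edges into consensus edges and contested edges."""
    # Each algorithm votes at most once per edge.
    votes = Counter(edge for edges in edge_sets for edge in set(edges))
    n_algorithms = len(edge_sets)
    consensus = {e for e, c in votes.items() if c / n_algorithms >= min_agreement}
    contested = set(votes) - consensus
    return consensus, contested
```

Under this sketch, edges that no algorithm proposes are implicitly agreed absent, so only the contested set would need the more expensive LLM arbitration step.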

40. 【2606.03576】AutoTail-BSFGM: Class-Balance-Aware Fine-Tuning for Chinese Scholarly Text Classification

Link: https://arxiv.org/abs/2606.03576

Authors: Anling Xiang, Yuwen Yang, Yang Shen

Categories: Computation and Language (cs.CL)

Keywords: semantically adjacent disciplinary, Fast Gradient Method, adjacent disciplinary labels, supports literature organization, Chinese scholarly corpora

Comments: 17 pages, 4 figures, 4 tables. Code and data: [this https URL](https://github.com/thu-nmrc/autotail-bsfgm-scholarly-classification)

Abstract:Scholarly text classification supports literature organization, subject indexing, and research intelligence, but Chinese scholarly corpora often contain imbalanced and semantically adjacent disciplinary labels. We propose AutoTail-BSFGM, a class-balance-aware fine-tuning method that combines an automatically gated tail-prior adjustment, a weak Balanced Softmax auxiliary loss, and Fast Gradient Method adversarial regularization. The method changes only the training objective and procedure; inference uses the same single base-size encoder and linear classifier as the corresponding label-smoothed baseline. We evaluate the method on two CSL-based tasks: an abstract-to-discipline task with 67 labels and a title-to-category task with 13 categories. On the primary abstract task, AutoTail-BSFGM improves validation and lockbox accuracy under both Chinese RoBERTa-WWM and MacBERT-base. With MacBERT-base, validation accuracy increases by 0.83 percentage points and lockbox accuracy by 0.49 points, with a pooled paired McNemar signal on validation (p = 0.023). On the title task, the method improves validation accuracy by 0.70 points and validation balanced accuracy by 2.64 points; lockbox accuracy is approximately neutral while lockbox balanced accuracy improves by 1.22 points. The results support a bounded contribution: AutoTail-BSFGM improves class-balance-sensitive behavior and yields consistent gains for abstract-based scholarly classification, without uniformly improving every metric on every split.

41. 【2606.03544】SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

Link: https://arxiv.org/abs/2606.03544

Authors: Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Self-improving language agents, iteratively refines, Self-improving language, receives feedback, Agent Group Evolution

Comments: 13 pages, 5 figures

Abstract:Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

42. 【2606.03535】Can LLM Rerankers Predict Their Own Ranking Performance?

Link: https://arxiv.org/abs/2606.03535

Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Jingtong Wu, Zengxin Han, Xueqi Cheng

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: effectiveness varies substantially, Retrieval effectiveness varies, substantially across queries, making it important, effectiveness varies

Comments:

Abstract:Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019-2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.

43. 【2606.03504】BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

Link: https://arxiv.org/abs/2606.03504

Authors: Muhammad Ali

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Tibetic language spoken, ISO 639-3, ASR resources, Mozilla Common Voice, Tibetic language

Comments: 5 pages, 4 figures, 4 tables. Code and data available at [this https URL](https://github.com/mohdali-dev/BaltiVoice-ASR)

Abstract:We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538 utterances, down from a measured zero-shot baseline of 182.18% for Whisper-small on Balti. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.
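
A zero-shot WER above 100% may look like a typo, but it follows from how the metric is defined: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, so a hypothesis with many inserted words can exceed 1.0. The sketch below shows the standard computation; the whitespace tokenization is an assumption, and the paper's own text normalization is not described in the abstract.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()  # assumed tokenization: whitespace
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a two-word reference transcribed as five words with only one match needs four edits, giving a WER of 2.0 (200%).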

44. 【2606.03463】DMF: A Deterministic Memory Framework for Conversational AI Agents

Link: https://arxiv.org/abs/2606.03463

Authors: Matteo Stabile, Enrico Zimuel

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: long interaction horizons, scalable and semantically, semantically coherent, coherent across long, DMF

Comments: 21 pages, 3 figures

Abstract:Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.
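
To make the scoring idea concrete, here is a rough sketch of a logistic survival score with interaction-count decay. The abstract states only that deterministic signals are combined through a logistic projection and that $\Omega_{\mathrm{eff}}$ depends on $\Delta n$; the linear combination, the weights, and the exponential decay form below are all assumptions chosen for illustration, not the paper's formulation.

```python
import math
from typing import Sequence


def survival_score(
    content: float,
    cues: float,
    provenance: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical weights
    bias: float = 0.0,                           # hypothetical bias
) -> float:
    """Combine three deterministic signals and squash them into (0, 1)."""
    z = bias + weights[0] * content + weights[1] * cues + weights[2] * provenance
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega: float, delta_n: int, decay_rate: float = 0.05) -> float:
    """Decay a score by the number of newer interactions (assumed exponential form)."""
    if delta_n < 0:
        raise ValueError("delta_n counts newer interactions and cannot be negative")
    return omega * math.exp(-decay_rate * delta_n)
```

Because the decay depends on the count of newer turns rather than on wall-clock time, replaying the same conversation yields the same scores, which is the determinism property the abstract emphasizes.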

45. 【2606.03437】Large Language Models Are Overconfident in Their Own Responses

Link: https://arxiv.org/abs/2606.03437

Authors: Mario Sanz-Guerrero, Manuel Mager, Katharina von der Wense

Categories: Computation and Language (cs.CL)

Keywords: base pre-trained counterparts, large language models, instruction-tuned large language, Prior work, pre-trained counterparts

Comments: Accepted to ACL 2026 Findings

Abstract:Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.

46. 【2606.03412】Lexicons and grammars for language processing: industrial or handcrafted products?

Link: https://arxiv.org/abs/2606.03412

Authors: Eric Laporte

Categories: Computation and Language (cs.CL)

Keywords: language processing increased, recent years, language resources, processing increased progressively, lexicons and grammars

Comments:

Abstract:During the recent years, the use of linguistic data for language processing increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) developed recently. Most processes of construction of lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more specialists of language processing realize that the information content of lexicons and grammars is richer than that of corpora, and hence the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve into two directions: either specialists of language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the process of construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the other depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss about which of the two trends, handcrafting or generating industrially, or a combination of both, can give the best results or is the most realistic.

47. 【2606.03399】Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

Link: https://arxiv.org/abs/2606.03399

Authors: Farhan Sheth, Ziyuan Yang, Yongying Lan, Si Yong Yeo

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: require sending raw, large language models, raw sensitive health, sensitive health information, sending raw sensitive

Comments: 33 pages, 8 figures, 26 tables

Abstract:While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy leakage. A natural approach to mitigate this risk is to encrypt the data before transmission. However, straightforward solutions such as encrypting the entire dataset introduce prohibitive computational, alignment, and communication overheads, rendering large-scale practical deployment infeasible. To preserve privacy while maintaining usability, we present Healthcare Encryption Redaction via Adaptive Linguistic Decomposition (HERALD), a token-level cryptographic redaction framework designed to achieve this balance by encrypting only sensitive tokens while preserving the surrounding context for downstream model utility. HERALD combines medical named-entity recognizer (NER) with part-of-speech (POS) driven policies to select candidate tokens, performs targeted lemmatization to stabilize surface forms, and substitutes each protected token with a deterministic ciphertext wrapped in explicit delimiters. Notably, HERALD is model-agnostic and operates entirely on the client side, ensuring that sensitive content remains encrypted throughout storage, transmission, and processing without requiring changes to downstream models. We evaluated HERALD on both classification and medical question answering (MQA) tasks on public datasets. Across different tasks, experiments illustrate that fully secured baselines suffer significant utility loss, whereas HERALD consistently recovers performance close to plaintext. Overall, HERALD provides a novel utilization pipeline.

48. 【2606.03398】Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Link: https://arxiv.org/abs/2606.03398

Authors: Nishit Singh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Formal languages, effective conduits, conduits to understand, Formal, counter languages learn

Comments: 8 pages, 8 figures

Abstract:Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysis, this paper investigates the causal role of these representations. Linear probes are trained to predict the stack depth at each token from the model's hidden states, and a principal representation direction is extracted from the probe. Ablation of this direction from the model causes sequential accuracy to collapse to near 0%, providing strong empirical evidence that the stack representation is not just learned, but is causally necessary for model performance.
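
One common way to ablate a probe direction is to project it out of each hidden state, leaving the component orthogonal to that direction unchanged. The sketch below shows that operation on plain Python lists; whether the paper uses exactly this projection is an assumption, since the abstract does not specify the ablation procedure.

```python
import math
from typing import List, Sequence


def ablate_direction(hidden: Sequence[float], direction: Sequence[float]) -> List[float]:
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]
```

After ablation the hidden state has zero dot product with the probe direction, so a linear probe along that direction can no longer read stack depth from it.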

49. 【2606.03391】When Model Merging Breaks Routing: Training-Free Calibration for MoE

Link: https://arxiv.org/abs/2606.03391

Authors: Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Model merging, LLMs without retraining, consolidating the capabilities, capabilities of multiple, multiple LLMs

Comments:

Abstract:Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at this https URL.

50. 【2606.03376】P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Link: https://arxiv.org/abs/2606.03376

Authors: Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Vision-Language Models, recently garnered significant, garnered significant research, Large Vision-Language, Direct Preference Optimization

Comments:

Abstract:Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P²-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P²-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P²-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.

51. 【2606.03371】See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

Link: https://arxiv.org/abs/2606.03371

Authors: Honghui Zhang, Chenmeinian Guo, Yichen Yu, Guanyu Liu, Yongming Qin, Chongguo Song, Mengyue Yang, Lei Yu, Tianyu Shi

Categories: Computation and Language (cs.CL)

Keywords: Multimodal retail agents, Multimodal retail, request is made, Intent World Model, Proactive Intent World

Comments: 16 pages, 3 figures, 9 tables. Preprint

Abstract:Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See--Infer--Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned on ground-truth customer state to isolate action selection, PIWM achieves 0.641 macro F1 on 30 held-out target videos, outperforming a zero-shot Qwen2.5-VL-7B baseline and training variants without balanced action supervision; end-to-end video-only selection drops to 0.295, below the 5-class balanced random baseline of 0.414, identifying video-to-state grounding as the dominant deployment-time bottleneck. A preliminary staged real-store pilot (recorded with paid participants performing scripted customer behaviors) reaches 0.579 action macro F1 on 20 fully annotated videos, with 10 additional accessible videos released with index-level labels.

52. 【2606.03363】EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

Link: https://arxiv.org/abs/2606.03363

Authors: Chengxi Liao, Tao Xu, Zulong Chen, Chuanfei Xu, Yiyan Wang, Xinyun Wang, Yanlong Zhang, Xiaojun Chen, Zhibo Yang, Zeyi Wen

Categories: Computation and Language (cs.CL)

Keywords: enables natural language, natural language access, enables natural, advanced its capabilities, natural language

Comments:

Abstract:Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases, and realistic workflows, but largely overlook enterprise scenarios where SQL generation depends on private business knowledge, such as internal metrics, reporting conventions, and organizational rules. We introduce EntSQL, an enterprise-oriented Text-to-SQL benchmark for evaluating long-context grounding over proprietary business documents. EntSQL contains 1,066 aligned Chinese-English semantic examples across five business domains, with most examples requiring domain knowledge beyond the question and schema and involving complex SQL structures. On English inputs, the best evaluated system reaches only 15.9% when long-form documents are provided, highlighting the difficulty of grounding SQL generation in enterprise knowledge.

53. 【2606.03359】Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

Link: https://arxiv.org/abs/2606.03359

Authors: Daniil Krasnoproshin, Maxim Vashkevich

Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Speech emotion recognition, human-computer interaction systems, modern human-computer interaction, Speech emotion, interaction systems

Comments: 6 pages, 5 figures, DSPA 2026

Abstract:Speech emotion recognition is an important component of modern human-computer interaction systems. However, many state-of-the-art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM-SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM-based framework. Evaluated on the RAVDESS dataset under strict speaker-independent partitioning, the proposed model outperforms conventional attention-based LSTM baselines and several previously reported CNN- and hybrid CNN-LSTM architectures in terms of unweighted average recall (UAR). The best-performing variant (ResLSTM-SA-h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large-scale self-supervised alternatives, thereby enabling efficient deployment on edge devices and real-time voice assistants. The source code is available at this https URL.
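
Unweighted average recall (UAR) is the mean of per-class recalls, so each emotion class counts equally regardless of how many utterances it has. The sketch below computes it from label lists; the class names in the usage note are hypothetical, and only classes that appear in the ground truth are averaged.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def unweighted_average_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    total: Dict[Hashable, int] = defaultdict(int)
    correct: Dict[Hashable, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if not total:
        raise ValueError("y_true must not be empty")
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For instance, with four "happy" utterances all predicted correctly and one "sad" utterance predicted as "happy", accuracy is 0.8 but UAR is 0.5, which is why UAR is the stricter metric on imbalanced emotion data.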

54. 【2606.03357】The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

Link: https://arxiv.org/abs/2606.03357

Authors: Nils Schwager, Christoph Hau, Simon Münker, Achim Rettinger

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: researchers assume, assume the outputs, reflect semantic reasoning, semantic reasoning, outputs reflect semantic

Comments: 10 pages, 5 figures, 3 tables

点击查看摘要

Abstract:When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that separates semantic signals from prompt artifacts. By systematically varying personas, instructions, items, and option symbols, we find that artifactual variance frequently overpowers the semantic signal. In these cases, models predominantly reflect prompt compliance rather than simulated psychological traits. While these findings limit SLM utility in psychometrics, our framework provides a diagnostic tool to identify destructive artifacts and isolate semantic understanding for future frontier-model research.

55. 【2606.03345】Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

Link: https://arxiv.org/abs/2606.03345

Authors: Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: problem for understanding, perceived affectively, perception experiences, Perception Topics, Perception

Comments: 8 pages

Abstract:We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective factual and a subjective affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97** compared to **0.37** from the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94** compared to **0.77**, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.

56. 【2606.03334】Lingo_Research_Group at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection

Link: https://arxiv.org/abs/2606.03334

Authors: Pritam Kadasi, Anuj Tiwari, Mayank Singh

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Text Classification Challenge, Multilingual Text Classification, binary polarization detection, Multilingual Text, polarization manifestation identification

Comments: Accepted at the SemEval Workshop, ACL 2026

Abstract:This paper presents our submission to SemEval-2026 Task 9: Multilingual Text Classification Challenge - Polarization Detection, covering all three subtasks: (1) binary polarization detection, (2) polarization type classification, and (3) polarization manifestation identification. We systematically study short prompts, comparing twelve designs that differ in terminology clarity, level of definitional detail, reasoning guidance, and use of in-context examples. Experiments use aya-101 and Gemma3-27B; the latter was chosen for the final submission based on its performance during development. On the official test set, averaged over 22 languages, our system achieves macro F1-scores of 0.762, 0.587, and 0.444 and accuracies of 0.819, 0.678, and 0.498 on Subtasks 1, 2, and 3, respectively. Cross-task and cross-lingual analysis shows that prompt-based approaches detect coarse-grained polarization effectively but struggle increasingly with fine-grained and multi-label sociolinguistic classification.

57. 【2606.03331】Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Link: https://arxiv.org/abs/2606.03331

Authors: Atm Mizanur Rahman (University of Illinois Urbana-Champaign), Md Arid Hasan (University of Toronto), Syed Ishtiaque Ahmed (University of Toronto), Sharifa Sultana (University of Illinois Urbana-Champaign)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Consumer device repair, Consumer device, large language models, repair, important but underexplored

Abstract:Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

58. 【2606.03327】CAPER: Clause-Aligned Process Supervision for Text-to-SQL

Link: https://arxiv.org/abs/2606.03327

Authors: Lujie Ban, Jiasheng Shi, Jinyang Li, Xiaolin Han, Tsz Nam Chan, Chenhao Ma

Categories: Databases (cs.DB); Computation and Language (cs.CL)

Keywords: decision caused success, query-level execution correctness, SQL decision caused, intermediate SQL decision, systems are typically

Abstract:Text-to-SQL systems are typically evaluated by query-level execution correctness, but this terminal signal provides little guidance about which intermediate SQL decision caused success or failure. Token-level dense supervision is also ill-suited: SQL tokens do not align with complete semantic decisions, can penalize execution-equivalent queries, and are difficult to label reliably at scale. We therefore propose CAPER, which automatically derives clause-level supervision via counterfactual intervention on the SQL abstract syntax tree, enabling root-cause error localization for reward modeling; the resulting data is used to train CAPER-9B, a lightweight Clause-PRM that provides clause-boundary feedback for policy optimization and candidate verification. Experiments on BIRD and Spider show that clause-aligned supervision not only improves execution accuracy, achieving up to a 15.3% relative EX improvement over GPT-5.4, but also strengthens failure-localization capability, reaching 84.53% accuracy and 90.60% MRR on held-out failures. Our project page is at this https URL.
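
The abstract reports failure localization in terms of accuracy and MRR. As a rough illustration of how those two numbers relate, the sketch below computes mean reciprocal rank under its standard definition, assuming a localizer that returns SQL clauses ranked from most to least suspicious. The function name, the clause rankings, and the gold labels are hypothetical; the abstract does not specify how CAPER produces or scores its rankings.

```python
def mean_reciprocal_rank(ranked_lists, gold_items):
    """Average of 1/rank of the gold item; a missing gold item contributes 0."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_items):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)  # ranks are 1-based
    return total / len(gold_items)


# Hypothetical rankings of suspicious clauses for three failed queries.
rankings = [
    ["WHERE", "JOIN", "SELECT"],      # gold clause ranked 1st
    ["SELECT", "GROUP BY", "WHERE"],  # gold clause ranked 2nd
    ["JOIN", "WHERE"],                # gold clause not ranked at all
]
gold = ["WHERE", "GROUP BY", "HAVING"]

mrr = mean_reciprocal_rank(rankings, gold)
top1 = sum(r[0] == g for r, g in zip(rankings, gold)) / len(gold)
print(f"MRR={mrr:.3f} top-1 accuracy={top1:.3f}")
```

MRR gives partial credit when the faulty clause is ranked near the top but not first, so it is never lower than top-1 accuracy on the same rankings.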

59. 【2606.03318】Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

Link: https://arxiv.org/abs/2606.03318

Authors: Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, existing evaluation benchmarks, great advances, advances in tool-use

Abstract:Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated, idealized user assumptions and lack experience-oriented evaluation. These limitations fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations on 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at this https URL.

60. 【2606.03304】From Script to Semantics: Prompting Strategies for African NLI

Link: https://arxiv.org/abs/2606.03304

Authors: Anuj Tiwari, Terry Oko-odion, Hannah Nwokocha

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: African languages remains, Large language models, languages remains underexplored, Natural Language Inference, low-resource African languages

Comments: Accepted at the RAIL Workshop, LREC 2026

Abstract:Large language models (LLMs) are increasingly evaluated in multilingual settings, yet their inference behavior in low-resource African languages remains underexplored, especially under pure prompting without fine-tuning. We present a systematic study of prompting strategies for Natural Language Inference (NLI) in Swahili, Yoruba, and Hausa using the AfriXNLI benchmark. We evaluate five prompting strategies, namely Baseline (zero-shot), Script-Aware, Language Specific, Contrastive, and Native-Label Self-Translation (NL-STP), across two mid-sized open-weight models (Llama3.2-3B and Gemma3-4B). To isolate the effect of prompt design, we exclude few-shot examples and Chain-of-Thought reasoning from our study. Class-wise performance differs significantly across strategies, with pronounced neutral-class collapse and high prediction skew in some configurations. Contrastive prompting is the most reliable strategy, improving steadily across languages and models while yielding better-balanced class behavior and overall accuracy gains. Notably, well-constructed prompts are sufficient to beat more powerful baselines that are provided with few-shot and Chain-of-Thought prompts. We find that prompt formulation is essential to multilingual NLI in low-resource languages and that language-aware decision structuring can meaningfully enhance robustness in resource-constrained settings.

61. 【2606.03301】SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

Link: https://arxiv.org/abs/2606.03301

Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-form video benchmark, full-length TV series, long-form video, reasoning, video reasoning benchmarks

Abstract:We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.

62. 【2606.03291】Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

Link: https://arxiv.org/abs/2606.03291

Authors: Chaoyi Xiang, Olga Ohrimenko, Benjamin I. P. Rubinstein, Lea Frermann

Categories: Computation and Language (cs.CL)

Keywords: motivating unlearning methods, Large language models, memorize sensitive facts, remove targeted knowledge, Large language

Comments: Accepted at ICML 2026

Abstract:Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual unlearning by extending the TOFU benchmark to five languages, and fine-tune, unlearn, and query our models with different permutations of languages. We find that unlearning transfer, the ability of an unlearned model to "forget" facts in languages other than the unlearning language, is highly variable: e.g., it is strongest between languages sharing scripts and families, and we show that the unlearning language predicts which query languages are most likely to yield the strongest transfer. Layer-wise analysis reveals that unlearning leaves the shared cross-lingual latent space largely intact in early layers, instead operating primarily in later decoding layers. This suggests that unlearning does not truly erase knowledge, but rather induces superficial suppression. Exploiting this structure, a single inference-time steering direction reverses much of this suppression across languages, recovering 50% (Qwen) and 90% (Gemma) of the unlearned knowledge.

63. 【2606.03284】SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

Link: https://arxiv.org/abs/2606.03284

Authors: Peerawat Chomphooyod, Jian Gang Ngui, Yosephine Susanto, Attapol T. Rutherford, Alham Fikri Aji, Sarana Nutanong, Can Udomcharoenchaikit, Peerat Limkonchotiwat

Categories: Computation and Language (cs.CL)

Keywords: Frontier LLMs perform, Southeast Asia, remain poorly tested, Western contexts, Frontier LLMs

Abstract:Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.

64. 【2606.03273】VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Link: https://arxiv.org/abs/2606.03273

Authors: Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: connecting fine-grained clues, complex visual queries, inspecting image regions, requires multimodal large, repeatedly inspecting image

Abstract:Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

65. 【2606.03259】Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent

Link: https://arxiv.org/abs/2606.03259

Authors: Raphael Merx, Ekaterina Vylomova, Trevor Cohn

Categories: Computation and Language (cs.CL)

Keywords: depending on audience, communicative intent, source text demands, Translation quality depends, translations depending

Abstract:Translation quality depends on purpose: the same source text demands different translations depending on audience, tone, and communicative intent. Yet MT models and metrics treat translation as a fixed mapping from source to target. LLMs enable users to explicitly specify purpose alongside source text, yet this capability has not been evaluated at scale. We introduce a systematic evaluation of purpose-driven MT across 50 languages, 5 model sizes and 8 text domains. We find that (1) explicit instructions substantially improve translation adaptedness, with larger gains on informal domains (conversation, social media), for larger model sizes and for higher-resource languages; (2) instructions outperform semantically-matched few-shot examples and paragraph-level context; (3) traditional MT metrics fail to capture adaptation quality, often penalizing adapted translations; (4) when curated instructions are unavailable, models can self-generate them from surrounding document context, closing up to 80% of the adaptedness gap to curated instructions. Our results establish that purpose-adapted MT is a viable and measurable capability of LLMs, while highlighting the need for purpose-aware metrics.

66. 【2606.03250】The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP

Link: https://arxiv.org/abs/2606.03250

Authors: Henry He, Johann Frei, Raphael Schmitt

Categories: Computation and Language (cs.CL)

Keywords: Digital healthcare generates, Subjects Tuned BERT, healthcare generates vast, generates vast amounts, Digital healthcare

Comments: Under revision at BMC Medical Informatics and Decision Making

Abstract:Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT consistently outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on more commonly written medical texts. All models are publicly released to support future research and applications in German medical NLP.

67. 【2606.03247】Structures Facilitate Retrieve, Rerank, and Generate

Link: https://arxiv.org/abs/2606.03247

Authors: Yeqin Zhang, Haomin Fu, Xujie Zhang, Cam-Tu Nguyen

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Document-grounded dialogue systems, domain-specific user questions, answer domain-specific user, Document-grounded dialogue, dialogue systems

Abstract:Document-grounded dialogue systems (DGDS) utilize knowledge from external documents to answer domain-specific user questions. Existing solutions typically divide documents into independent passages for retrieval and response generation. This approach, however, neither makes good use of structural information within documents nor provides enough (document) context for knowledge selection and responses. This paper proposes SF-Re2G to address such issues systematically. Firstly, we seek to improve a passage representation by contrasting it with others of the same section, thus improving the retrieval performance. Secondly, a structure-enhanced reranker is built, leveraging the fact that multiple grounding passages of one dialog turn tend to be in the same neighborhood. Specifically, candidates from the retrieval are grouped into subgraphs according to the document structure. The reranker will rescore the candidate integrating its group information. Finally, the chosen passages are used for responses, taking into account the subgraph context for better generation. Experimental results on two DGDS datasets validate our method for both Chinese and English.

68. 【2606.03244】When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation

Link: https://arxiv.org/abs/2606.03244

Authors: Suhwan Hwang

Categories: Computation and Language (cs.CL)

Keywords: common intuition, final pooled embedding, post-encoder adapter attaches, lightweight post-encoder adapter, difficulty

Comments: 13 pages, 3 figures, 2 tables

Abstract:A common intuition is that sentence embeddings should adapt to the difficulty of the input. We test this intuition in a controlled, multi-seed setting: a lightweight post-encoder adapter attaches to a frozen Qwen3-Embedding-0.6B encoder, accessing only its final pooled embedding, and is evaluated on four paraphrase and semantic-similarity tasks (PAWS, MRPC, QQP, STS-B). The naive form of the idea fails: surface-based per-sentence complexity is nearly uncorrelated with frozen-baseline error (Pearson approximately 0.05) and provides no advantage over constant or shuffled controls, while degrading a saturated baseline. Even when the target is aligned to a non-circular pair-difficulty signal, the per-sentence gate still cannot reliably capture difficulty because difficulty is primarily a property of the pair, not the individual sentence. In contrast, a small pair-level residual gated by a held-out cross-encoder difficulty signal yields consistent gains on the larger and graded tasks, including +0.022 Spearman on STS-B and +0.037 on QQP, while remaining anchored to the frozen baseline across all seeds. Because this useful form operates on sentence pairs rather than individual sentences, the resulting model is best understood as a lightweight re-ranker over cached frozen embeddings, not a replacement single-vector embedding; we make no state-of-the-art claim. Our contribution is a controlled account of when difficulty-aware adaptation helps and when it fails, together with a pre-training diagnostic that predicts the available headroom.

69. 【2606.03241】Benchmarking Speech-to-Speech Translation Models

Link: https://arxiv.org/abs/2606.03241

Authors: Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: preventing direct comparisons, studies report non-overlapping, report non-overlapping metric, offline evaluation lacks, advanced rapidly

Comments: Paper under submission

Abstract:Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X$\to$EN and EN$\to$X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's $\rho0.80$) while cutting evaluation time by $\approx 2.5\times$. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment ($\rho \geq 0.90$). We release COMPASS as a foundation for domain-aware S2ST evaluation.

70. 【2606.03239】ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

Link: https://arxiv.org/abs/2606.03239

Authors: Zheng Liu, Longxiang Zhang, Xintong Wang, Zhiang Xu, Shaoxiong Zhan, Xin Shan, Wen Huang, Tao Dai, Shu-Tao Xia, Chengfu Huo, Liang Ding

Categories: Computation and Language (cs.CL)

Keywords: LLM-based search agents, LLM-based search, search agents, agents are trained, trained predominantly

Abstract:LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
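
The zero-gradient problem described above can be seen in a few lines. The sketch below assumes GRPO-style group normalization, where each trajectory's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the `eps` constant, and the process scores are hypothetical and only illustrate the idea of adding a process-level term to the base reward; they are not ARBOR's actual rubric scoring.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]


# Outcome-homogeneous group: every sampled trajectory is correct.
outcome = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(outcome))  # every advantage is zero, so no gradient

# Hypothetical process scores added to the base reward (assumption).
process = [0.2, 0.0, 0.1, 0.3]
combined = [o + p for o, p in zip(outcome, process)]
print(group_advantages(combined))  # advantages now differ within the group
```

With identical outcome rewards the numerator is zero for every trajectory, so the policy receives no learning signal from that group; any within-group variation in an added process score restores a nonzero advantage.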

71. 【2606.03237】Solipsistic Superintelligence is Unlikely to be Cooperative

Link: https://arxiv.org/abs/2606.03237

Authors: Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: capability to coexistence, central challenge, challenge is shifting, shifting from capability, Abstract

Comments: 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026

Abstract:AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

72. 【2606.03220】WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Link: https://arxiv.org/abs/2606.03220

Authors: Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: MLLM-generated web artifacts, web artifacts assess, artifacts assess interaction, Existing benchmarks, Interaction Contract Graphs

Abstract:Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

73. 【2606.03219】Sample-Size Scaling of the African Languages NLI Evaluation

Link: https://arxiv.org/abs/2606.03219

Authors: Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Nwokocha

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: reliably enhances downstream, enhances downstream performance, annotation data reliably, data reliably enhances, unclear if augmenting

Comments: Accepted at the AfricaNLP Workshop, EACL 2026

Abstract:African languages have very little labelled data, and it is unclear whether increasing the quantity of annotated data reliably improves downstream performance. We present a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, are tested on sample sizes between 50 and 500 labeled examples, with results averaged across random subsampling runs. Contrary to the usual assumption that performance increases monotonically with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, motivating language-sensitive dataset creation and stronger multilingual modelling strategies.

74. 【2606.03198】AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Link: https://arxiv.org/abs/2606.03198

Authors: Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluation increasingly delegates, increasingly delegates scoring, large language models, quantitatively characterized, increasingly delegates

Comments: 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

Abstract:Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors -- CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type -- and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74 to 78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

75. 【2606.03197】MemTrain: Self-Supervised Context Memory Training

Link: https://arxiv.org/abs/2606.03197

Authors: Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

Categories: Computation and Language (cs.CL)

Keywords: long-horizon LLM agents, utilize information accumulated, preserve and utilize, accumulated across extended, long-horizon LLM

Comments:

Abstract:Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.

76. 【2606.03189】SenseJudge: Human-Centric Preference-Driven Judgment Framework

Link: https://arxiv.org/abs/2606.03189

Authors: Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, increasingly accepted paradigm, assessing model responses, Language Models

Comments: ACL 2026 Findings

Abstract:Using Large Language Models (LLMs) as judges in scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences, and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We applied the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. We conducted extensive experiments, and the results demonstrate that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and achieves model ranking that aligns with real human sense. Additionally, we conducted analyses on position bias and consistency, alongside ablation studies, which affirmed the robustness of SenseJudge.

77. 【2606.03180】GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

Link: https://arxiv.org/abs/2606.03180

Authors: Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: pairs naturally produced, leveraging image-report pairs, image-report pairs naturally, Vision-language models, clinical workflows

Comments:

Abstract:Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.

78. 【2606.03179】HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift

Link: https://arxiv.org/abs/2606.03179

Authors: Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao, Bo-Kai Ruan, Kwan-Yeung Lin, Hong-Han Shuai, Meng-Fen Chiang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, maintain temporal validity, Knowledge Transfer Failure

Comments: Accepted to Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract:Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.
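
For readers unfamiliar with SimHash, which the abstract above names as the basis of its fast conflict-resolution stage, the following is a minimal generic sketch of the hashing technique itself. It is not HyperPatch's implementation: the token-list input, the SHA-256 token hash, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration, and the abstract does not say which features are hashed.

```python
import hashlib


def simhash(tokens, bits=64):
    """Generic SimHash fingerprint of a token list.

    Assumption: `bits` is a multiple of 8 and at most 256, because each
    token hash is taken from the leading bytes of a SHA-256 digest.
    """
    totals = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            totals[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical example: two versions of the same n-ary fact, flattened to tokens.
old_fact = ["alice", "works_at", "acme", "as", "engineer", "since", "2020"]
new_fact = ["alice", "works_at", "acme", "as", "manager", "since", "2024"]
print(hamming_distance(simhash(old_fact), simhash(new_fact)))
```

Fingerprints of inputs that share most tokens tend to differ in few bits, so a small Hamming distance can flag a likely conflict between an incoming edit and a stored fact without a full comparison.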

79. 【2606.03165】Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

Link: https://arxiv.org/abs/2606.03165

Authors: Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: digital chat assistants, human preference learning, preference learning, digital chat, chat assistants

Comments: 16 pages, 2 figures, 10 tables

Abstract:The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

80. 【2606.03156】A cross-domain tropical species dataset with Chinese vernacular names and CITES source links

Link: https://arxiv.org/abs/2606.03156

Authors: Jeff Wang

Categories: Computation and Language (cs.CL)

Keywords: active tropical species, regulatory life cycle, kingdom-organised biodiversity infrastructures, tropical species, active tropical

Comments: 25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper [arXiv:2606.00994](https://arxiv.org/abs/2606.00994). Dataset deposited at Zenodo (doi: [https://doi.org/10.5281/zenodo.20377811](https://doi.org/10.5281/zenodo.20377811)); canonical preprint-of-record at Zenodo (doi: [https://doi.org/10.5281/zenodo.20424981](https://doi.org/10.5281/zenodo.20424981))

Abstract:We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains -- tropical_plants, tropical_aquatic, and tropical_pets -- that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage -- the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial -- reaches 99.50 percent (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (https://doi.org/10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset's current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.
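
The coverage figure quoted in the abstract above can be checked directly from the two counts it gives. This is only a worked arithmetic check; the variable names are illustrative and not taken from the dataset.

```python
# Counts quoted in the abstract.
total_taxa = 410_499
taxa_with_chinese_name = 408_456

coverage_percent = 100 * taxa_with_chinese_name / total_taxa
taxa_without_name = total_taxa - taxa_with_chinese_name

print(f"coverage: {coverage_percent:.2f}%")                 # coverage: 99.50%
print(f"taxa without a Chinese name: {taxa_without_name}")  # 2043
```

As the abstract notes, this ratio measures completeness only: it says how many taxa carry a Chinese name, not whether those names are accurate.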

81. 【2606.03143】FederatedSkill: Federated Learning for Agentic Skill Evolution

Link: https://arxiv.org/abs/2606.03143

Authors: Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella, Gaowen Liu, Shiyu Chang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Modern LLM agents, Modern LLM, LLM agents increasingly, handle complex tasks, agents increasingly rely

Comments:

Abstract:Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client-specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self-evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.

82. 【2606.03136】PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations

Link: https://arxiv.org/abs/2606.03136

Authors: Muberra Ozmen, Subhabrata Majumdar

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Multi-turn jailbreak attacks, large language models, Multi-turn jailbreak, language models, reveal a mismatch

Comments:

Abstract:Multi-turn jailbreak attacks on large language models (LLMs) reveal a mismatch in current guardrails: they operate on individual turns, while attacks unfold as trajectories across conversations. We propose a shift from content to dynamics, modeling conversations as paths in representation space and asking whether adversarial intent is encoded early in their geometry. We introduce PsychoPass, a framework that extracts geometric features from conversation trajectories in embedding space to predict a potential attack before harmful content is produced. These features achieve near-perfect performance in naïve classifiers, which is largely explained by the inclusion of number of turns as a feature. After removing this confound, a smaller but consistent geometric signal remains, with classification performance that does not depend meaningfully on encoder choice. Crucially, this signal appears early in the conversation: attack outcomes remain above chance from short prefixes alone, more reliably than baseline guardrails. A supporting theoretical analysis explains these findings via a decomposition of length and shape, a detection bound based on prefix length, and encoder invariance. Together, these results show that adversarial conversations leave an early, representation-robust geometric fingerprint suitable for online monitoring.
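
To make "geometric features from conversation trajectories" concrete, here is a small sketch that treats a conversation as a sequence of per-turn embedding vectors and computes a few simple path statistics. The specific features (`path_length`, `net_displacement`, `straightness`) and the function name `trajectory_features` are assumptions chosen for illustration; the abstract does not list the features PsychoPass actually uses.

```python
import math


def trajectory_features(embeddings):
    """Simple geometric statistics of a path through embedding space.

    `embeddings` is a sequence of equal-length vectors, one per turn.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two turns to form a path")
    steps = [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]
    path_length = sum(steps)
    net_displacement = math.dist(embeddings[0], embeddings[-1])
    return {
        "num_turns": len(embeddings),
        "path_length": path_length,          # grows with conversation length
        "net_displacement": net_displacement,
        # Shape feature: 1.0 for a straight path, smaller for a winding one.
        "straightness": net_displacement / path_length if path_length > 0 else 1.0,
    }


# Toy 2-D "embeddings" standing in for real encoder outputs.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
bent = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(trajectory_features(straight)["straightness"])
print(trajectory_features(bent)["straightness"])
```

In this sketch `num_turns` and `path_length` carry length information while `straightness` is length-normalized, which loosely mirrors the abstract's point that length must be separated from shape before the remaining geometric signal can be trusted.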

83. 【2606.03132】DMT-CBT: Longitudinal Therapeutic State Modeling for CBT Counseling

Link: https://arxiv.org/abs/2606.03132

Authors: Chang Liu, Shuyi Zhang, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Bin Hu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Cognitive Behavioral Therapy, shown growing potential, Large language, potential for Cognitive

Comments:

Abstract:Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.

84. 【2606.03128】Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

Link: https://arxiv.org/abs/2606.03128

Authors: Bagus Rakadyanto Oktavianto Putra, Muhamad Risqi Utama Saputra, Widyawan, Guntur Dharma Putra

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: decentralized web services, Smart contracts face, Large Language Models, smart contract security, web services

Comments: 12 pages, 4 figures, 5 tables. Accepted to IEEE ICWS 2026

Abstract:Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models (LLMs) have shown promise in automated vulnerability detection, existing approaches lack severity evaluations with actionable remediation and demand unnecessarily massive computational overhead. In this study, we introduce an efficient end-to-end smart contract security audit framework utilizing lightweight, highly optimized open-source LLMs (0.6B-4B parameters). Our framework decouples comprehensive audit tasks into four interconnected components: vulnerability detection, explanation, severity classification, and remediation recommendation. To maintain high accuracy without massive parameters, we implement Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy to systematically screen and consolidate multiple draft responses from the model into a highly accurate audit report. Experimental results demonstrate that our lightweight pipeline consistently outperforms state-of-the-art open-source coder dense LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 in generative explanation tasks. Furthermore, our extensive ablation studies empirically validate the superiority of our decoupled audit processes over unified prompting and uncover a novel severity centrality bias, establishing a critical benchmark for future research in LLM-assisted auditing.

85. 【2606.03113】Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

Link: https://arxiv.org/abs/2606.03113

Authors: Yanyu Zhu, Hoilam Pao, Niu Hu, Wei Guo, Shaoxiong Zhan, Boyu Lai, Zitai Wang, Yongqin Zeng, Hai-Tao Zheng

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models suffer, Large Language, slow autoregressive inference, Markov Decision Process

Comments:

Abstract:Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We reframe this optimization as a Markov Decision Process and propose LEDE, a framework that uses offline reinforcement learning. LEDE learns a policy to dynamically select the optimal exit layer and speculation length based on the local context of the generated sequence at each step, balancing computational cost and draft quality. Comprehensive evaluations on Llama-2 and Llama-3 models show LEDE achieves up to a 2.0×~2.7× speedup over autoregressive decoding and provides an additional 17% speedup over the static speculative baselines.

86. 【2606.03110】Coherence Maximization Improves Pluralistic Alignment

Link: https://arxiv.org/abs/2606.03110

Authors: Taslim Mahbub, Yiding Pei, Shi Feng

Categories: Computation and Language (cs.CL)

Keywords: Aligning AI systems, Internal Coherence Maximization, open challenge, extensive human supervision, human supervision remains

Comments:

Abstract:Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these examples effective, using Internal Coherence Maximization (ICM) -- which infers labels by maximizing their mutual predictability -- to generate persona-specific examples that steer a model toward a target group's values, without human supervision. Across four benchmarks spanning classification, preference, and open-ended generation, ICM-inferred in-context examples match the performance of gold labels. Crucially, coherence matters beyond individual label accuracy: with accuracy held constant, more coherent examples generalize substantially better than incoherent ones. For personas underrepresented in pretraining data, targeted human feedback on the questions where the model is least certain about a persona's values yields better generalization than the same number of labels on arbitrary questions. These results identify coherence as a key design principle for scalable value specification, leveraging the diverse human perspectives already encoded in pretrained language models.

87. 【2606.03102】Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Link: https://arxiv.org/abs/2606.03102

Authors: Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu

Categories: Computation and Language (cs.CL)

Keywords: Test-time scaling improves, large language models, Test-time scaling, incurs substantial cost, scaling improves

Comments:

Abstract:Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
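
The stop-or-continue loop described in the abstract above can be sketched as follows. This is a rough illustration only: the paper trains its controller with RL, whereas `should_stop` here is a fixed threshold rule used as a stand-in, and the chosen statistics (`top_share`, `margin`), the thresholds, and all function names are assumptions rather than details from the paper.

```python
from collections import Counter


def answer_statistics(answers):
    """Summary statistics over the final answers sampled so far."""
    counts = Counter(answers).most_common()
    n = len(answers)
    top_share = counts[0][1] / n
    runner_up_share = counts[1][1] / n if len(counts) > 1 else 0.0
    return {"num_samples": n, "top_share": top_share,
            "margin": top_share - runner_up_share}


def should_stop(stats, min_samples=4, margin_threshold=0.5, max_samples=32):
    """Placeholder policy (assumption): a learned controller would go here."""
    if stats["num_samples"] >= max_samples:
        return True  # hard budget cap
    return stats["num_samples"] >= min_samples and stats["margin"] >= margin_threshold


def adaptive_sample(draw_answers, batch_size=4, **policy_kwargs):
    """Sample in rounds until the policy says stop; return the majority answer."""
    answers, rounds = [], 0
    while True:
        answers.extend(draw_answers(batch_size))
        rounds += 1
        if should_stop(answer_statistics(answers), **policy_kwargs):
            majority = Counter(answers).most_common(1)[0][0]
            return majority, rounds, len(answers)


# Hypothetical sampler that always returns the same answer.
print(adaptive_sample(lambda k: ["42"] * k))  # ('42', 1, 4)
```

Because the policy sees only counts of final answers, it needs no access to model internals, which is consistent with the abstract's claim that the controller can run on CPU.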

88. 【2606.03099】PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

Link: https://arxiv.org/abs/2606.03099

Authors: Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu, Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang, Jie Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: rich contextual cues, Deep Image Search, Image Search requires, Search requires multi-step, requires multi-step reasoning

Comments:

Abstract:Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi-step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context-aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.

89. 【2606.03096】Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

Link: https://arxiv.org/abs/2606.03096

Authors: Yuanpu Cao, Ziyi Yin, Fenglong Ma, Jinghui Chen

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, making knowledge editing, Language Models, making knowledge

Comments:

Abstract:Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

90. 【2606.03085】Multi-component Causal Tracing in Large Language Models

Link: https://arxiv.org/abs/2606.03085

Authors: Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: causal pathways linking, pathways linking specific, linking specific inputs, Causal tracing systematically, LLM behavior

Comments: Accepted to ACL 2026 main conference

Abstract:Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at this https URL.

91. 【2606.03080】Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

Link: https://arxiv.org/abs/2606.03080

Authors: Mingkuan Zhao, Xiayu Sun, Wentao Hu, Suquan Chen, Jiaxuan Li, Xiaoyan Zhu, Xin Lai, Jiayin Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: factorize sequence probabilities, models factorize sequence, leaving future information, future information unexploited, language models factorize

Comments:

Abstract:Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.
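
The training objective described in the abstract above, a language-modeling loss plus a KL-based regret term, can be written out for a single token position as a small sketch. Several things here are assumptions, not details from the paper: the KL direction is taken as KL(teacher || student), the weighting factor `regret_weight` is hypothetical, and the logits are toy values rather than outputs of a real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def regret_objective(student_logits, teacher_logits, target_index, regret_weight=1.0):
    """Cross-entropy on the target token plus a weighted teacher-to-student KL term."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)  # treated as fixed (no gradient) in this sketch
    lm_loss = -math.log(student[target_index])
    regret = kl_divergence(teacher, student)
    return lm_loss + regret_weight * regret, lm_loss, regret


# Toy 3-token vocabulary: the teacher, which sees future context, is more
# confident in the correct token (index 0) than the causal student.
total, lm_loss, regret = regret_objective([1.0, 0.5, 0.0], [3.0, 0.0, 0.0], target_index=0)
print(f"lm_loss={lm_loss:.4f} regret={regret:.4f} total={total:.4f}")
```

When the student and teacher distributions coincide the regret term is zero and the objective reduces to ordinary next-token cross-entropy.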

92. 【2606.03078】G^2C-MT: Graph-Guided Context Selection for Document-Level Machine Translation

Link: https://arxiv.org/abs/2606.03078

Authors: Baijun Ji, Zixuan Zhou, Xiangyu Duan, Yu Liu, Longbo Sun, Rupu Wei, Bohong Zhao

Categories: Computation and Language (cs.CL)

Keywords: Effective document-level machine, requires capturing long-range, Effective document-level, capturing long-range discourse, requires capturing

Comments: 9 pages, 2 figures; IJCAI 2026

Abstract:Effective document-level machine translation (DocMT) requires capturing long-range discourse dependencies. Recent work has explored retrieval-based and discourse-aware context selection. However, these approaches often lack an explicit mechanism for modeling structured discourse dependencies between distant paragraphs in a document. In this paper, we propose G^2C-MT (Graph-Guided Context for Machine Translation), which views DocMT context selection as a structured path discovery problem on a lightweight discourse graph, rather than retrieving unstructured context sets or relying on expensive LLM-based discourse modeling. In detail, we represent each paragraph as a node and model the relationship between each pair of nodes, considering their semantic similarity, adjacency, and keyword overlap. Furthermore, we propose a depth-biased random walk over the graph to sample a backward context path for each target paragraph. The context path will be used to prompt a large language model (LLM) for translation. This framework naturally supports multi-path context sampling, which can improve robustness by aggregating diverse translation candidates for discourse-ambiguous inputs. Experiments conducted across various domains show that G^2C-MT outperforms strong baselines on multiple LLMs, including DeepSeek-V3, Gemini-2.5-Flash-lite, and the Qwen-2.5/3 series.
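
As a rough illustration of the graph-and-walk idea in this abstract, the sketch below scores paragraph pairs and samples a backward context path. Everything specific is an assumption, not the paper's method: the edge weights (`w_sim`, `w_adj`, `w_kw`), the use of token-level Jaccard overlap as a stand-in for semantic similarity, and the interpretation of "depth-biased" as sharpening the sampling distribution at deeper steps.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_weight(i, j, paragraphs, keywords, w_sim=1.0, w_adj=0.5, w_kw=0.5):
    """Combine similarity, adjacency, and keyword overlap (assumed linear mix)."""
    sim = jaccard(paragraphs[i].lower().split(), paragraphs[j].lower().split())
    adj = 1.0 if abs(i - j) == 1 else 0.0
    kw = jaccard(keywords[i], keywords[j])
    return w_sim * sim + w_adj * adj + w_kw * kw


def sample_backward_path(target, paragraphs, keywords, max_len=3, depth_bias=0.5, rng=None):
    """Walk from the target paragraph to earlier ones; return the path in document order."""
    rng = rng or random.Random(0)
    path, current = [], target
    for depth in range(max_len):
        candidates = list(range(current))  # only strictly earlier paragraphs
        if not candidates:
            break
        sharpness = 1.0 + depth_bias * depth  # hypothetical depth bias
        weights = [
            (edge_weight(current, c, paragraphs, keywords) + 1e-6) ** sharpness
            for c in candidates
        ]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        path.append(current)
    return sorted(path)


if __name__ == "__main__":
    paragraphs = [
        "The treaty was signed in spring",
        "Farmers welcomed the rain",
        "The treaty reduced tariffs on grain",
        "Markets reacted to the tariffs",
        "Grain prices fell after the treaty",
    ]
    keywords = [{"treaty"}, {"rain"}, {"treaty", "tariffs", "grain"}, {"tariffs"}, {"grain", "treaty"}]
    print(sample_backward_path(4, paragraphs, keywords))
```

Calling the sampler with different `rng` seeds yields several candidate paths, which corresponds to the multi-path sampling the abstract mentions.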

93. 【2606.03063】ZX-Calculus: Trace-Indexed Dependent Types and Epistemic Semantics

Link: https://arxiv.org/abs/2606.03063

Authors: Peng Chen

Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)

Keywords: Knowledge Evolution Calculus, Dependent Type Theory, Knowledge Evolution, Evolution Calculus, Martin-Lof Dependent Type

Comments:

Abstract:We propose ZX-Calculus (Knowledge Evolution Calculus), a conservative extension of Martin-Lof Dependent Type Theory (MLTT) integrating trace-indexed types, presheaf non-monotone semantics, and constructive AGM belief revision. A Coq mechanisation accompanies the paper (34 complete proofs; zero admits for the two central results). (I) Trace types. FinTrace(s0,sn) is an inductive family of typed execution traces. FinTrace and Star(Step) are isomorphic as path types but not judgementally equal; TraceElim exposes the event label e:Event explicitly, giving a more ergonomic interface for event-driven induction. We prove the Trace-Reachability Correspondence, Deterministic Replay, and a canonicity framework via reducibility candidates with a Transport Lemma (RC-elim deferred; all other Core results are Coq-verified). (II) Sheaf semantics. Trace-indexed propositions are contravariant sheaves over the free trace partial-order category Tf. A Separation Theorem (explicit countermodel) distinguishes proof-theoretic monotonicity from semantic non-monotonicity. The term model is an initial CwF (syntactic universal property, not classical completeness). (III) AGM belief revision. We give an explicit constructive partial meet contraction algorithm verified against (C1)-(C4). All eight AGM postulates (R1)-(R8) are theorems. Proofs of R7 and R8 use the Disjunctive Entrenchment Lemma, given a self-contained constructive derivation. (IV) Integration. B^AGM fails the sheaf composition law BP-comp for sequential revision (explicit countermodel, Coq-verified). We introduce Single-Step Revision Systems (SSRS), prove B^AGM is a valid SSRS (Coq-verified), and show this suffices for trace morphisms, retraction characterisation, and revision witnesses. The BP-comp failure reveals a fundamental tension between path-dependent belief revision and functor consistency, not previously identified.

94. 【2606.03043】The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Link: https://arxiv.org/abs/2606.03043

Authors: Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram

Categories: Computation and Language (cs.CL)

Keywords: judges agree strongly, circ, agree strongly, agreeing only weakly, human

Comments:

Abstract:LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM-as-judge stack across four community-built Indic datasets, eight Indic languages, and 41 LLM judges: score spread, effective rank, principal angle to the human subspace, and stacked correlations among judges and humans, all with bootstrap confidence intervals. On subjective rubrics, judges use less than half the human score range ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Their evaluation axis is nearly orthogonal to the human one and noticeably further from humans than humans are from each other ($87^\circ$--$89^\circ$ versus $78^\circ$--$81^\circ$). Inter-LLM agreement exceeds LLM--human agreement ($r_{LL} \approx 0.35$ versus $r_{LH} \approx 0.27$--$0.32$). On a rubric with a verifiable factual answer, the same diagnostics fall back into the human range (axis $58.5^\circ$; $r_{LH} = 0.519$). Fine-tuning and preference optimization recover spread ($0.32 \rightarrow 1.08$) but barely move the axis (still $87^\circ$--$88^\circ$). Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, placing a calibrated 24B Indic judge ($r = 0.184$) ahead of GPT-5.5 ($r = 0.123$), yet still short of human reliability (human-human $r = 0.474$ on the verifiable rubric). We argue that inter-LLM agreement should be considered evidence of human alignment only when a direct geometric check on the judge's score subspace passes; otherwise, the consensus reflects agreement within a collapsed subspace.

Submission history: [v1] Tue, 2 Jun 2026 02:26:18 UTC (1,384 KB), from Sourabrata Mukherjee

95. 【2606.03032】The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation

Link: https://arxiv.org/abs/2606.03032

Authors: Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Ningnan Wang, Nancy F. Chen, Min-Yen Kan

Categories: Computation and Language (cs.CL)

Keywords: Multi-agent LLM systems, systems often treat, LLM systems, Multi-agent LLM, treat consensus

Comments:

Abstract:Multi-agent LLM systems often treat consensus as evidence of successful interaction. For deliberative problems, however, reliability depends on whether agents preserve the facts and viewpoints needed to interpret an issue. We identify the deliberative illusion: discussion produces (1) factual attrition, the progressive loss of issue-critical facts, alongside (2) stance homogenization, the collapse of diverse positions toward consensus. To measure this process, we introduce DelibTrace, a framework that decomposes each issue into atomic facts, labels issue-critical ones, distributes them across agents, and tracks their survival across discussion rounds. Across ethical and news-based deliberation with three representative LLM families, multi-agent discussion erases up to 72% of issue-critical facts. This loss is consequential: retained evidence can reconstruct the issue misleadingly, final stances remain anchored in base-model priors, and a single malicious agent can inject misinformation into the shrinking shared context. These results reveal a sharper risk: agents can agree more while knowing less. We call for evaluations that measure which facts, uncertainties, and legitimate disagreements survive interaction.

96. 【2606.03029】Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

Link: https://arxiv.org/abs/2606.03029

Authors: Paiheng Xu, Jing Liu, Wei Ai

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: computational social science, discover interpretable differences, instructional quality, core goal, goal of computational

Comments:

Abstract:A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature--covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
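
The two generic operations named at the end of this abstract, within-stratum demeaning and inverse-frequency reweighting, can be sketched in a few lines. This is a textbook-style illustration rather than the authors' implementation; the function names and the normalization (weights summing to the sample size) are assumptions.

```python
from collections import Counter, defaultdict


def demean_within_strata(values, strata):
    """Subtract each stratum's mean so only within-stratum variation remains."""
    counts = Counter(strata)
    sums = defaultdict(float)
    for value, stratum in zip(values, strata):
        sums[stratum] += value
    means = {s: sums[s] / counts[s] for s in counts}
    return [value - means[stratum] for value, stratum in zip(values, strata)]


def inverse_frequency_weights(strata):
    """Weight each example by n / (k * n_s) so every stratum carries equal total weight."""
    counts = Counter(strata)
    n, k = len(strata), len(counts)
    return [n / (k * counts[s]) for s in strata]


if __name__ == "__main__":
    feature = [1.0, 3.0, 10.0, 20.0, 30.0]
    stratum = ["a", "a", "b", "b", "b"]
    print(demean_within_strata(feature, stratum))
    print(inverse_frequency_weights(stratum))
```

Demeaning removes differences that are explained by the covariate alone, while the weights keep a small subgroup from being drowned out by a large one.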

97. 【2606.03027】SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

Link: https://arxiv.org/abs/2606.03027

Authors: Peerat Limkonchotiwat, Raymond Ng, Sarana Nutanong, Jian Gang Ngui

Categories: Computation and Language (cs.CL)

Keywords: making robustness important, Southeast Asian languages, real-world NLP, Southeast Asian, downstream applications

Comments:

Abstract:Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.

98. 【2606.03022】Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

Link: https://arxiv.org/abs/2606.03022

Authors: Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models, Large Language, logical constraints, remains a persistent

Comments:

Abstract:Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. The code is available at this https URL.
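
The decomposition-and-suppression step in this abstract can be pictured with plain vectors. The sketch below is a simplified reading, not the released code: it assumes a single anchor vector per layer, measures each head's orthogonal component by its norm, and damps heads whose norm is a Z-score outlier. The names `z_threshold` and `damping` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(head_out, anchor):
    """Split a head output into parts parallel and orthogonal to the anchor."""
    scale = dot(head_out, anchor) / dot(anchor, anchor)
    parallel = [scale * a for a in anchor]
    orthogonal = [h - p for h, p in zip(head_out, parallel)]
    return parallel, orthogonal


def suppress_outlier_heads(head_outputs, anchor, z_threshold=1.5, damping=0.0):
    """Attenuate the orthogonal part of heads whose orthogonal norm is a Z-score outlier."""
    parts = [decompose(h, anchor) for h in head_outputs]
    norms = [math.sqrt(dot(orth, orth)) for _, orth in parts]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    adjusted = []
    for (parallel, orth), norm in zip(parts, norms):
        z = (norm - mean) / std if std > 0 else 0.0
        keep = damping if z > z_threshold else 1.0  # 1.0 leaves the head unchanged
        adjusted.append([p + keep * o for p, o in zip(parallel, orth)])
    return adjusted


if __name__ == "__main__":
    anchor = [1.0, 0.0]  # stand-in for the input residual stream
    heads = [[1.0, 0.1], [2.0, 0.1], [0.5, 0.1], [1.0, 5.0]]
    print(suppress_outlier_heads(heads, anchor))
```

With only a handful of heads the attainable Z-scores are small, so the threshold here is lower than one would likely use over a full layer.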

99. 【2606.03021】Hint-Guided Diversified Policy Optimization for LLM Reasoning

Link: https://arxiv.org/abs/2606.03021

Authors: Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, promising enhancement strategy, Recent developments, impressive reasoning capabilities

Comments:

Abstract:Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

100. 【2606.02994】Inducing Reasoning Primitives from Agent Traces

Link: https://arxiv.org/abs/2606.02994

Authors: Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: ReAct-style LLM agents, Reasoning Primitive Induction, transient scratchpads, ReAct-style LLM, routines trapped

Comments: 22 pages including appendices

Abstract:ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 → 74), +30pp on MuSR team allocation (38 → 68), and +22pp on NatPlan meeting planning (7 → 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

101. 【2606.02991】Pretraining Language Models on Historical Text

Link: https://arxiv.org/abs/2606.02991

Authors: Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: English text predating, exclusively on English, English text, trained exclusively, text predating

Comments:

Abstract:We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

102. 【2606.02983】A Locally Deployed RAG-Based Academic Advising System for Course Selection

Link: https://arxiv.org/abs/2606.02983

Authors: Feng Li, Yoritaka Iwata

Categories: Computation and Language (cs.CL)

Keywords: skills holistically, curriculum based, great importance, develop their knowledge, knowledge and skills

Comments: To be published in Elsevier's Procedia Computer Science (KES 2026)

Abstract:Taking courses in the correct sequence, as determined by the prerequisites between them, is of great importance for students to develop their knowledge and skills holistically. However, students crafting this sequence in isolation frequently struggle with recognition limitations and information overload, which lead to confusion. At the same time, educational institutions have difficulty providing adequate academic advice on the correct sequence because of limited educational resources. To address these challenges, we propose a locally deployed RAG-based academic advising system grounded in syllabus information. By combining large language models with retrieval from structured syllabus data, the system is designed to support course selection, prerequisite understanding, and personalized study planning in a privacy-preserving manner.

103. 【2606.02981】Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Link: https://arxiv.org/abs/2606.02981

Authors: Luyang Zhang, Jingyan Li

Categories: Computation and Language (cs.CL)

Keywords: model ranks highest, reward model ranks, inference scaling, ranks highest, improves accuracy

Comments:

Abstract:Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $\rho = 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.
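
For readers unfamiliar with the quantity being predicted here, the sketch below computes a best-of-$N$ gain on a toy validation set. It is only an illustration of the definition given in the abstract; the function names and the choice of mean single-sample accuracy as the baseline are assumptions, and the reward scores and correctness labels are made-up inputs.

```python
def best_of_n_index(rewards):
    """Index of the candidate the reward model ranks highest."""
    return max(range(len(rewards)), key=lambda i: rewards[i])


def best_of_n_gain(rewards, correct):
    """Best-of-N accuracy minus expected single-sample accuracy.

    rewards[p][i]: reward-model score of sample i for prompt p.
    correct[p][i]: 1 if sample i for prompt p is correct, else 0.
    """
    num_prompts = len(correct)
    single = sum(sum(row) / len(row) for row in correct) / num_prompts
    picked = sum(row[best_of_n_index(r)] for r, row in zip(rewards, correct)) / num_prompts
    return picked - single


if __name__ == "__main__":
    rewards = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]
    correct = [[0, 1, 0], [0, 0, 1]]
    print(round(best_of_n_gain(rewards, correct), 4))
```

The gain depends on both the sampler (how often a correct answer appears at all) and the verifier (whether it ranks that answer first), which is why it varies across models.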

104. 【2606.02976】Memory Retrieval for Changing Preferences

Link: https://arxiv.org/abs/2606.02976

Authors: Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao

Categories: Computation and Language (cs.CL)

Keywords: Long-context dialogue systems, history are relevant, dialogue systems, systems must decide, interaction history

Comments:

Abstract:Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to account for the changing and potentially inconsistent nature of user preferences. In this work, we propose a unified framework for memory access and selection based on changing preferences. We formulate personalized memory retrieval as identifying which historical turns provide evidence about a user's latent preference state, rather than relying on surface-level semantic similarity. To this end, we quantify the utility of each memory turn using a Bayes factor, defined as the improvement in the model's likelihood of the reference response when the turn is included in context. This provides a principled measure of evidence strength and a unified signal for both memory access and selection. By framing memory retrieval as utility estimation, the model learns to identify salient turns and regulate memory usage based on expected utility. Experiments on four heterogeneous memory benchmarks show that our approach outperforms existing embedding-based retrieval on long-context, preference-intensive tasks where modeling changing preferences is essential, while remaining competitive in low-density regimes where semantic similarity suffices.
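
The utility signal described in this abstract can be written as a log Bayes factor: the log-likelihood of the reference response with a memory turn in context, minus the log-likelihood without it. The sketch below uses a toy word-overlap scorer (`toy_loglik`) purely as a stand-in for a language model's log-likelihood; that scorer, the function names, and the threshold are all hypothetical.

```python
def toy_loglik(reference, context):
    """Stand-in scorer: penalize each reference word that is absent from the context."""
    context_words = set(" ".join(context).lower().split())
    missing = [w for w in reference.lower().split() if w not in context_words]
    return -float(len(missing))


def log_bayes_factor(loglik, reference, query, turn):
    """Evidence a memory turn provides: log p(ref | turn, query) - log p(ref | query)."""
    return loglik(reference, [turn, query]) - loglik(reference, [query])


def select_memory(loglik, reference, query, history, threshold=0.0):
    """Return the turns whose log Bayes factor exceeds the threshold, plus an access flag."""
    selected = [t for t in history if log_bayes_factor(loglik, reference, query, t) > threshold]
    return selected, bool(selected)


if __name__ == "__main__":
    history = ["i switched to green tea last month", "the weather is nice today"]
    turns, use_memory = select_memory(
        toy_loglik, "i prefer green tea now", "what should i drink", history
    )
    print(turns, use_memory)
```

The same score serves both decisions mentioned in the abstract: whether any turn clears the threshold (access) and which turns do (selection). At inference time no reference response is available, so a deployed system would need a learned estimate of this utility.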

105. 【2606.02973】Chatbots Output Meaningful (but Problematic) Language

Link: https://arxiv.org/abs/2606.02973

Authors: Matthew Stone, Una Stojnić

Categories: Computation and Language (cs.CL)

Keywords: Anthropic agent Claude, capital of Spain, Spain, language, Claude answers

Comments: 49 pages

Abstract:Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning -- and does it express a true proposition? Most ordinary users, as well as AI engineers, take the answer to be trivially "yes." However, many cognitive scientists, linguists, and philosophers of language argue that dominant intentionalist accounts of language and meaning deliver the opposite conclusion. Theorists more sympathetic to ordinary users' intuitions have therefore advocated a radical "de-anthropomorphization" of language, revising our understanding of mental states, intentions, and semantic content to capture the intuition that the outputs of LLMs are meaningful. We take a different approach. While we, too, argue that LLM outputs are meaningful, we contend that a proper theory of human language already applies, as is, to current chatbots. Meaning is a low bar: claiming that LLM outputs are meaningful does not require positing mental states, intentions, rationality, or the cognitive capacities requisite for communication in LLMs -- or, indeed, making any other anthropomorphic assumptions. People do have communicative intentions (typically successful ones), but nevertheless, even in humans, language production can depart from what the speaker has in mind. Our view has important consequences for how we should theorize about -- and critically engage with -- both human linguistic output and synthetically generated text. In particular, to say that chatbots produce meaningful text is not by any means to endorse what they output, or to assume that the technology is (or is not) good, powerful, appropriate, or useful.

ACM classes: I.2.0; I.2.7

106. 【2606.02971】EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

Link: https://arxiv.org/abs/2606.02971

Authors: Marios Koniaris, Vasileios Kotronis, Eugenia Giannini, Panayiotis Tsanakas

Categories: Computation and Language (cs.CL)

Keywords: Extracting reporting obligations, Extracting reporting, legislation is critical, critical for assessing, assessing and reducing

Comments:

Abstract:Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal understanding. Current legal NLP methods lack specialised datasets with clear guidelines and comparative evaluation of extraction paradigms and domain adaptation strategies. We curate EURO-5K, a corpus of sentence-level reporting obligations and challenging negative examples from 136 EU legislative acts. On this dataset, we train and compare discriminative token-classification models (BERT-style) and generative span-extraction models (LLMs), evaluating both full fine-tuning and parameter-efficient QLoRA against baselines (pattern and dependency-based extraction, few-shot prompting). Results show that fully fine-tuned generic and legal BERT models achieve similar performance (0.89 F1), while fine-tuned LLMs match encoder accuracy for sentence-level extraction. Legal pretraining offers only small gains for generative models. In contrast, it is clearly beneficial when adaptation capacity is constrained, as parameter-efficient tuning of Legal-BERT outperforms its generic counterpart. Learning curve analysis demonstrates that legal pretraining accelerates early learning with minimal data. All approaches converge around 3K samples with diminishing returns thereafter, validating dataset sufficiency. Cross-dataset evaluation on two external regulatory corpora shows that our models behave as specialised reporting obligation extractors rather than generic regulatory classifiers. We release EURO-5K, trained models, and an interactive demo with explainability visualizations and structured RDF export. These demonstrate that both paradigms and parameter-efficient training provide practical tools for regulatory compliance automation.

107. 【2606.02964】Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Link: https://arxiv.org/abs/2606.02964

Authors: Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui

Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, avoid redundant attention, Large Language, redundant attention computation, relies on key-value

Comments:

Abstract:Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance. It comprises three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces time-to-first-token (TTFT) by up to 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over the latest baselines, confirming the effectiveness of the method in common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.

108. 【2606.02955】Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Link: https://arxiv.org/abs/2606.02955

Authors: Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: inference remains bottlenecked, Diffusion large language, parallel token generation, promise parallel token, large language models

Comments: Initial version accepted at Workshop on Structured Probabilistic Inference Generative Modeling, ICML 2026

Abstract:Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory uses a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose Fast-dLLM++, a training-free extension that introduces Fréchet profile decoding: selecting parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector: it recovers the previous rule exactly in the equal-confidence case and adds a provable heterogeneity bonus when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy-throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37% higher throughput at comparable accuracy. Our anonymous code release is at this https URL.

109. 【2606.02953】Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

Link: https://arxiv.org/abs/2606.02953

Authors: Claire Bonial, Claire Benet Post, Laura Michaelis, Harish Tayyar Madabushi

Categories: Computation and Language (cs.CL)

Keywords: distinct frequency signals, high frequency usage, stemming from high, frequency signals, frequency usage

Comments:

Abstract:Usage-based theories of grammars posit that creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether or not the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize and can reproduce with nonce words constructional productivity (entrenchment) in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous, but never observed in data.

110. 【2606.02951】SCOPE: Real-Time Natural Language Camera Agent at the Edge

Link: https://arxiv.org/abs/2606.02951

Authors: Nikolaj Hindsbo,Sina Ehsani,Pragyana Mishra

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: real-world task demands, reflect real-world task, Deploying language-driven agents, robotics requires evaluations, Blender-based simulation environment

Comments: 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16–19, 2026. Code: [this https URL](https://github.com/HindsboNikolaj/SCOPE)

Click to view abstract

Abstract:Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

ACM classes: I.2.9; I.2.10; I.2.7; I.2.11

Journal reference: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026

Related DOI: https://doi.org/10.1145/3757279.3785641

111. 【2606.02914】Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

Link: https://arxiv.org/abs/2606.02914

Authors: Sema Helali,Lina Abu Nadab,Sausan Alqawas,Alaa Abd-Alrazaq,Faleh Tamimi,Rafat Damseh

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Oral diseases affect, billion people worldwide, remains poorly understood, dentistry remains poorly, Oral diseases

Comments:

Click to view abstract

Abstract:Background: Oral diseases affect nearly 3.5 billion people worldwide, yet the comparative clinical potential of large-scale AI models in dentistry remains poorly understood. Three distinct model categories have emerged: language-generative models, discriminative vision foundation models, and dental-specific foundation models, with no unified review examining their relationships and collective limitations. Methods: Following PRISMA-ScR guidelines, we systematically searched four databases (PubMed, Google Scholar, Scopus, arXiv), screened independently by two reviewers. After applying inclusion/exclusion criteria, 97 studies (2020-2026) were included. We propose a two-dimensional classification framework organizing models by architectural paradigm and dental specialization degree. Results: Language-generative models excel at text-based tasks (clinical reasoning, licensing exams, patient communication) but show inconsistent performance on image-dependent diagnostics. Adapted SAM and CLIP variants achieve strong tooth segmentation and lesion detection results. Dental-specific models (DentVFM, DentVLM, OralGPT) demonstrate strongest performance on complex multimodal tasks. Integrated pipelines consistently outperform single-model approaches. A data asymmetry is observed: dental-specific pretraining concentrates almost entirely in the vision domain, reflecting scarce large-scale dental text corpora. Conclusions: General-purpose and dental-specific models play complementary roles; the most effective systems combine both within structured pipelines. Safe autonomous deployment requires resolving three persistent barriers: hallucination in generative models, limited annotated dental datasets, and absent standardized clinical evaluation benchmarks.

Submission history: From Sema Helali, [v1] Mon, 1 Jun 2026 21:39:27 UTC (18,026 KB)

112. 【2606.02911】The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

Link: https://arxiv.org/abs/2606.02911

Authors: Mirko Lai,Alessandra Urbinati,Simona Frenda,Fabiana Vernero,Marco Antonio Stranisci

Categories: Computation and Language (cs.CL)

Keywords: Current research primarily, generate annotated data, research primarily focuses, Current research, Collaborative Filtering-style annotators'

Comments:

Click to view abstract

Abstract:Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with Collaborative Filtering-style annotators' representation to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using Non-Conformity Scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluated four LLMs of different sizes and families across four content moderation datasets. Our findings show that, while the uncertainty of all models increases with annotator disagreement, larger models tend to be more confident in the classification of texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.

113. 【2606.02908】WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Link: https://arxiv.org/abs/2606.02908

Authors: Hengrui Gu,Xiaotian Han,Kaixiong Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: execute valid actions, infer user intent, collect missing information, collect missing, valid actions

Comments:

Click to view abstract

Abstract:Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

114. 【2606.02907】Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

Link: https://arxiv.org/abs/2606.02907

Authors: Subramanyam Sahoo,Vinija Jain,Aman Chadha,Divya Chaudhary

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language model, models learn distinct, learn distinct representations, hidden states, language model

Comments: Accepted at the 6th Workshop on Trustworthy NLP, ACL 2026

Click to view abstract

Abstract:Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\leq$ 1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
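
To make the "residualizing" step in this abstract more concrete, here is a minimal sketch of what removing a confound from a feature can look like: fit an ordinary least-squares line from the confound (for example, response length) to one feature dimension, then keep only the residual. The function name `residualize`, the single-confound setup, and the toy numbers are assumptions for illustration. The abstract does not describe the authors' exact procedure, which covers several confounds and full hidden-state vectors.

```python
from typing import List, Sequence


def residualize(feature: Sequence[float], confound: Sequence[float]) -> List[float]:
    """Remove the part of `feature` that is linearly predictable from `confound`.

    Hypothetical helper: fits feature ~ intercept + slope * confound by
    ordinary least squares and returns the residuals.
    """
    n = len(feature)
    if n != len(confound) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(confound) / n
    mean_y = sum(feature) / n
    var_x = sum((x - mean_x) ** 2 for x in confound)
    if var_x == 0:
        # A constant confound explains nothing beyond the mean.
        return [y - mean_y for y in feature]
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(confound, feature))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(confound, feature)]


# Toy data: one hidden-state dimension and a confound such as response length.
lengths = [1.0, 2.0, 3.0, 4.0]
activation = [1.0, 2.0, 2.0, 5.0]
cleaned = residualize(activation, lengths)
```

After this step the residual is uncorrelated with the confound, so a probe trained on `cleaned` can no longer exploit that confound linearly. If probe accuracy then drops to chance, the original separation was carried by the confound.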

115. 【2606.02871】Adaptive Latent Agentic Reasoning

Link: https://arxiv.org/abs/2606.02871

Authors: Dongwon Jung,Peng Shi,Yi Zhang,Junshan Zhang,Muhao Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large reasoning models, LLM agents, Large reasoning, models improve performance, Current LLM agents

Comments:

Click to view abstract

Abstract:Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

116. 【2606.02866】When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Link: https://arxiv.org/abs/2606.02866

Authors: Chirag Parmar,Akshat Mehta,Henglin Wu,Jagadish Ramamurthy,Shweta Medhekar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: data cleaning, Generator accepts uncritically, hallucinated Critic feedback, multi-agent debate, debate

Comments: 27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics

Click to view abstract

Abstract:When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), i.e., hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.

117. 【2606.02859】Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

Link: https://arxiv.org/abs/2606.02859

Authors: Zhenting Qi,Huangyuan Su,Ao Qu,Chenyu Wang,Yu Yao,Han Zheng,Kushal Chattopadhyay,Guowei Xu,Zihan Wang,Weirui Ye,Vijay Janapa Reddi,Ju Li,Paul Pu Liang,Himabindu Lakkaraju,Sham Kakade,Yilun Du

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: centralized control, Friedrich Hayek economic, self-orchestrate and self-adapt, Friedrich Hayek, Hayek economic theory

Comments:

Click to view abstract

Abstract:How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

118. 【2606.02837】Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Link: https://arxiv.org/abs/2606.02837

Authors: Andrea Brunello,Cristian Curaba,Luca Geatti,Michele Mignani,Angelo Montanari,Nicola Saccomanno

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language Inference, Natural Language, Language Inference, Accurate translation, First-Order Logic

Comments:

Click to view abstract

Abstract:Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.

119. 【2606.02814】Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors

Link: https://arxiv.org/abs/2606.02814

Authors: Francisco Valentini,Edgar Altszyler,Martin Fajcik

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: annotated query-document pairs, query-document pairs, estimate query-document relevance, relevance, annotated query-document

Comments:

Click to view abstract

Abstract:Neural retrievers are trained to estimate query-document relevance from annotated query-document pairs. Yet annotation protocols may not purely reflect relevance: they select only a subset of documents for labeling, and this selection can favor certain document types over others. We investigate whether supervised bi-encoder retrievers implicitly learn a document-level relevance prior: a query-independent signal encoded in their representation space as a side effect of training on annotated data. We estimate this prior by training simple classifiers on frozen document embeddings and evaluate three state-of-the-art retrievers across multiple IR benchmarks. We find that supervised neural retrievers encode relevance priors that generalize to unseen documents and are consistent across models. These priors create a findability gap: documents with lower prior are systematically harder to retrieve, even when genuinely relevant. This effect appears in supervised dense retrievers but is weaker and less consistent in BM25, and it persists under controlled matched-document comparisons. Using LLM-based explanations, we find that judged-relevant documents tend to be comprehensive, self-contained summaries of mainstream topics, while niche, fragmentary, or highly technical content is often left unjudged. Retrievers internalize this bias, ranking documents with these favored features higher than documents that lack them, independently of their actual relevance. Our findings expose a structural limitation of supervised retrieval: models trained on annotated data do not just learn relevance, but also the implicit document preferences in their training data.

120. 【2606.02812】Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Link: https://arxiv.org/abs/2606.02812

Authors: Sihang Zeng,Matthew Thompson,Ruth Etzioni,Meliha Yetisgen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Modeling patient trajectories, electronic health records, longitudinal electronic health, long-context multimodal sequences, Modeling patient

Comments:

Click to view abstract

Abstract:Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

121. 【2606.02806】Translating Classical Poetry into Modern Prose

Link: https://arxiv.org/abs/2606.02806

Authors: Chalamalasetti Kranti,Sowmya Vajjala

Categories: Computation and Language (cs.CL)

Keywords: Century Telugu Classical, Telugu Classical Poetry, Classical Poetry, Century Telugu, Telugu and English

Comments: Preprint

Click to view abstract

Abstract:We introduce Padyam2Gadyam, a dataset for the task of poem-to-prose translation from 13th-17th Century Telugu Classical Poetry to contemporary Telugu and English prose. The dataset consists of 600 poems and their human-verified Telugu and English prose translations. We evaluated 5 contemporary Large Language Models (LLMs) on their ability to do poem-to-prose translation into Telugu and English. Our results indicate that while there are differences across LLMs, their overall performance leaves substantial room for improvement in both languages. Through qualitative analysis, we discuss the capabilities and limitations of contemporary MT evaluation approaches for this task.

122. 【2606.02780】Do Value Vectors in Deep Layers Need Context from the Residual Stream?

Link: https://arxiv.org/abs/2606.02780

Authors: Muyu He,Yuchen Liu,Qingya Huang,Li Zhang

Categories: Computation and Language (cs.CL)

Keywords: large part due, transformer architecture, backbone of modern, modern LLMs, large part

Comments: 13 pages, 5 figures. Code: [this https URL](https://github.com/RiddleHe/nanochat)

Click to view abstract

Abstract:The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance. Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
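
As a rough picture of what a context-free value vector means in this abstract, the sketch below keeps one lookup table per layer that maps a token id straight to a value vector, so the value for a token is the same whatever surrounds it. The names `make_value_bank` and `context_free_values`, the random initialization (a stand-in for learned parameters), and the toy sizes are assumptions for illustration. This is not the authors' implementation, which the abstract does not detail.

```python
import random
from typing import List, Sequence


def make_value_bank(vocab_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Build a hypothetical per-layer table with one value vector per token id.

    In a trained model these rows would be learned parameters; here they are
    random placeholders.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def context_free_values(bank: List[List[float]],
                        token_ids: Sequence[int]) -> List[List[float]]:
    """Look up value vectors by token id only, ignoring the residual stream."""
    return [bank[t] for t in token_ids]


bank = make_value_bank(vocab_size=10, dim=4)
seq_a = context_free_values(bank, [3, 1, 3])
seq_b = context_free_values(bank, [7, 3])
# Token 3 gets the same value vector in both sequences and at both positions.
same_everywhere = seq_a[0] == seq_a[2] == seq_b[1]
```

Because the lookup depends only on the token id, nothing here has to be recomputed from the residual stream or kept in a per-sequence cache, which is the storage advantage the abstract points to.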

123. 【2606.02776】Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

Link: https://arxiv.org/abs/2606.02776

Authors: Vera Neplenbroek,Gabriele Sarti,Arianna Bisazza,Raquel Fernández

Categories: Computation and Language (cs.CL)

Keywords: large language models, single conversation history, language models, medical and financial, large language

Comments:

Click to view abstract

Abstract:When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

124. 【2606.02750】On the Persistent Effects of Lexicality in Large Language Models

Link: https://arxiv.org/abs/2606.02750

Authors: Hammad Rizwan,Muhammad Umair Haider,Nishant Subramani,Mona T. Diab,A.B. Siddique,Hassan Sajjad

Categories: Computation and Language (cs.CL)

Keywords: large language models, play an important, extracted from large, large language, important role

Comments:

Click to view abstract

Abstract:Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to an information-theoretic perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as case studies.

125. 【2606.02741】Greener Than Humans? Environmental Attitudes in Large Language Models

Link: https://arxiv.org/abs/2606.02741

Authors: Stefanie Kunkel,Tilman Hartwig,Marcus Voss,Emma K. Schütt,Angelika Gellrich

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large language models, sustainability-related decision support, Large language, systematic evidence exists, decision support

Comments: Code can be found at [this https URL](https://gitlab.opencode.de/uba-ki-lab/llm-questionnaire-benchmarking-framework). Benchmark data and results can be found at [this https URL](https://zenodo.org/records/20445903)

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs. This paper develops a benchmark for evaluating environmental cognition, affect, and behavioural recommendations in LLMs and applies it to 31 widely used proprietary and open-weight models. Drawing on questions from established environmental awareness surveys and additional sustainability-related behavioural measures, we compare LLM responses 1) among models and 2) between models and human survey benchmarks from Germany. We assess their robustness across prompting conditions. We find that many LLMs align more closely with environmentally progressive attitudes than the average survey respondent, exhibiting higher levels of environmental affect and cognition and recommending behaviours associated with substantial potential CO2 reductions. At the same time, we observe no systematic relationship between sustainability-oriented responses and model origin, size, or release context. However, models exhibit contextual sensitivity controlled by persona-based prompting and show sycophantic shifts mirroring user-specified ideological positions, which raises concerns about steerability and normative reliability in real-world deployments. Our findings provide a reusable evaluation framework for assessing sustainability-related value alignment in LLMs and highlight the importance of governance, transparency, and critical oversight as AI systems become increasingly embedded in sustainability transformations and public decision-making.

126. 【2606.02737】Attention Calibration for Position-Fair Dense Information Retrieval

Link: https://arxiv.org/abs/2606.02737

Authors: Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Dense retrieval models, retrieval effectiveness degrades, Dense retrieval, exhibit positional bias, retrieval effectiveness

Comments:

Abstract:Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both s-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile x model x retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at this https URL
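
The strength coefficient and the position-fairness metric mentioned in this abstract can be illustrated with a small sketch. It assumes that lambda mixes the two attention distributions linearly, which is one natural reading of "interpolates" but is not confirmed by the abstract. The function names and toy numbers are hypothetical and do not come from the authors' codebase.

```python
def interpolate_attention(original, calibrated, lam):
    """Mix two attention distributions (assumed linear interpolation).

    lam = 0 keeps the original distribution, lam = 1 uses the fully
    calibrated one.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    mixed = [(1.0 - lam) * o + lam * c for o, c in zip(original, calibrated)]
    total = sum(mixed)
    # Renormalize to guard against floating-point drift.
    return [m / total for m in mixed]


def harmonic_mean(scores):
    """Harmonic mean of per-position-group scores.

    A low score in any single group pulls the result down, so the metric
    rewards models that do well on every positional group.
    """
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Toy example with made-up numbers: attention skewed toward early tokens
# versus a flatter, hypothetical calibrated distribution.
original = [0.70, 0.20, 0.10]
calibrated = [0.40, 0.30, 0.30]
half = interpolate_attention(original, calibrated, lam=0.5)
print(half)
print(harmonic_mean([0.8, 0.4]))
```

The harmonic mean is at most the arithmetic mean, so a configuration that raises nDCG@10 for late-position queries at a small cost to early-position queries can still improve it.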

127. 【2606.02684】Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Link: https://arxiv.org/abs/2606.02684

Authors: Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Huangjie Yuan, Tao Feng

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: selective training paradigms, large language models, On-Policy distillation, training paradigms, large language

Comments:

Abstract:On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at this https URL.

128. 【2606.02628】Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

Link: https://arxiv.org/abs/2606.02628

Authors: Aizierjiang Aiersilan

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: open-source LLMs encode, linearly separable truthfulness, hidden states, separable truthfulness signal, per-layer hidden states

Comments:

Abstract:We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B–8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904–1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks: blocks 13–18 of 32 for Llama and Mistral, and blocks 19–25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866–0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
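
To make the evaluation metric concrete, here is a minimal sketch of how a linear probe's scores turn into an AUROC value. The probe weights, the two-dimensional "hidden states", and the labels are toy values invented for illustration. In practice the weights would be fitted (for example with logistic regression) on hidden states extracted from a chosen layer, and the paper's own code is not reproduced here.

```python
def linear_probe_score(hidden_state, weights, bias=0.0):
    """Score one hidden-state vector with a linear probe: w . h + b."""
    return sum(w * h for w, h in zip(weights, hidden_state)) + bias


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): label 1 marks a hallucinated answer.
hidden_states = [[2.0, 0.1], [1.5, -0.3], [-1.0, 0.4], [-0.5, 0.0]]
labels = [1, 1, 0, 0]
weights = [1.0, 0.0]  # hypothetical probe direction

scores = [linear_probe_score(h, weights) for h in hidden_states]
print(auroc(scores, labels))
```

An AUROC of 0.5 corresponds to chance-level ranking, which is the scale against which the reported 0.541 for sampling-based detectors should be read.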

129. 【2606.02584】IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

Link: https://arxiv.org/abs/2606.02584

Authors: Ayman Ali Sharara

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: natural language processing, Idiomatic expressions remain, expressions remain, remain a persistent, persistent challenge

Comments: 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub

Abstract:Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.

130. 【2606.00384】VESTA: Visual Exploration with Statistical Tool Agents

Link: https://arxiv.org/abs/2606.00384

Authors: William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO)

Keywords: Fitting quantitative models, central step, step in scientific, Statistical Tool Agents, refine statistical models

Comments:

Abstract:Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

131. 【2512.00956】WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

Link: https://arxiv.org/abs/2512.00956

Authors: Jiale Chen, Vage Egiazarian, Roberto L. Castro, Torsten Hoefler, Dan Alistarh

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Quantizing LLM weights, Quantizing LLM, low-bit quantization errors, amplify low-bit quantization, LLM weights

Comments: Published as a conference paper at the 43rd International Conference on Machine Learning (ICML 2026): [this https URL](https://openreview.net/forum?id=ZsECxUkbKB)

Abstract:Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, covering both integer and floating-point formats. The resulting construction, WUSH, combines a Hadamard backbone with a data-dependent second-moment component to form a non-orthogonal transform that is provably near-optimal for FP and INT quantizers under mild assumptions while admitting an efficient fused GPU implementation. Empirically, WUSH improves W4A4 accuracy over the strongest Hadamard-based baselines (e.g., on Llama-3.1-8B-Instruct in MXFP4, it gains +2.8 average points with RTN and +0.7 with GPTQ) while delivering up to 5.8$\times$ per-layer throughput over BF16 via FP4 MatMul. Source code is available at this https URL.
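
The outlier problem that motivates this work can be seen in a few lines. The sketch below applies plain round-to-nearest (RTN) quantization with an AbsMax scale to one block of values, using a signed 4-bit integer grid as an assumed format. It does not implement WUSH or any transform; the helper names and the toy blocks are hypothetical.

```python
def rtn_absmax_quantize(block, bits=4):
    """Symmetric round-to-nearest quantization of one block.

    The scale maps the largest magnitude in the block to the largest
    representable integer (7 for signed 4-bit). Returns dequantized values.
    Note that Python's round() rounds exact ties to the nearest even integer.
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    scale = absmax / qmax
    return [round(x / scale) * scale for x in block]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy blocks (made-up numbers): identical small values, one with an outlier.
clean = [0.1, -0.2, 0.3, -0.4]
outlier = [0.1, -0.2, 0.3, 8.0]

clean_q = rtn_absmax_quantize(clean)
outlier_q = rtn_absmax_quantize(outlier)

# Compare the error on the three small values only.
clean_err = mse(clean[:3], clean_q[:3])
outlier_err = mse(outlier[:3], outlier_q[:3])
print(clean_err, outlier_err)
```

With the outlier present, the scale grows so much that the three small values all round to zero, which is the dynamic-range stretching the abstract refers to.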

Information Retrieval

1. 【2606.03866】aiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Link: https://arxiv.org/abs/2606.03866

Authors: Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Scaling recommender systems, Scaling recommender, language models, large language

Comments: 8 pages, 2 figures

2. 【2606.03728】Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Link: https://arxiv.org/abs/2606.03728

Authors: Mohamed Hesham Elganayni, Selim Saleh

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation systems, legal question answering, question answering typically, answering typically retrieve, Retrieval-augmented generation

Comments: 11 pages, 4 tables, 1 figure. Published at ASAIL 2026 (8th Workshop on Automated Semantic Analysis of Information in Legal Text), co-located with ICAIL 2026, Singapore

3. 【2606.03727】When Does Latent Reasoning Help? MeRa: Metric-Space Bias for Spatial Prediction

Link: https://arxiv.org/abs/2606.03727

Authors: Zhenyu Yu, Shuigeng Zhou

Categories: Information Retrieval (cs.IR)

Keywords: improved sequential recommendation, iteratively refining representations, Latent reasoning, improved sequential, sequential recommendation

Comments:

Abstract:Latent reasoning has improved sequential recommendation by iteratively refining representations before prediction, but does it help spatial prediction? We find that the answer depends on whether reasoning is grounded in the underlying metric space. Without such grounding, latent reasoning degrades spatial prediction below the unmodified baseline, while a learned metric-space bias derived from pairwise distances produces consistent gains. We formalize this finding through MeRa (Metric-space Reasoning), a lightweight backbone-agnostic module that can be inserted between any sequence encoder and its prediction heads. On the GETNext backbone, the gap between reasoning without and with metric-space bias reaches 4.5% NDCG@10. MeRa achieves the best NDCG@10 on all three spatial prediction benchmarks among the compared methods, surpassing recent approaches such as GeoMamba and HMST. We prove that metric-space-constrained reasoning converges to a unique fixed point and that N-step reasoning is strictly more expressive than (N-1)-step reasoning. A controlled experiment on CLEVR with Euclidean distance confirms that the finding generalizes beyond geographic coordinates. The code is included in the supplementary material.

4. 【2606.03718】MARS: Multi-rate Aggregation of Recency Signals for Sequential Recommendation across Sparse and Dense Regimes

Link: https://arxiv.org/abs/2606.03718

Authors: Zhenyu Yu, Shuigeng Zhou

Categories: Information Retrieval (cs.IR)

Keywords: Sequential recommenders weight, recommenders weight historical, weight historical interactions, single implicit decay, implicit decay schedule

Comments:

Abstract:Sequential recommenders weight historical interactions either through positional self-attention as in Transformers or through a single implicit decay schedule as in State-Space Models. Neither makes the multi-scale temporal structure of real user behaviour explicit. We propose MARS, an encoder-agnostic aggregation operator that consumes real timestamps and produces K summaries emphasising distinct recency scales, fused by a context-adaptive gate. MARS adds at most 6% parameters and runs in $\mathcal{O}(LdK)$ time. MARS adapts to data density by automatically selecting between two encoder instantiations: MARS-T (Transformer) for sparse data and MARS-M (Mamba) for dense data, based on the average sequence length of the training set. On five public benchmarks against ten Transformer- and Mamba-based baselines under a unified RecBole protocol, MARS attains the best HR@10 on every benchmark, with mean relative gain +19.7% over the strongest content-only Transformer baseline on sparse data (reaching +36.2% on Games) and +3.2% HR@10 / +0.9% NDCG over SIGMA on dense ML-1M at 42% fewer MFLOPs, occupying the accuracy-efficiency Pareto frontier across the data-density spectrum. A backbone-only ablation isolates the marginal contribution of MARS at +4% to +19% HR@10 on sparse data and motivates the dual-instantiation design. The code is included in the supplementary material.
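
The idea of producing K summaries at distinct recency scales and fusing them with a gate can be sketched as follows. The abstract does not state which decay kernel MARS uses, so the exponential decay here is purely an assumption, and the gate logits are passed in by hand instead of being computed from context by a learned network. All names are hypothetical.

```python
import math


def recency_summaries(embeddings, timestamps, now, time_scales):
    """Build one weighted-average summary per time scale.

    Assumed kernel: weight_i = exp(-(now - t_i) / tau). A small tau
    emphasises recent interactions; a large tau looks further back.
    Cost is O(L * d * K) for L items, d dimensions and K scales.
    """
    dim = len(embeddings[0])
    summaries = []
    for tau in time_scales:
        weights = [math.exp(-(now - t) / tau) for t in timestamps]
        norm = sum(weights)
        summaries.append(
            [sum(w * e[j] for w, e in zip(weights, embeddings)) / norm
             for j in range(dim)]
        )
    return summaries


def gated_fusion(summaries, gate_logits):
    """Fuse summaries with softmax gate weights (logits supplied by caller)."""
    peak = max(gate_logits)
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    gates = [e / total for e in exps]
    dim = len(summaries[0])
    return [sum(g * s[j] for g, s in zip(gates, summaries)) for j in range(dim)]


# Toy history (made-up): an old interaction and a recent one.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
timestamps = [0.0, 100.0]
short_view, long_view = recency_summaries(
    embeddings, timestamps, now=100.0, time_scales=[1.0, 1000.0]
)
fused = gated_fusion([short_view, long_view], gate_logits=[0.0, 0.0])
print(short_view, long_view, fused)
```

The short-scale summary is dominated by the most recent item, while the long-scale summary still reflects the older one; the gate decides how much each view contributes.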

5. 【2606.03711】Ghost: Plausible Yet Unlearnable Trajectories via On-Manifold Substitution for Next-POI Privacy

Link: https://arxiv.org/abs/2606.03711

Authors: Zhenyu Yu, Jihong Guan, Shuigeng Zhou

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: user future locations, trajectories inadvertently publishes, releases check-in trajectories, check-in trajectories inadvertently, future locations

Comments:

Abstract:A publisher who releases check-in trajectories inadvertently publishes a strong predictor of every user's future locations. We address this risk by generating unlearnable trajectories, perturbed sequences that yield victim models with degraded next-Point-of-Interest (next-POI) accuracy on clean test inputs. Direct ports of image-domain unlearnable examples fail on two counts. The published data must remain geographically and semantically plausible, and the perturbation must resist purification adversaries that exploit the structure of randomized defences. We propose Ghost, a manifold-aligned framework whose perturbations look like plausible human check-in sequences yet leave no learnable signal behind. Ghost steers each substitution onto the real-trajectory manifold through a frozen trajectory language model, so a denoising-bridge adversary has nothing to invert and a context-free frequency-table adversary recovers a near-uniform distribution. Across two standard benchmarks, and four attacker postures, Ghost achieves protection-gap competitive with the strongest deterministic baseline (PGD) while attaining the lowest restored accuracy under the bigram adaptive purification adversary on both datasets, and lies within one per-cell standard deviation of PGD on the protection-versus-purification-resistance plane. Ablations confirm the manifold prior subsumes the entropy-floor knob of prior randomized defences, with the frequency-table adversary's survival gap remaining within 0.04 even when twenty percent of the pairs are leaked.

6. 【2606.03565】Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

Link: https://arxiv.org/abs/2606.03565

Authors: Zifei Wang, Wei Wen, Qiang Ji, Ruizhi Qiao, Xing Sun

Categories: Information Retrieval (cs.IR)

Keywords: complete complex tasks, composing multiple skills, agents complete complex, LLM agents complete, complete complex

Comments: 19 pages, 8 figures

Abstract:LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front-end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: top-K joint correctness depends not only on the semantic relevance of each individual query-skill pair, but also on whether the skills retrieved together can collaborate to fulfill the task under the given query. Such "skill compatibility" cannot be derived from independent relevance alone. Yet existing LLM-based data synthesis pipelines can produce a direct supervision signal for "which skills should not be jointly retrieved under this query" -- namely the LLM's own rejection decisions -- and this signal is routinely discarded as low-quality data. To address this gap, we propose Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) skill retrieval benchmark targeting realistic agent skill routing. R3-Skill spans four language directions, features query phrasings close to real user requests, and is verified through multi-expert cross-checking. On R3-Skill, we build a two-stage retrieval system (R3-Embedding + R3-Reranker) with skill compatibility as an explicit training signal. Gradient analysis shows that the "push-away" signal is diluted by bilateral balancing in the bi-encoder but acts as lossless graded ranking supervision in the cross-encoder -- motivating its placement at the cross-encoder stage, as confirmed by ablations on two datasets. The R3-Embedding + R3-Reranker pipeline attains Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set-Compat = 0.3525 on R3-Skill. The dataset, training code and model weights are released as open source for agent skill routing.

7. 【2606.03535】Can LLM Rerankers Predict Their Own Ranking Performance?

Link: https://arxiv.org/abs/2606.03535

Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Jingtong Wu, Zengxin Han, Xueqi Cheng

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: effectiveness varies substantially, Retrieval effectiveness varies, substantially across queries, making it important, effectiveness varies

Comments:

8. 【2606.03367】Automating Information Extraction and Retrieval for Industrial Spare Parts Pooling

Link: https://arxiv.org/abs/2606.03367

Authors: Dyuman Bulloni, Rocco Felici, Oliver Avram, Anna Valente

Categories: Information Retrieval (cs.IR)

Keywords: reusing existing assets, Maintenance organizations, lack of actionable, existing assets, organizations in manufacturing

Comments:

Abstract:Maintenance organizations in manufacturing try to avoid downtime and unnecessary purchasing by reusing existing assets, but the main obstacle is not a lack of parts but a lack of actionable visibility across sites and partners. Inventories are distributed, described with inconsistent naming conventions, and contain duplicates and partially specified references, so the right part often exists somewhere but remains effectively undiscoverable. The paper proposes PhRAG, a hybrid Retrieval-Augmented Generation for Pooling this fragmented landscape into a Virtual Stock Pool (VSPool) that can be structured and searched as a single resource. Unstructured, heterogeneous spare part descriptions are structured via Named Entity Recognition (NER) into a shared virtual pool dataset and indexed to support robust retrieval even when users express needs in natural language rather than exact technical specifications. The proposed modular pipeline leverages the multitasking nature of generative language models to cover two dimensions that make industrial parts pooling challenging: (i) unstructured technical specifications from diverse data sources (e.g. new partners, catalogs, marketplace listings) are handled through an offline extraction and (ii) request variability at runtime (references, partial references, specifications, price/condition constraints) is handled through a hybrid RAG-based search engine capable of retrieving relevant components and justifying results. The framework demonstrates the potential of generative approaches compared with traditional NER approaches in the presence of data scarcity for technical specifications extraction and overcomes the opacity of standard information retrieval systems by generating justifications for retrieved components. The project's open-source code can be found at this https URL.

9. 【2606.03307】Generalizing Graph Foundation Models via Hyperbolic Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.03307

Authors: Yifan Jin, Qirui Ji, Bin Qin, Jiangmeng Li, Lixiang Liu, Fuchun Sun, Changwen Zheng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: leveraging large-scale pre-training, graph representation learning, Graph foundation models, dominant paradigm, representation learning

Comments: Accepted by KDD2026

Abstract:Graph foundation models (GFMs) have emerged as a dominant paradigm in graph representation learning by leveraging large-scale pre-training for cross-domain inference. However, the parameterized knowledge encoded within these models is insufficient to cope with distribution shifts, limiting their generalization ability. To mitigate this issue, retrieval-augmented generation (RAG) has been introduced to incorporate external knowledge at inference time. Nevertheless, existing RAG frameworks operating in Euclidean space suffer from a fundamental geometric limitation: the polynomial volume growth of Euclidean space is inherently mismatched with the tree-structured external knowledge bases. This mismatch leads to the loss of semantic granularity in retrieval and gives rise to the hubness problem. To address this limitation, we propose a Hyperbolic Retrieval-Augmented Generation (HyRAG) framework designed to enhance the generalization capabilities of GFMs. Specifically, the introduced Hyperbolic Knowledge Indexing module retains the tree-like hierarchies of the external knowledge base by modeling them within hyperbolic space. The Multi-granularity Retrieval module then provides GFMs with the global semantic anchors and local semantic nuances through coarse-grained and fine-grained knowledge retrieval, respectively. Finally, the Dual-path Fusion module achieves effective knowledge integration for graph tasks at both the feature and structural levels. Experiments on multiple graph benchmarks demonstrate significant improvements in the zero-shot setting, highlighting the generalization of our method for robust GFMs inference.

10. 【2606.03247】Structures Facilitate Retrieve, Rerank, and Generate

Link: https://arxiv.org/abs/2606.03247

Authors: Yeqin Zhang, Haomin Fu, Xujie Zhang, Cam-Tu Nguyen

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Document-grounded dialogue systems, domain-specific user questions, answer domain-specific user, Document-grounded dialogue, dialogue systems

Comments:

11. 【2606.03221】VirtualMLE: A Virtual ML Engineer that Optimizes Sequential Recommenders

Link: https://arxiv.org/abs/2606.03221

Authors: Shiteng Cao, Jingwen Liu, Junda She, Zhiheng Li

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, complex engineering workflows, automating complex engineering, Recent advancements

Comments:

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning, reflection, and tool utilization, unlocking new paradigms for automating complex engineering workflows. However, in the domain of sequential recommendation (SR), tuning models on new datasets still relies heavily on the manual trial-and-error of experienced machine learning engineers. To bridge this gap, we propose VirtualMLE, an LLM-agent framework that leverages the cognitive capabilities of LLMs to organize recommender optimization into a closed loop of execution, reflection, and memory update. After each trial, the agent explicitly analyzes the observed outcomes and stores concise heuristic feedback in a hierarchical memory system. We evaluate VirtualMLE on three Amazon SR benchmarks with two representative backbones, SASRec and HSTU. VirtualMLE reaches competitive recommendation quality with substantially fewer trials. Furthermore, we observe that cognition summaries distilled from previous datasets can significantly accelerate the search process on unseen datasets, demonstrating the potential of transferring tuning heuristics. Overall, our results provide compelling evidence that LLM agents equipped with reflection and memory can serve as practical virtual engineers to automate and amortize heuristic learning in SR optimization. Our code is available.

12. 【2606.03138】Section-Weighted Hybrid Approach for Legal Case Retrieval

Link: https://arxiv.org/abs/2606.03138

Authors: Rajith Arulanandam, Nisansa de Silva

Categories: Information Retrieval (cs.IR)

Keywords: surface word overlap, analogous precedents requires, precedents requires capturing, Finding truly analogous, requires capturing legal

Comments: 10 pages, 4 figures. Accepted to the International Conference on Natural Language Processing (ICNLP 2026)

Abstract:Finding truly analogous precedents requires capturing legal reasoning beyond surface word overlap. We present a two-stage, section-aware framework for legal case retrieval that first segments raw judgments into facts, issues, decision, and reasoning using a deterministic large language model (LLM) offline. In Stage 1, we combine parallel lexical (BM25) and semantic (dense ANN) whole-document searches via Reciprocal Rank Fusion (RRF) to form a high-recall candidate pool. In Stage 2, we perform fine-grained, like-for-like comparisons (e.g., query reasoning vs. candidate reasoning). To address the scale mismatch between unbounded lexical scores and cosine similarities, we apply query-wise Z-score normalization before aggregating signals with learned section weights. For the top results, the system returns the relevant section text with a concise, grounded rationale and party-stance labels. We evaluate on a jurisdiction-scale benchmark, demonstrating consistent gains over strong lexical and neural baselines while maintaining high candidate coverage.
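
The two scoring steps named in this abstract, Reciprocal Rank Fusion and query-wise Z-score normalization followed by weighted aggregation, can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the constant k=60 is a commonly used RRF default, and the signal names, weights, and scores are made up.

```python
from statistics import mean, pstdev


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.

    Each document receives sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by document id so the output is deterministic.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


def zscore(scores):
    """Standardize one signal across the candidates of a single query."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - mu) / sigma for doc, s in scores.items()}


def weighted_aggregate(signals, weights):
    """Z-normalize each signal per query, then combine with given weights."""
    total = {}
    for name, scores in signals.items():
        for doc, z in zscore(scores).items():
            total[doc] = total.get(doc, 0.0) + weights[name] * z
    return total


# Stage 1 (toy): a lexical ranking and a dense ranking for one query.
pool = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])

# Stage 2 (toy): unbounded BM25-like scores versus cosine similarities.
signals = {
    "reasoning_bm25": {"a": 12.0, "b": 30.0},
    "reasoning_cosine": {"a": 0.80, "b": 0.60},
}
weights = {"reasoning_bm25": 0.5, "reasoning_cosine": 0.5}  # hypothetical
combined = weighted_aggregate(signals, weights)
print(pool, combined)
```

After normalization both signals live on the same scale, so the large raw BM25 values no longer dominate the cosine similarities when the weights are applied.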

13. 【2606.03091】BAHSD: Bridging the Long-tail Gap via Adaptive Distillation in Black-box Sequential Recommendation

Link: https://arxiv.org/abs/2606.03091

Authors: Xi Zhou, Famin Wu, Mingming Li, Hongyue Zhang, Jiao Dai, Jizhong Han, Tao Guo

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Sequential recommendation systems, driven recent interest, capabilities locally, Sequential recommendation, systems are widely

Comments:

Abstract:Sequential recommendation systems are widely adopted but often deployed as black-box APIs, which has driven recent interest in model extraction to replicate their capabilities locally. However, the long-tail distribution induces severe signal heterogeneity: dense head sequences trigger the solidification of teacher preference, biasing extraction toward local patterns, while sparse tail sequences yield flat, noisy predictions. Existing one-size-fits-all extraction overlooks this disparity, resulting in noise overfitting and suboptimal knowledge transfer. We propose BAHSD, a black-box adaptive distillation framework that handles signal heterogeneity via a multi-scale consistency probing mechanism to implicitly quantify signal reliability. Based on this, an adaptive hierarchical objective is designed: dynamic-temperature KL divergence mitigates preference solidification for high-confidence signals, while ranking consistency and InfoNCE contrastive learning provide noise-robust enhancement for low-confidence signals. BAHSD consistently outperforms baselines, achieving up to 4.98% gain over the teacher and 80%+ improvement on tail users, offering a plug-and-play solution for high-fidelity black-box recommendation extraction.

14. 【2606.02995】Patcher: Post-Hoc Patching of Backdoored Large Language Models

Link: https://arxiv.org/abs/2606.02995

Authors: Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu, Minghong Fang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: bypass safety mechanisms, adversaries poison safety, poison safety alignment, safety alignment data, Large language models

Comments: To appear in the USENIX Security Symposium, 2026

Abstract:Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.

15. 【2606.02992】Slipstream: Locality-Aware Graph Index Construction for Streaming Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2606.02992

Authors: Shubing Yang, Dongfang Zhao

Categories: Information Retrieval (cs.IR)

Keywords: high-recall approximate nearest, require streaming ANNS, approximate nearest neighbor, real-time applications require, nearest neighbor search

Comments:

Abstract:Graph indexes are widely used for high-recall approximate nearest neighbor search (ANNS), but many real-time applications require streaming ANNS. In these real-time applications, continuously arriving embeddings must search the existing graph for candidate neighbors before updating graph edges, which makes repeated index construction a bottleneck for streaming ingestion workloads. We propose Slipstream, a new method that significantly reduces the computational cost of frequent insertions in graph indexes for ANNS. The core idea of Slipstream is exploiting the continuity in vector streams: the newly arrived point starts from promising candidates found during the previous insertion rather than searching from the entry point. More technically, Slipstream evaluates distinct subsets of starting candidates followed by an adaptive controller that narrows or widens the range according to the stream's stability. We further show that Slipstream is beyond heuristic: We derive an abstract model to characterize Slipstream's performance and analyze its theoretical bounds. We have implemented Slipstream in two popular open-source libraries (Faiss, HNSWLib) and compared it with four baseline methods on five streaming vector datasets. Experimental results show that Slipstream achieves up to 30.8× higher end-to-end throughput than baselines while maintaining at least 0.95 recall@10.
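
The warm-start idea in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: `greedy_search`, the one-dimensional chain graph, and the `ef` parameter are all assumptions made for illustration, and the adaptive controller and the edge-update step are omitted. The sketch only shows why seeding the search with the previous insertion's neighbors can cost fewer distance evaluations than starting from a fixed entry point when consecutive stream points are close together.

```python
import heapq
import math


def greedy_search(graph, vectors, query, starts, ef=2):
    """Best-first graph search. Returns (ids nearest-first, number of distance evaluations)."""
    def dist(i):
        return math.dist(vectors[i], query)

    visited = set(starts)
    candidates = [(dist(s), s) for s in starts]  # min-heap of nodes still to expand
    evaluations = len(candidates)
    best = sorted(candidates)[:ef]               # current top-ef results
    heapq.heapify(candidates)
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > best[-1][0]:
            break  # the closest unexpanded node is already worse than every kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            evaluations += 1
            d_nb = dist(nb)
            heapq.heappush(candidates, (d_nb, nb))
            best = sorted(best + [(d_nb, nb)])[:ef]
    return [i for _, i in best], evaluations


# Toy index (hypothetical): 20 points on a line, each linked to its immediate neighbors.
vectors = [(float(i),) for i in range(20)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}

# A previous insertion near 14.1 supplies the seeds for the next point at 15.2.
prev_neighbors, _ = greedy_search(graph, vectors, (14.1,), starts=[0])
cold_ids, cold_cost = greedy_search(graph, vectors, (15.2,), starts=[0])
warm_ids, warm_cost = greedy_search(graph, vectors, (15.2,), starts=prev_neighbors)
```

Both searches return the same neighbors here, but the warm-started one touches far fewer nodes. The gain depends on how continuous the stream is, which is presumably why the paper pairs the idea with a controller that widens or narrows the candidate range.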

16. 【2606.02883】LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems

Link: https://arxiv.org/abs/2606.02883

Authors: Amir Ghasemian, Homa Hosseinmardi, Upasana Dutta, Duncan J. Watts

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Keywords: shape daily behavior, daily behavior, sophisticated systems, grown from content-organization, content-organization tools

Comments: 30 pages total; 11 pages, 5 figures, 2 tables (main text); 19 pages, 11 figures, 9 tables (appendix)

Abstract:Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g. how personalization reshapes exposure in socially consequential domains. We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and treat prompt design as a value-laden rather than neutral default.

17. 【2606.02814】Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors

Link: https://arxiv.org/abs/2606.02814

Authors: Francisco Valentini, Edgar Altszyler, Martin Fajcik

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: annotated query-document pairs, query-document pairs, estimate query-document relevance, relevance, annotated query-document

Comments:

18. 【2606.02737】Attention Calibration for Position-Fair Dense Information Retrieval

Link: https://arxiv.org/abs/2606.02737

Authors: Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Dense retrieval models, retrieval effectiveness degrades, Dense retrieval, exhibit positional bias, retrieval effectiveness

Comments:

19. 【2606.02584】IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

Link: https://arxiv.org/abs/2606.02584

Authors: Ayman Ali Sharara

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: natural language processing, Idiomatic expressions remain, expressions remain, remain a persistent, persistent challenge

Comments: 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub

20. 【2606.02581】Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs

Link: https://arxiv.org/abs/2606.02581

Authors: Sanjay Mishra

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: improves factual grounding, fundamental three-way tension, deeper retrieval improves, inflates token costs, retrieval improves factual

Comments: 13 pages, 18 figures, 8 tables

Abstract:Retrieval-augmented generation (RAG) faces a fundamental three-way tension: deeper retrieval improves factual grounding but inflates token costs and end-to-end latency. Static retrieval configurations cannot resolve this tension across heterogeneous query workloads -- simple definitional queries waste budget on unnecessary context, while complex analytical prompts are underserved by shallow retrieval. This paper introduces Cost-Aware RAG (CA-RAG), a per-query routing framework that selects from a discrete catalog of strategy bundles -- each coupling a retrieval depth (from retrieval-free direct inference to top-k = 10 dense retrieval) with a fixed generation profile -- by maximizing a scalar utility that linearly combines an estimated quality prior with normalized penalties for predicted latency and total billed tokens. CA-RAG is implemented with FAISS-backed dense retrieval and OpenAI chat/embedding APIs, and evaluated on a 28-query benchmark spanning four bundles. The router dynamically exercises all bundles, achieving 26% fewer billed tokens than always-heavy retrieval and 34% lower mean latency than always-direct inference while maintaining equivalent answer quality. Per-query delta analysis reveals that savings are non-uniform and concentrated in simpler queries, motivating complexity-aware guardrails. Sensitivity analysis confirms that the same bundle catalog supports multiple cost-latency-quality operating points through weight adjustment alone. All results are generated directly from logged CSV artifacts for full reproducibility. CA-RAG provides a transparent, auditable foundation for cost-conscious LLM deployments.
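
The routing rule described in this abstract (a scalar utility that linearly combines a quality prior with normalized latency and token penalties) can be sketched as follows. This is a minimal illustration, not the paper's code: the `Bundle` fields, the four catalog entries, their numeric estimates, the default weights, and the max-based normalization are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    name: str
    top_k: int            # retrieval depth; 0 means retrieval-free direct inference
    quality_prior: float  # estimated answer quality for this query, in [0, 1] (made-up value)
    latency_s: float      # predicted end-to-end latency in seconds (made-up value)
    billed_tokens: int    # predicted total billed tokens (made-up value)


def route(bundles, w_quality=1.0, w_latency=0.5, w_tokens=0.5):
    """Return the bundle that maximizes quality minus normalized cost penalties."""
    max_latency = max(b.latency_s for b in bundles)
    max_tokens = max(b.billed_tokens for b in bundles)

    def utility(b):
        # Penalties are scaled to [0, 1] by the most expensive bundle in the catalog.
        return (w_quality * b.quality_prior
                - w_latency * b.latency_s / max_latency
                - w_tokens * b.billed_tokens / max_tokens)

    return max(bundles, key=utility)


# Hypothetical per-query estimates for a four-bundle catalog.
CATALOG = [
    Bundle("direct", 0, 0.60, 1.0, 300),
    Bundle("light", 3, 0.75, 1.5, 900),
    Bundle("medium", 5, 0.85, 2.0, 1500),
    Bundle("heavy", 10, 0.90, 3.0, 3000),
]

cost_sensitive = route(CATALOG).name                             # "direct"
balanced = route(CATALOG, w_latency=0.1, w_tokens=0.1).name      # "medium"
quality_only = route(CATALOG, w_latency=0.0, w_tokens=0.0).name  # "heavy"
```

Changing only the weights moves the choice across the catalog, which mirrors the abstract's point that one bundle catalog supports several cost-latency-quality operating points.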

Computer Vision

1. 【2606.03994】SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Link: https://arxiv.org/abs/2606.03994

Authors: Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim, Hyunsoo Cha, Hanbyul Joo

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Reconstructing interactive, single image, critical bottleneck, bottleneck for robotic, Reconstructing

Comments: Project Page: [this https URL](https://snuvclab.github.io/SimuScene/)

Abstract:Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene's utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.

2. 【2606.03992】Exploring Easy Boosts for Lidar Semantic Scene Completion

Link: https://arxiv.org/abs/2606.03992

Authors: Tetiana Martyniuk, Jonathan Seele, Alexandre Boulch, Gilles Puy, Renaud Marlet, Raoul de Charette

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: complex architectural redesigns, requiring complex architectural, semantic scene completion, free lunch, paper investigates

Comments: Accepted to ICIP 2026

Abstract:This paper investigates "free lunch" strategies to boost the performance of lidar semantic scene completion (SSC) without requiring complex architectural redesigns. We first demonstrate that endowing input point clouds with semantic pseudo-labels from off-the-shelf segmentors significantly improves the performance of existing architectures. By evaluating these models against an oracle, we establish that high-quality semantic priors are a primary driver of mIoU gains. Furthermore, we equip the input lidar scan with visibility information that distinguishes between empty and unknown spaces, which provides a secondary performance boost across the tested architectures. Using these simple enhancements, we observe that older models remain competitive with state-of-the-art systems, and can even outperform them. Our code is available at this https URL.

3. 【2606.03990】Neuron Populations Exhibit Divergent Selectivity with Scale

Link: https://arxiv.org/abs/2606.03990

Authors: Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural networks evolve, networks evolve predictably, Rosetta Neurons, neural networks, networks evolve

Comments: Project page and code: [this https URL](https://avdravid.github.io/rosetta-neuron-scaling/)

4. 【2606.03989】PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation

Link: https://arxiv.org/abs/2606.03989

Authors: Shinjeong Kim, Ignacio Alzugaray, Callum Rhodes, Paul H. J. Kelly, Andrew J. Davison

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Images composed, computer vision algorithms, Gaussian Belief Propagation, vision algorithms, underlying computations

Comments:

Abstract:Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: this https URL

5. 【2606.03986】NewtPhys: Do Foundation Models Understand Newtonian Physics?

Link: https://arxiv.org/abs/2606.03986

Authors: Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual question-answering tasks, Previous work, question-answering tasks, work has evaluated, evaluated physics reasoning

Comments:

Abstract:Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at this https URL.

6. 【2606.03985】Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Link: https://arxiv.org/abs/2606.03985

Authors: Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: causal attention trained, causal attention, attention trained, billion-scale motion corpus, GPT-style Transformer

Comments: Accepted at CVPR 2026

Abstract:We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

7. 【2606.03976】Formalizing the Binding Problem

Link: https://arxiv.org/abs/2606.03976

Authors: Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Keywords: binding, binding information, call binding information, circle is blue, blue

Comments: Accepted to ICML 2026

Abstract:Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information, after all misattributing features to wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

8. 【2606.03972】AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Link: https://arxiv.org/abs/2606.03972

Authors: Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, Zhipeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adversarial Distillation framework, Adversarial Distillation, Asymmetric Adversarial Distillation, adopt adversarial distillation, adversarial distillation begins

Comments: ICML 2026. Project page: [this https URL](https://aad-1.github.io/)

Abstract:We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.

9. 【2606.03971】Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

Link: https://arxiv.org/abs/2606.03971

Authors: Yonghao Yu, Lang Huang, Runyi Li, Zerun Wang, Toshihiko Yamasaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Causal video generators, Causal, Causal video, autoregressive video diffusion, future

Comments:

Abstract:Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: this https URL.

10. 【2606.03954】VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

Link: https://arxiv.org/abs/2606.03954

Authors: Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: systems increasingly assist, increasingly assist humans, physical actions carry, physical tasks, systems increasingly

Comments: 18 pages, 5 tables, 5 figures

Abstract:As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at this https URL.

11. 【2606.03951】Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Link: https://arxiv.org/abs/2606.03951

Authors: Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital environments offers, rich procedural knowledge, offers a vast, underexplored resource, resource of authentic

Comments: Accepted by CVPR 2026

Abstract:Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at this https URL.

12. 【2606.03925】Adaptive Causal Alignment for High-Confidence Adversarial Training

Link: https://arxiv.org/abs/2606.03925

Authors: Zhiming Luo, Kejia Zhang, Yingxin Lai, Junwei Wu, Juanjuan Weng, Shaozi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Inverse adversarial training, non-causal background correlations, intrinsic object semantics, adversarial training leverages, leverages high-confidence predictions

Comments:

Abstract:Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

13. 【2606.03921】GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Link: https://arxiv.org/abs/2606.03921

Authors: Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Converting multi-view RGB, environments remains challenging, pipelines produce monolithic, multi-view RGB observations, RGB observations

Comments:

Abstract:Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.

14. 【2606.03920】Benchmarking Visual State Tracking in Multimodal Video Understanding

Link: https://arxiv.org/abs/2606.03920

Authors: Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual state tracking, recognizing isolated moments, Large Language Models, Multimodal Large Language, visual state

Comments: Website: [this https URL](https://vision-x-nyu.github.io/vstat-site/)

Abstract:Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

15. 【2606.03915】PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene Completion

Link: https://arxiv.org/abs/2606.03915

Authors: Qingdong Xu, Jiajun Zhu, Shilin Zhu, Xinjing He, Chao Lu, Huanran Wang, Jiyao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion-based framework, framework for large-scale, large-scale LiDAR scene, scene completion, propose PatchScene

Comments: 10 pages, 5 figures, 5 tables

Abstract:We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.

16. 【2606.03911】Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Link: https://arxiv.org/abs/2606.03911

Authors: Yoad Tewel,Yuval Atzmon,Gal Chechik,Lior Wolf

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern generative models, typically requires massive, requires massive datasets, Modern generative, editing typically requires

Comments: Accepted at ICML 2026. Project page is at [this https URL](https://research.nvidia.com/labs/par/byg/)

Click to view abstract

Abstract:Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

17. 【2606.03909】SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation

Link: https://arxiv.org/abs/2606.03909

Authors: Qingpo Wuwu,Xiaobao Wei,Peng Chen,Nan Huang,Zhongyu Zhao,Hao Wang,Ming Lu,Ningning Ma,Shanghang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: capture fine details, slow rendering speeds, shown promising results, prohibitive storage costs, Splatting has shown

Comments:

Click to view abstract

Abstract:While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on the Waymo and nuScenes datasets demonstrate that SparseStreet achieves a compression ratio of up to 80% with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: this https URL.

18. 【2606.03904】MAdam: Metric-Aware Multi-Objective Adam

Link: https://arxiv.org/abs/2606.03904

Authors: Fengbei Liu,Rachit Saluja,Sunwoo Kwak,Ruibo Wang,Ruining Deng,Heejong Kim,Johannes C. Paetzold,Mert R. Sabuncu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pareto-based families, machine learning problems, underlies many machine, families almost universally, universally hand

Comments:

Click to view abstract

Abstract:Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma and Ba, 2015). We show this coupling introduces two systematic gaps between the solver's intent and the optimizer's execution. The first is a weighting mismatch: Adam's second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam's adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam's second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.

19. 【2606.03903】An Attention-Based Denoising Model for Diffusion Weighted Imaging

Link: https://arxiv.org/abs/2606.03903

Authors: Prithviraj Verma,Pawan Kumar,Chandan Deshani,Prasun Chandra Tripathi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: whole-body cancer screening, Diffusion-weighted imaging, long acquisition time, cancer screening, whole-body cancer

Comments:

Click to view abstract

Abstract:Diffusion-weighted imaging (DWI) is used for whole-body cancer screening, but it typically requires a long acquisition time. When the scan time is reduced, the image quality often suffers, leading to increased noise in the scans. Magnitude reconstruction in DWI introduces signal-dependent Rician noise, which makes denoising more challenging for conventional convolution-based methods. To address this limitation, we propose a noise-aware attention-driven denoising framework that integrates hierarchical Swin Transformer window attention with transformer-based multi-dimensional gated refinement for DWI restoration. The model incorporates explicit noise-level conditioning and residual reconstruction to enable adaptive suppression of heteroscedastic noise across a wide range of corruption levels. Experimental evaluation on corrupted DWI scans demonstrates strong restoration performance. Our model achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%, while maintaining stable behavior under severe noise conditions. These results indicate that attention-guided contextual modeling combined with channel-adaptive refinement provides a robust and generalizable solution for DWI denoising.
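
As a rough illustration of why magnitude reconstruction yields signal-dependent noise, the sketch below corrupts a toy 1D signal with Rician noise (the magnitude of a complex value whose real and imaginary parts each carry Gaussian noise) and computes PSNR. It is not the paper's pipeline: the toy signal, the `sigma` value, the peak of 1.0, and the helper names `add_rician_noise` and `psnr` are all assumptions made for this example.

```python
import math
import random


def add_rician_noise(signal, sigma, rng):
    """Return the magnitude of (signal + complex Gaussian noise) per sample."""
    noisy = []
    for s in signal:
        real = s + rng.gauss(0.0, sigma)   # noise on the real channel
        imag = rng.gauss(0.0, sigma)       # noise on the imaginary channel
        noisy.append(math.hypot(real, imag))
    return noisy


def psnr(reference, estimate, peak):
    """Peak signal-to-noise ratio in dB for two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy signal (assumed values): dark background followed by bright tissue.
rng = random.Random(0)
clean = [0.0] * 500 + [1.0] * 500
noisy = add_rician_noise(clean, sigma=0.05, rng=rng)

# The background is biased upward (magnitudes cannot be negative),
# while the bright region stays close to its true value.
background_mean = sum(noisy[:500]) / 500
tissue_mean = sum(noisy[500:]) / 500
print(background_mean, tissue_mean, psnr(clean, noisy, peak=1.0))
```

The bias differs between the dark and bright regions, which is the signal dependence the abstract refers to; additive Gaussian noise would leave both regions unbiased.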

20. 【2606.03893】Electromagnetic Navigation for Femoral Osteotomy Using High-Accuracy X-ray-to-CT Registration

Link: https://arxiv.org/abs/2606.03893

Authors: Roman Flepp,Arend Nieuwland,Bastian Sigrist,Philipp Fürnstahl,Lilian Calvet,Thomas Dreher

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: osteotomies remains challenging, corrective femoral osteotomies, femoral osteotomies remains, remains challenging, fluoroscopic images

Comments: Will be published in the International Journal of Computer Assisted Radiology and Surgery

Click to view abstract

Abstract:Accurate execution of preoperative plans in corrective femoral osteotomies remains challenging. Current techniques are limited by variable accuracy, invasiveness, and radiation exposure, with free-hand methods and patient-specific instrumentation (PSI) often requiring 30 and 6 fluoroscopic images, respectively. We present an integrated, electromagnetic tracking (EMT)-based navigation system for femoral osteotomies that minimizes dissection and intraoperative fluoroscopy. The system couples CT-based preoperative planning with one-time intraoperative C-arm calibration and accurate X-ray-to-CT registration from two fluoroscopic images acquired at initialization. This enables real-time, fluoroscopy-free EMT navigation of the saw blade and bone fragments relative to the preoperative plan, and is compatible with uniplanar and biplanar osteotomies. In a feasibility study using 18 synthetic femora, EMT guidance significantly outperformed free-hand execution in total angular error ($(3.05 \pm 0.75)^\circ$ vs. $(6.32 \pm 2.36)^\circ$, $p=0.031$), assuming the same minimal surgical exposure for both. No EMT-guided trials exceeded the $5^\circ$ clinical threshold, whereas free-hand produced 4 outliers of 6 trials. The system achieved statistical equivalence ($\pm 2^\circ$, $\pm 2\,\text{mm}$) to PSI for total angular ($p \le 0.02$) and total translational ($p=0.048$) errors, with no significant differences in user questionnaire scores. By transferring preoperative plans using only two fluoroscopic images while matching PSI accuracy without additional surgical exposure, the proposed system motivates subsequent cadaveric and clinical validation.

21. 【2606.03890】OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Link: https://arxiv.org/abs/2606.03890

Authors: Yifei Li,Pengyiang Liu,Yuhang Zang,Zhongyue Shi,Qi Fu,Hongye Hao,Jiwen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal agents, agents in robotics, current view, autonomous driving, driving must reason

Comments: 48 pages, 12 figures, 15 tables. Project page: [this https URL](https://internlm.github.io/OVO-S-Bench/)

Click to view abstract

Abstract:Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

22. 【2606.03888】CoralBay: A Self-Supervised CT Foundation Model

Link: https://arxiv.org/abs/2606.03888

Authors: Ioannis Gatopoulos,Nicolas Känzig,Sebastian Otálora,Fei Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: producing general-purpose visual, enabled large-scale pre-training, general-purpose visual representations, producing general-purpose, natural images

Comments:

Click to view abstract

Abstract:Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.

23. 【2606.03879】Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

Link: https://arxiv.org/abs/2606.03879

Authors: Wei Ding,Yudong Zhang,Ruobing Xie,Xingwu Sun,Jiansheng Chen,Yu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: heterogeneous visual streams, foundation models scale, diverse encoders interact, joint training, visual streams

Comments:

Click to view abstract

Abstract:As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
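
The two axes defined in the abstract can be written down directly. The sketch below is only an illustration: it checks that five encoders give 31 non-empty subsets, then computes Capacity (score of an encoder alone) and Necessity (drop when it is removed from the full pool) on a three-encoder toy pool. The encoder names and every score in `toy_scores` are invented placeholders, not results from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(pool):
    """All non-empty subsets of the encoder pool, as frozensets."""
    return [
        frozenset(combo)
        for size in range(1, len(pool) + 1)
        for combo in combinations(pool, size)
    ]


def capacity(scores, encoder):
    """Score the encoder reaches when it is the only encoder."""
    return scores[frozenset([encoder])]


def necessity(scores, encoder, pool):
    """Score drop when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {encoder}]


# Five encoders give 2**5 - 1 = 31 non-empty subsets, as in the abstract.
five_encoders = ["enc1", "enc2", "enc3", "enc4", "enc5"]
subset_count = len(nonempty_subsets(five_encoders))

# Toy pool with made-up scores (placeholders for illustration only).
pool = ["A", "B", "C"]
toy_scores = {
    frozenset("A"): 60.0,
    frozenset("B"): 58.0,
    frozenset("C"): 50.0,
    frozenset("AB"): 61.0,
    frozenset("AC"): 64.0,
    frozenset("BC"): 62.0,
    frozenset("ABC"): 64.5,
}

for name in pool:
    print(name, capacity(toy_scores, name), necessity(toy_scores, name, pool))
```

In this toy table, encoder C has the lowest Capacity but the highest Necessity, which is one way the two axes can disagree.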

24. 【2606.03877】MLP Splatting: Object-Centric Neural Fields

Link: https://arxiv.org/abs/2606.03877

Authors: Shinjeong Kim,Yuzhou Cheng,Xin Kong,Paul H. J. Kelly,Andrew J. Davison

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: photorealistic novel-view synthesis, representations are fundamental, Neural Radiance Fields, novel-view synthesis, Gaussian Splatting

Comments:

Click to view abstract

Abstract:3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but lack the ability to easily decompose scene elements into a few primitives, requiring additional segmentation or grouping for object-level manipulation. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis. MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray-primitive interactions. Our primitives are supervised using RGB supervision alone, which yields primitives that represent local scene regions often corresponding to objects or object parts, enabling interactive object-level editing without segmentation masks by selecting a handful of primitives. Our method, augmented with optional semantic feature distillation, enables open-vocabulary scene interaction and open-set instant segmentation. Compared to state-of-the-art methods, we achieve substantially lower memory usage (1/15$\times$) and faster rendering (3$\times$), as we show in our experiments compared to semantic 3DGS methods. Project Page: this https URL

25. 【2606.03875】Seg2Track++: Probabilistic Track Validation and Data Association for Multi-Object Tracking and Segmentation

Link: https://arxiv.org/abs/2606.03875

Authors: Diogo Mendonça,Tiago Barros,Cristiano Premebida,Urbano J. Nunes

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous systems require, ensuring consistent object, precise mask-level delineation, consistent object identities, robust Multi-Object Tracking

Comments:

Click to view abstract

Abstract:Autonomous systems require robust Multi-Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask-level delineation. Foundation models such as SAM2 have shown strong zero-shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false-positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero-shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence-Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false-positive propagation, and robust track management without fine-tuning.
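
The abstract names Mask Centroid Distance and a Bernoulli filter without defining them, so the sketch below is only one plausible reading. It assumes the distance is the Euclidean distance between the centroids of two binary masks, and it replaces the Bernoulli filter with a plain binary Bayes update of a track's existence probability. The function names, `p_detect`, and `p_false_alarm` are hypothetical and not taken from the paper.

```python
import math


def mask_centroid(mask):
    """Mean (row, col) of the non-zero pixels of a binary mask."""
    row_sum = col_sum = count = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        raise ValueError("mask has no foreground pixels")
    return row_sum / count, col_sum / count


def mask_centroid_distance(mask_a, mask_b):
    """Euclidean distance between two mask centroids (assumed definition)."""
    (ra, ca), (rb, cb) = mask_centroid(mask_a), mask_centroid(mask_b)
    return math.hypot(ra - rb, ca - cb)


def update_existence(prior, detected, p_detect=0.9, p_false_alarm=0.1):
    """Binary Bayes update of the probability that a track really exists."""
    if detected:
        like_exists, like_ghost = p_detect, p_false_alarm
    else:
        like_exists, like_ghost = 1.0 - p_detect, 1.0 - p_false_alarm
    numerator = like_exists * prior
    return numerator / (numerator + like_ghost * (1.0 - prior))


mask_a = [[1, 1, 0, 0, 0, 0],
          [1, 1, 0, 0, 0, 0]]
mask_b = [[0, 0, 0, 1, 1, 0],
          [0, 0, 0, 1, 1, 0]]
distance = mask_centroid_distance(mask_a, mask_b)

# A track that keeps missing detections loses existence probability.
belief = 0.5
for _ in range(3):
    belief = update_existence(belief, detected=False)
print(distance, belief)
```

A track whose existence probability falls below a chosen threshold could then be dropped, which is the general idea behind suppressing ghost tracks.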

26. 【2606.03874】DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Link: https://arxiv.org/abs/2606.03874

Authors: Koki Nagano,Hongyu Liu,Seonwook Park,Tianye Li,Amrita Mazumdar,Christian Jacobsen,Shengze Wang,Michael Stengel,Rajarshi Roy,Ka Chun Cheung,Simon See,Shalini De Mello

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: present DyaPlex, streaming motion pathway, full-duplex speech model, model designed, model

Comments: Project page: [this https URL](https://research.nvidia.com/labs/amri/projects/DyaPlex)

Click to view abstract

Abstract:We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

27. 【2606.03871】Visual Instruction Tuning Aligns Modalities through Abstraction

Link: https://arxiv.org/abs/2606.03871

Authors: Luis Palacios,Lorenzo Basile,Diego Doimo,Alberto Cazzaniga

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, pre-trained Large Language, Language Model, Large Language, information alongside text

Comments:

Click to view abstract

28. 【2606.03868】Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

Link: https://arxiv.org/abs/2606.03868

Authors: Dingrui Wang,YuAn Wang,Jinkun Liu,Yue Zhang,Mattia Piccinini,Yu Sun,Johannes Betz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent world action, aligning broad visual-dynamics, Recent world, broad visual-dynamics priors, leverage video foundation

Comments: 9 pages, 5 figures

Click to view abstract

Abstract:Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.

29. 【2606.03837】Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

Link: https://arxiv.org/abs/2606.03837

Authors: Luc P.J. Sträter,Hazel Doughty

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Parameter-efficient fine-tuning, trainable parameters, making it attractive, computation are expensive, small number

Comments:

Click to view abstract

Abstract:Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT and probe. In this work we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused and spatially dense settings, with a particular focus on scenarios with limited data where parameter-efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.

30. 【2606.03827】Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

Link: https://arxiv.org/abs/2606.03827

Authors: Shaokun Lan,Haoran Dou,Jinghan Huang,Arezoo Zakeri,Fengming Lin,Zherui Zhou,Jinming Duan,Alejandro F. Frangi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: medical devices require, populations of anatomies, medical devices, devices require, require the generation

Comments: This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

Click to view abstract

Abstract:In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
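
To see why a truncated Fourier parameterisation closes the cycle by construction, the sketch below synthesises a scalar latent trajectory z(t) = a0 + Σ_k [a_k cos(2πkt) + b_k sin(2πkt)] over one normalised cardiac cycle and measures the gap between its first and last frame. The scalar latent, the coefficient values, and the function names are assumptions made for illustration; the paper's latent is a mesh VAE code and its exact parameterisation is not given in the abstract.

```python
import math


def fourier_trajectory(a0, harmonics, num_frames):
    """Sample z(t) at num_frames + 1 points covering t = 0 .. 1 inclusive.

    harmonics is a list of (a_k, b_k) pairs for k = 1, 2, ...
    """
    trajectory = []
    for frame in range(num_frames + 1):
        t = frame / num_frames
        z = a0
        for k, (a_k, b_k) in enumerate(harmonics, start=1):
            phase = 2.0 * math.pi * k * t
            z += a_k * math.cos(phase) + b_k * math.sin(phase)
        trajectory.append(z)
    return trajectory


def cycle_closure_error(trajectory):
    """Absolute gap between the end and the start of the cycle."""
    return abs(trajectory[-1] - trajectory[0])


# Hypothetical coefficients for a three-harmonic truncation.
coefficients = [(0.8, -0.3), (0.2, 0.1), (-0.05, 0.02)]
trajectory = fourier_trajectory(a0=1.0, harmonics=coefficients, num_frames=20)
closure = cycle_closure_error(trajectory)
print(closure)
```

Every harmonic has an integer number of periods per cycle, so the closure error is zero up to floating-point rounding regardless of the sampled coefficients.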

31. 【2606.03806】TeX-1500: A Paired Real-World LWIR Hyperspectral Dataset and Benchmark for Temperature-Emissivity-Texture Decomposition

Link: https://arxiv.org/abs/2606.03806

Authors: Cheng Dai,Jiale Lin,Hongyi Xu,Bingxuan Song,Ziyang Xie,Fanglin Bao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: material spectral response, object heat state, infrared hyperspectral imaging, recover object heat, long-wave infrared hyperspectral

Comments:

Click to view abstract

Abstract:Temperature-emissivity-texture (TeX) decomposition seeks to recover object heat state, material spectral response, and visible-like geometric texture from long-wave infrared hyperspectral imaging (LWIR HSI). Existing TeX pipelines are mainly scene-specific inverse solvers, and the lack of paired LWIR HSI-TeX supervision has limited learning-based decomposition. To address this gap, we introduce TeX-1500, a large-scale paired LWIR HSI-TeX dataset and benchmark for supervised HSI-to-TeX decomposition. TeX-1500 contains 1,522 calibrated real-scene pairs from DARPA Invisible Headlights (DARPA IH) pushbroom imagery and our FTIR acquisitions, covering five locations, four seasons, diverse acquisition times, heterogeneous wavelength layouts, and two sensor families. Each sample stores a calibrated valid-band radiance cube, calibrated wavelength positions, and aligned temperature, emissivity, and texture supervision constructed through a consistent restoration and TeX-construction protocol. We further provide TeX-UNet, a simple wavelength-aware baseline that maps calibrated HSI bands and wavelength positions to TeX fields. Experiments on the held-out DARPA IH pushbroom scenes and zero-/few-shot transfer to FTIR scenes show that TeX-1500 provides usable paired supervision and a measurable benchmark for data-driven physical-property-centered thermal perception.

32. 【2606.03802】Template Collapse and Information-Theoretic Limits in Camera rPPG Pulse Morphology Restoration

Link: https://arxiv.org/abs/2606.03802

Authors: Achraf Ben Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: passive cardiovascular monitoring, enables passive cardiovascular, camera remote photoplethysmography, face camera remote, encoding arterial stiffness

Comments:

Click to view abstract

Abstract:Objective: Consumer face camera remote photoplethysmography (rPPG) enables passive cardiovascular monitoring, but whether single-cycle waveform morphology encoding arterial stiffness biomarkers is recoverable from this measurement has not been characterised. Methods: We evaluated 16 architectures spanning six families on 153 subjects across three datasets, introducing cross-subject Pearson r to distinguish subject-specific recovery from template collapse. Results: No architecture recovered subject-specific morphology (cross-subject r range 0.773–0.9999; ground-truth ceiling 0.601). Supervised Contrastive (SupCon) converged to log N = 4.844, constituting the strongest available empirical evidence that no discriminative morphological structure is extractable from single-cycle rPPG by the encoder families tested. The VAE decoder restores population-level harmonic content absent from the rPPG input (H2/H1: 0.310 output vs. 0.275 input), generalising zero-shot to UBFC (r = +0.708); a directional hallucination gap (p = 0.150) suggests partial signal reading. Anti-collapse objectives fail when input carries no discriminative structure. Significance: Consumer cameras cannot encode individual arterial morphology; cross-subject r is a necessary collapse diagnostic for waveform reconstruction benchmarks.

Submission history: [v1] Tue, 2 Jun 2026 15:50:07 UTC (533 KB), from Achraf Ben Ahmed

33. 【2606.03795】Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

Link: https://arxiv.org/abs/2606.03795

Authors: Akayou A. Kitessa,Yijun Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language models map, map visual features, shared embedding space, models map visual, Vision-language models

Comments:

Click to view abstract

Abstract:Vision-language models map visual features into a shared embedding space through learned projection layers, yet it remains unclear how these transformations alter the structure of visual information. This study examines changes in representation through spatial-frequency accessibility, measured by the linear recoverability of band-limited Fourier energy from model representations. To isolate effects beyond dimensionality reduction, we introduce Residual Spectral Loss (RSL), which evaluates changes relative to a dimension-matched random projection baseline. To reduce confounding effects from optimization, the analysis uses pretrained models with all parameters frozen. The experimental results show consistent frequency-dependent changes in accessibility across CLIP and DINOv2 on ImageNet and MS-COCO datasets. Spectral accessibility follows a non-monotonic trajectory across depth, peaking at intermediate layers before decreasing toward the output representation. The final transformation differs across architectures: CLIP's learned projection is spectrally neutral, with changes explained by compression, whereas DINOv2's [CLS] pooling induces a structured loss across the spectrum. These findings identify intermediate layers and pooling mechanisms as primary drivers of spectral transformation in modern vision encoders.
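
The quantity being probed, band-limited Fourier energy, can be sketched for a tiny grayscale image as below. This covers only the regression target, not the linear probe or the Residual Spectral Loss itself. The radial band layout, the band count, and the function names are assumptions for illustration, and the naive DFT is only practical for very small images.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-lists image."""
    n_rows, n_cols = len(image), len(image[0])
    spectrum = [[0j] * n_cols for _ in range(n_rows)]
    for u in range(n_rows):
        for v in range(n_cols):
            total = 0j
            for y in range(n_rows):
                for x in range(n_cols):
                    angle = -2.0 * math.pi * (u * y / n_rows + v * x / n_cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def band_energies(image, num_bands):
    """Spectral energy summed inside equal-width radial frequency bands."""
    spectrum = dft2(image)
    n_rows, n_cols = len(image), len(image[0])
    max_radius = math.hypot(0.5, 0.5)   # Nyquist on both axes
    energies = [0.0] * num_bands
    for u in range(n_rows):
        for v in range(n_cols):
            # Fold indices so frequencies lie in [0, 0.5] cycles per pixel.
            fu = min(u, n_rows - u) / n_rows
            fv = min(v, n_cols - v) / n_cols
            radius = math.hypot(fu, fv)
            band = min(int(radius / max_radius * num_bands), num_bands - 1)
            energies[band] += abs(spectrum[u][v]) ** 2
    return energies


size = 8
flat_image = [[1.0] * size for _ in range(size)]
checkerboard = [[(-1.0) ** (x + y) for x in range(size)] for y in range(size)]

flat_bands = band_energies(flat_image, num_bands=4)
checker_bands = band_energies(checkerboard, num_bands=4)
print(flat_bands, checker_bands)
```

A constant image puts all of its energy in the lowest band and a pixel-level checkerboard puts it all in the highest, so per-band energies like these give a probe something frequency-specific to recover from a representation.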

34. 【2606.03793】Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

Link: https://arxiv.org/abs/2606.03793

Authors: Hashmat Shadab Malik,Muzammal Naseer,Salman Khan

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate visual perception, Multimodal Large Language, Large Language Models, Models integrate visual, Multimodal Large

Comments:

Click to view abstract

35. 【2606.03792】Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Link: https://arxiv.org/abs/2606.03792

Authors: Georgios Tsoumplekas,Stella Bounareli,Vasileios Argyriou

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: successfully enables personalization, adapting pre-trained diffusion, pre-trained diffusion models, Low-Rank Adaptation, successfully enables

Comments: Accepted at IEEE FG 2026

Click to view abstract

Abstract:Low-Rank Adaptation (LoRA) successfully enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi-concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real-world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image-based metrics. Our code is available at this https URL.

36. 【2606.03788】SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

Link: https://arxiv.org/abs/2606.03788

Authors: Zeno Testa, Antonino Furnari, Lorenzo Baraldi, Natalia Díaz-Rodríguez

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sign Language Translation, BLEU and ROUGE, source sign sequence, Sign Language, Sign Language Understanding

Comments:

Abstract:Sign Language Translation (SLT) is typically evaluated with surface-form metrics such as BLEU and ROUGE, which reward lexical overlap but do not directly measure whether a translation preserves the meaning of the source sign sequence. This is in contrast with the final objective of integrating SLT in assistive technology. In this work, we shift the focus from Sign Language Translation (SLT) to Sign Language Understanding (SLU), with particular emphasis on semantic understanding. Specifically, we evaluate systems based on their ability to correctly recover, from the input video, key semantic aspects of the original sentence, such as actions taking place and facts about people and objects. To enable this evaluation systematically, we propose SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs based on the popular PHOENIX-2014T and CSL-Daily datasets. To obtain SLU-2K, we propose and extensively evaluate an automated data generation pipeline which produces questions across 7 categories, namely actions, locations, numbers, objects, people, time, and weather conditions. We show the potential of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two representative state-of-the-art systems, MMSTL and SpaMo. Our results show that MLLMs reach near-random performance, highlighting the need for a more systematic integration of SLU in current AI systems. Furthermore, state-of-the-art translation systems carefully fine-tuned on in-domain data still exhibit a substantial semantic gap, with results ranging from 56.7% to 75.2%. These findings suggest that current SLT evaluation protocols overestimate true understanding and that future progress should be measured not only by fluency and n-gram overlap, but also by semantic correctness. Code, prompts, and benchmark files are available at this https URL

37. 【2606.03774】AmbientEye: A Dataset for Pupil Segmentation under Natural Ambient Infrared Illumination

Link: https://arxiv.org/abs/2606.03774

Authors: Mingyu Han, Hyunyoung Han, Nitheekulawatn Thommakoon, Gangtae Park, Jieun Han, Xucong Zhang, Ian Oakley

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ambient intelligence applications, smart glasses, intelligence applications, tracking is essential, essential for smart

Comments: 12 pages, 7 figures

Abstract:Eye tracking is essential for smart glasses, as it provides insight into user attention for ambient intelligence applications. However, most existing eye-tracking systems rely on active infrared (IR) illumination, creating practical barriers to all-day outdoor use due to power consumption. In this paper, we investigate whether passive IR cameras alone, without any active IR light source, can enable reliable pupil detection in unconstrained outdoor environments, where ambient sunlight serves as the sole illumination source. To support this investigation, we introduce AmbientEye, a large-scale dataset of 2,606,225 eye images collected from 35 participants from 19 countries. It is captured outdoors under natural sunlight with two off-axis camera configurations and two sun-orientation conditions. We provide high-quality pupil annotation through SAM2 automatic segmentation, followed by refinement by human annotators. We benchmark a state-of-the-art pupil segmentation algorithm on our dataset and compare its performance with that on existing datasets under controlled IR illumination. Results reveal a substantial drop in pupil segmentation performance from 0.928 on controlled IR datasets to 0.767 on AmbientEye. This performance gap highlights the challenge of the ambient-light setting. This positions AmbientEye as a first benchmark for an unexplored and highly practical eye-tracking scenario.

38. 【2606.03748】Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Link: https://arxiv.org/abs/2606.03748

Authors: Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu, Fatih Cagatay Akyon, Muhammet Esat Kalfaoglu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Distribution Focal Loss, Real-time vision demands, vision demands models, diverse hardware, simple to deploy

Comments: 31 pages, 8 figures

Abstract:Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at this https URL.

39. 【2606.03746】Qwen-Image-Flash: Beyond Objective Design

Link: https://arxiv.org/abs/2606.03746

Authors: Tianhe Wu, Kun Yan, Zikai Zhou, Lihan Jiang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Ningyuan Tang, Shengming Yin, Xiaoyue Chen, Xiao Xu, Yilei Chen, Yuxiang Chen, Yan Shu, Yixian Xu, Yanran Zhang, Zihao Liu, Zhendong Wang, Zekai Zhang, Deqing Li, Liang Peng, Yi Wang, Jingren Zhou, Chenfei Wu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: visual generative models, accelerating advanced visual, advanced visual generative, generative models, strategy for accelerating

Comments:

Abstract:Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.

40. 【2606.03730】Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models

Link: https://arxiv.org/abs/2606.03730

Authors: Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong zero-shot generalization, remain highly vulnerable, show strong zero-shot, Vision-language models, CLIP show strong

Comments:

Abstract:Vision-language models (VLMs) such as CLIP show strong zero-shot generalization but remain highly vulnerable to adversarial attacks. Adversarial training improves robustness but is computationally expensive, motivating test-time defenses. Recent approaches exploit how CLIP's visual representations respond to stochastic perturbations: aggregating predictions across noisy views, constructing Gaussian noise-averaged anchors and interpolating features toward them, or applying counter-perturbations. These strategies improve robustness but often degrade clean accuracy, yielding an unfavorable clean-robust trade-off. We revisit stochastic test-time defenses and identify an underexplored noise-regime transition in CLIP's representation space. Prior work explored perturbations mainly in the weak-noise regime, where adversarial examples can appear unusually stable (false stability). Our analysis shows this reverses as perturbation strength grows: beyond the weak-noise regime, adversarial representations become markedly more unstable than clean ones, giving a clearer separation signal. The transition is consistent across uniform and Gaussian noise, photometric and geometric transforms, datasets, and diverse attacks. It largely disappears in adversarially trained models, suggesting it is tied to the fragile local-basin geometry of adversarial representations in non-robust CLIP. We propose a training-free, plug-in drift-gated mechanism that uses high-noise feature drift as a lightweight gating signal to trigger existing test-time defenses only when adversarial-like instability is detected. Across 13 datasets it consistently improves the clean-robust trade-off. On eight fine-grained datasets, mean clean+adversarial accuracy rises from 65.7% to 71.4% for counterattack defenses and 68.4% to 73.2% for noise-anchoring; on ImageNet and four shifted variants, from 56.1% to 66.2% and 62.1% to 67.6%.

41. 【2606.03715】Text-to-Image Models Need Less from Text Encoders Than You Think

Link: https://arxiv.org/abs/2606.03715

Authors: Nurit Spingarn, Noa Cohen, Tamar Rott Shaham, Tomer Michaeli

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: text, human intent, primary interface, interface to human, image

Comments: Project webpage: [this https URL](https://nsping13.github.io/contextless-TTI/)

Abstract:Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: this https URL

42. 【2606.03713】Investigating Adversarial Robustness of Multi-modal Large Language Models

Link: https://arxiv.org/abs/2606.03713

Authors: Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models

Comments:

Abstract:Multi-modal Large Language Models (MLLMs) achieve strong performance on vision-language tasks, but incorporating visual inputs through a vision encoder (e.g., CLIP) substantially expands the attack surface, making these models vulnerable to visual adversarial perturbations. Prior defenses typically preserve compatibility with pretrained MLLMs by enforcing strict alignment to CLIP's original embedding space during adversarial fine-tuning; while practical, this constraint fundamentally limits achievable robustness. We present a systematic investigation of adversarial robustness in MLLMs. We first introduce a diagnostic CLIP-alignment protocol that predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting, revealing that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. Integrating such encoders into MLLMs via end-to-end multimodal training yields average gains of 28 CIDEr points on captioning and 11.7% VQA accuracy under strong adversarial attacks compared to constrained plug-and-play baselines. We further show that adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, establishing robust visual representations as a strict prerequisite, while end-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3% VQA accuracy. Beyond training-time defenses, lightweight test-time visual stochastic transformations serve as an effective black-box defense for non-robust MLLMs, elevating adversarial performance from near-zero to levels comparable with robust models. Finally, we show that our robust models substantially reduce toxic generation under white-box visual jailbreak attacks. Code and pretrained weights will be released publicly.

43. 【2606.03694】Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

Link: https://arxiv.org/abs/2606.03694

Authors: Jessica Wenninger, Gabriel Skantze

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: enable meaningful human-robot, continuously assess engagement, consistently tracking users, meaningful human-robot interaction, users over time

Comments: 8 pages, 5 figures, 3 tables. Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract:To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.

44. 【2606.03693】Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Link: https://arxiv.org/abs/2606.03693

Authors: Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: language largely unexplored, English radiology visual, largely unexplored, Bahasa Indonesia, Indonesian

Comments: Accepted to MMFM-BIOMED Workshop @ CVPR 2026

45. 【2606.03675】A Fast Methane Detection Pipeline on Board Satellites Based on Mag1c-SAS and LinkNet

Link: https://arxiv.org/abs/2606.03675

Authors: Jonáš Herec, Vít Růžička, Rado Pitoňák, Jan Sedmidubsky

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: potent greenhouse gas, change mitigation efforts, detecting leaks early, climate change mitigation, hyperspectral satellite imagery

Comments: arXiv admin note: substantial text overlap with [arXiv:2507.01472](https://arxiv.org/abs/2507.01472)

Abstract:Methane is a potent greenhouse gas, and detecting leaks early via hyperspectral satellite imagery can help climate change mitigation efforts. Meanwhile, many existing hyperspectral missions only capture areas manually targeted by operators, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane detection methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. In particular, we test fast target detection ACE and CEM methods that have not been previously used for methane detection and propose Mag1c-SAS -- a significantly faster variant of the current state-of-the-art Mag1c algorithm. To explore their detection potential, we integrate them with a machine learning model based on U-Net and LinkNet. We evaluate our methods on the STARCOP dataset and a novel EMIT-MSeg dataset, which we introduce and open-source alongside a high-quality annotation strategy. The proposed Mag1c-SAS approach proves highly effective by operating ~80x faster than the original Mag1c approach, providing a visually similar, but noisier result. When additionally paired with the lightweight LinkNet approach, it effectively reduces noise, achieving AUPRC score improvements of over 30 pp on EMIT-MSeg compared to the baseline Mag1c approach, and an F1 score on STARCOP ~4 pp higher. We evaluate two novel band selection strategies and confirm the system's onboard viability through hardware profiling, demonstrating marginal power consumption and efficient CPU/RAM utilization. We release the final system in a user-friendly and lightweight PyPI library at: this https URL, alongside all experimental code, models, and data at: this https URL.

46. 【2606.03666】Beyond Single Solution: Multi-Hypothesis Collaborative Deep Unfolding Network for Image Compressive Sensing

Link: https://arxiv.org/abs/2606.03666

Authors: Wenxue Cui, Hualin Li, Yuhang Qin, Yifu Xu, Xiaopeng Fan, Debin Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced Compressive Sensing, Compressive Sensing, Recent deep unfolding, effectively integrating iterative, integrating iterative optimization

Comments: Accepted by CVPR 2026

Abstract:Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches predominantly confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems that intrinsically permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.

47. 【2606.03654】Graph Regularized Non-negative Reduced Biquaternion Matrix Factorization for Color Image Recognition

Link: https://arxiv.org/abs/2606.03654

Authors: Hailang Wu, Yonghe Liu, Bingxuan Yu, Chaoqian Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

Keywords: Non-negative reduced biquaternion, reduced biquaternion matrix, biquaternion matrix factorization, reduced biquaternion, color image pixels

Comments:

Abstract:Non-negative reduced biquaternion matrix factorization (NRBMF) uses the product of reduced biquaternion (RB) matrices to incorporate the non-negativity constraints of color image pixels into the factorization process. However, NRBMF mainly focuses on reconstruction accuracy and does not exploit the local geometric structure of image data, which may limit the discriminative ability of the learned low-dimensional features. To address this issue, we propose a graph regularized non-negative reduced biquaternion matrix factorization (GNRBMF) model for color image recognition. The proposed model incorporates a graph Laplacian regularizer into the reduced biquaternion coefficient matrix, encouraging nearby samples in the original space to have similar representations in the learned feature space. Meanwhile, GNRBMF retains the non-negativity-preserving property of NRBMF in the reduced biquaternion domain. To solve the optimization problem, a component-wise alternating projected gradient algorithm is derived, and its convergence properties are analyzed. Experimental results demonstrate that the proposed GNRBMF model achieves competitive or superior recognition performance in some tested settings.

48. 【2606.03646】A Benchmark for Semi-supervised Multi-modal Crowd Counting

Link: https://arxiv.org/abs/2606.03646

Authors: Haoliang Meng, Xiaopeng Hong, Yabin Wang, Wangmeng Zuo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-modal crowd counting, crowd counting, semi-supervised multi-modal crowd, paper constructs, multi-modal crowd

Comments:

Abstract:This paper constructs the first benchmark on semi-supervised multi-modal crowd counting. To lay the foundation for this unexplored task, we first formulate the semi-supervised multi-modal setting and a standardized protocol that specifies the labeled-unlabeled data partition across different labeled ratios. Next, to establish solid reference points, we carefully tailor a diverse set of representative baselines, including existing fully supervised multi-modal methods and semi-supervised single-modal methods. Then, we carefully evaluate their performance under our proposed benchmark. Codes and the data partition will be released on this https URL.

49. 【2606.03635】VidMsg: A Benchmark for Implicit Message Inference in Short Videos

Link: https://arxiv.org/abs/2606.03635

Authors: Issar Tzachor, Michael Green, Rami Ben-Ari

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: identifying visible objects, online videos involves, Understanding short online, short online videos, objects and actions

Comments: Project page: [this https URL](https://iyttor.github.io/VidMsg)

Abstract:Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.

50. 【2606.03626】TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Link: https://arxiv.org/abs/2606.03626

Authors: Chao Wen, Jacqueline Staub, Adish Singla

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Vision-language models, visual programming, visual, Vision-language, Turtle Graphics domain

Comments: ACL Findings 2026 paper

Abstract:Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.

51. 【2606.03610】SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition

Link: https://arxiv.org/abs/2606.03610

Authors: Yanan Liu, Anqi Zhu, Jingmin Zhu, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Dan Xu, Qiuhong Ke

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single labeled exemplar, understand human behaviors, body joint sequences, Skeleton-based action recognition, aims to understand

Comments: Accepted by ICML 2026

Abstract:Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and compositional structure of human motion while aligning effectively with high-level action semantics under extreme data scarcity. Existing approaches, largely based on Euclidean embeddings and low-level motion cues, struggle to model the tree-like organization of skeleton data, limiting cross-modal alignment and generalization to unseen action categories. We propose SkelHCC, a unified skeleton hyperbolic CLIP-driven cache adaptation framework for one-shot skeleton-based action recognition. SkelHCC introduces an Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module that embeds skeleton sequences and action language into a shared hyperbolic space. By leveraging the negative curvature and exponential volume growth of hyperbolic geometry, EH-HCLIP naturally encodes the joint-part-body hierarchy of human anatomy and yields structurally consistent cross-modal representations. To support efficient one-shot adaptation, SkelHCC further integrates a training-free LLM-guided Multi-granularity Voting Cache (LMV-Cache) for context-aware inference. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD demonstrate that SkelHCC consistently outperforms state-of-the-art methods.

52. 【2606.03603】World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Link: https://arxiv.org/abs/2606.03603

Authors: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: provide complementary capabilities, multimodal large language, large language models, static visual observations, predicting future outcomes

Comments:

53. 【2606.03598】PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.03598

Authors: Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai, He Zhang, Pengteng Li, Yiren Zhao, Yandong Guo

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: language-conditioned robotic manipulation, achieved remarkable success, achieved remarkable, language-conditioned robotic, robotic manipulation

Comments: 12 pages, 5 figures

Abstract:Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns with the temporal characteristics of manipulation trajectories. It systematically under-samples brief but causally critical sub-skills, leading to phase starvation, and completely overlooks the varying degrees of forgetting across historical tasks. To overcome these limitations, we introduce PHASER, an architecture-agnostic continual learning framework. PHASER employs a phase-centric capacity allocation to guarantee equal memory support for all sub-skills, coupled with a multi-modal interference routing strategy that dynamically prioritizes historical phases at high risk of forgetting. Furthermore, to enable fully autonomous lifelong adaptation, we integrate Auto-PC, a lightweight pipeline combining unsupervised action-signal change-point detection with VLM-based semantic verification to extract temporal boundaries without intensive manual supervision. Evaluated across three VLA backbones on LIBERO continual learning suites, PHASER yields substantial empirical improvements, increasing Average Success Rate (ASR) by up to 31% over matched-budget ER and achieving an 87.8% final ASR on the LIBERO-Goal CL setting.
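
The contrast between uniform replay and a phase-centric allocation can be illustrated with a toy buffer. This is a minimal sketch under stated assumptions, not PHASER's implementation: the phase names, the trajectory lengths, the budget of 30, and the equal-quota rule in `phase_balanced_replay` are all hypothetical, and the abstract's interference routing and Auto-PC components are not modeled.

```python
import random
from collections import defaultdict


def uniform_replay(frames, budget, rng):
    """Naive ER baseline: sample frames uniformly, so long phases dominate."""
    return rng.sample(frames, budget)


def phase_balanced_replay(frames, budget, rng):
    """Split the memory budget equally across phases (illustrative rule only)."""
    by_phase = defaultdict(list)
    for phase, step in frames:
        by_phase[phase].append((phase, step))
    quota = budget // len(by_phase)
    memory = []
    for items in by_phase.values():
        # A phase shorter than its quota simply contributes all of its frames.
        memory.extend(rng.sample(items, min(quota, len(items))))
    return memory


def phase_counts(memory):
    """Count how many stored frames belong to each phase."""
    counts = defaultdict(int)
    for phase, _ in memory:
        counts[phase] += 1
    return dict(counts)


# Hypothetical trajectory: long "reach" and "move" phases, a brief "grasp" phase.
frames = (
    [("reach", t) for t in range(80)]
    + [("grasp", t) for t in range(10)]
    + [("move", t) for t in range(110)]
)
rng = random.Random(0)
uniform_counts = phase_counts(uniform_replay(frames, 30, rng))
balanced_counts = phase_counts(phase_balanced_replay(frames, 30, rng))
print("uniform :", uniform_counts)
print("balanced:", balanced_counts)
```

Under uniform sampling the brief "grasp" phase receives about 5% of the buffer in expectation (10 of 200 frames), whereas the equal-quota rule stores the same number of frames for every phase.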

54. 【2606.03581】UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion

Link: https://arxiv.org/abs/2606.03581

Authors: Ye Wu, Ruiqi Song, Baiyong Ding, Nanxin Zeng, Junjie Cheng, Yunfeng Ai

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: present unique challenges, scenes present unique, scene layouts undermine, object detection, autonomous driving

Comments: 8 pages

Click to view abstract

Abstract:Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

55. 【2606.03578】Diffusing in the Right Space: A Systematic Study of Latent Diffusability

Link: https://arxiv.org/abs/2606.03578

Authors: Tianxiong Zhong, Xingye Tian, Xuebo Wang, Xin Tao, Pengfei Wan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficient generative modeling, models leverage visual, leverage visual tokenizers, diffusion models leverage, generative modeling

Comments:

Click to view abstract

Abstract:Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated not only by fidelity but also by their diffusability. Recent studies have proposed diverse explanations for diffusion-friendly latent spaces, including semantic separability, affine equivariance, distribution uniformity, spatial structure, spectral smoothness, and manifold continuity. Yet these properties are often validated on a limited set of tokenizers, leaving it unclear which factors are most predictive of downstream generation quality and whether such conclusions hold beyond the specific settings in which they are introduced. In this work, we conduct a systematic study of latent diffusability by training a large collection of tokenizers with diverse regularization strategies, architectures, and latent configurations, and evaluating them with multiple downstream diffusion backbones. Our analysis identifies several latent properties that consistently correlate with generation quality and exhibit strong generalization across experimental settings. Beyond existing metrics, we introduce Velocity Irreducible Variance (VIV), a measure of velocity ambiguity induced by trajectory crossings. Extensive experiments show that VIV is one of the most stable predictors of generation quality.

56. 【2606.03577】Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Link: https://arxiv.org/abs/2606.03577

Authors: Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires integrating geometric, large language models, multimodal large language, integrating geometric understanding, occlusion reasoning

Comments: CVPR 2026. Project page: [this https URL](https://aim-uofa.github.io/reasonmatch/) Code: [this https URL](https://github.com/aim-uofa/ReasonMatch)

Click to view abstract

Abstract:Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

57. 【2606.03569】When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

Link: https://arxiv.org/abs/2606.03569

Authors: Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated remarkable capabilities, significant computational overhead, Vision-Language Models, overhead during inference, demonstrated remarkable

Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.

58. 【2606.03568】Learned Non-Maximum Suppression for 3D Object Detection

Link: https://arxiv.org/abs/2606.03568

Authors: Timo Osterburg, Stefan Schütte, Torsten Bertram

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: stage in LiDAR-based, reliable perception, critical stage, dense and overlapping, overlapping proposals

Comments: 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026

Click to view abstract

Abstract:Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at this https URL.
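
For context, the kind of heuristic suppression that the learned modules replace can be sketched as a greedy, distance-based filter in bird's-eye view. This is a minimal illustration in the spirit of circle NMS, not the paper's CircleNMS baseline or its code: the `(x, y, score)` tuple layout, the function name `circle_nms`, and the radius of 1.0 are assumptions.

```python
import math


def circle_nms(detections, radius):
    """Greedy distance-based suppression in bird's-eye view.

    detections: list of (x, y, score) tuples (hypothetical layout).
    A detection is kept only if its center lies farther than `radius`
    from every higher-scoring detection that was already kept.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(math.hypot(det[0] - k[0], det[1] - k[1]) > radius for k in kept):
            kept.append(det)
    return kept


# Two overlapping proposals near the origin and one isolated proposal.
proposals = [(0.0, 0.0, 0.9), (0.5, 0.0, 0.8), (5.0, 5.0, 0.7)]
kept = circle_nms(proposals, radius=1.0)
print(kept)
```

The rule looks only at center distance and score rank, which is the sort of fixed heuristic a learned, relation-aware filter is meant to improve on.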

59. 【2606.03566】Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

Link: https://arxiv.org/abs/2606.03566

Authors: Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ventricle choroid plexus, lateral ventricle choroid, key imaging biomarker, choroid plexus, multiple sclerosis

Comments:

Click to view abstract

Abstract:Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p < 0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
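
As a reminder of what the primary metric measures, the Dice Similarity Coefficient of two masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on two tiny hypothetical voxel sets and also re-derives the reported compute reduction from the two GFLOPs figures quoted in the abstract. The voxel coordinates are made up for illustration, and returning 1.0 for two empty masks is a convention chosen here, not something the abstract specifies.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient between two sets of voxel coordinates."""
    if not mask_a and not mask_b:
        return 1.0  # convention chosen for this sketch: two empty masks agree
    overlap = len(mask_a & mask_b)
    return 2.0 * overlap / (len(mask_a) + len(mask_b))


# Hypothetical 4-voxel masks that share 3 voxels.
prediction = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}
reference = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}
dsc = dice_coefficient(prediction, reference)

# Relative compute reduction from the GFLOPs figures quoted in the abstract.
reduction = 1.0 - 91.8 / 22080.0

print(f"DSC = {dsc:.3f}")
print(f"compute reduction = {reduction:.1%}")
```

The two GFLOPs values give a reduction of roughly 99.6%, consistent with the "99%" stated in the abstract.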

60. 【2606.03564】CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

Link: https://arxiv.org/abs/2606.03564

Authors: Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng, Daling Wang, Yifei Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, joint visual-textual reasoning, Large Language Models, Reasoning segmentation aims, segment target objects

Comments:

Click to view abstract

Abstract:Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-modal alignment, or explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. To address these limitations, we propose Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation, termed CR-Seg, a two-stage framework for coarse-to-refined reasoning segmentation. Specifically, we design an Extract Attention Maps and Points (EAP) module to extract attention maps for coarse target localization and select informative points, both of which are fed into SAM for mask refinement. To alleviate reasoning-answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.

61. 【2606.03540】Attend to Anything: Foundation Model for Unified Human Attention Modeling

Link: https://arxiv.org/abs/2606.03540

Authors: Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing human attention, Existing human, fragmented across modalities, persist as highly, highly fragmented

Comments: Accepted to ICML 2026

Click to view abstract

Abstract:Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generalize in real-world applications. To address the fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at this https URL.
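
For readers unfamiliar with the equation named in the abstract, its standard textbook form is given below. Here $p(\mathbf{x}, t)$ is a probability density over positions $\mathbf{x}$, $\boldsymbol{\mu}$ is a drift field, and $D$ is a diffusion coefficient, taken as constant and isotropic for simplicity. Reading $p$ as an attention density over a video frame is an interpretive assumption; the abstract does not state how AAM parameterizes or discretizes these terms.

```latex
\frac{\partial p(\mathbf{x}, t)}{\partial t}
  = -\nabla \cdot \bigl[ \boldsymbol{\mu}(\mathbf{x}, t)\, p(\mathbf{x}, t) \bigr]
    + D \, \nabla^{2} p(\mathbf{x}, t)
```

The first term transports density along the drift field, and the second spreads it out over time.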

62. 【2606.03539】Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

Link: https://arxiv.org/abs/2606.03539

Authors: Haoxuan Chen, Xianqin Liu, Jian-Fang Hu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spatio-Temporal Video Grounding, Video Grounding aims, localize object tubes, object tubes based, Video Grounding

Comments: Accepted by ICME 2026

Click to view abstract

Abstract:Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
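
The geometric property the abstract relies on is easy to check numerically: if W n = 0, then W (x + n) = W x. The toy example below uses a hand-picked 2x3 weight matrix and a null-space vector chosen by inspection. All values and names are hypothetical, and the sketch does not reproduce the Quality-Adaptive Unit or Dual-Space Reparameterization.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector, in pure Python."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


# Frozen layer weights (hypothetical) and a vector in their null space:
# each row of W dotted with null_vector gives zero.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
null_vector = [1.0, 1.0, -1.0]

x = [2.0, 3.0, 5.0]

# A residual inside the null space is invisible to the frozen layer.
null_residual = [0.5 * n for n in null_vector]
x_null = [a + b for a, b in zip(x, null_residual)]

# A residual outside the null space changes the layer output.
other_residual = [1.0, 0.0, 0.0]
x_other = [a + b for a, b in zip(x, other_residual)]

print(matvec(W, x))        # original output
print(matvec(W, x_null))   # unchanged by the null-space residual
print(matvec(W, x_other))  # changed by the non-null residual
```

This is the sense in which a residual can be made "selectively invisible": components confined to the null space leave the frozen layer's output untouched, while components outside it alter the output.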

63. 【2606.03509】EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

Link: https://arxiv.org/abs/2606.03509

Authors: Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: zero-shot embodied navigation, Building memory, essential for long-horizon, long-horizon planning, embodied navigation

Comments: Preprint

Click to view abstract

Abstract:Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.

64. 【2606.03508】Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection

Link: https://arxiv.org/abs/2606.03508

Authors: Peitong Wang, Nuo Wang, Enxin Qin, Chengjin Yu, Hanyu Xuan, Yuanting Yan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Printed circuit board, dense circuit backgrounds, automated optical inspection, Printed circuit, PCB defect detection

Comments: Preprint. 38 pages, 12 figures, 6 tables

Click to view abstract

Abstract:Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect detection framework that combines structure-guided mixed masked pretraining with spatial continuity regularization. In the pretraining stage, we design a sparse convolutional masked pretraining scheme to exploit unlabeled PCB images, where structure-guided mixed masking is used to construct informative masked inputs. The sparse convolutional reconstruction pipeline suppresses invalid responses from masked regions and enables the detector backbone to infer missing PCB structures from visible conductive patterns, thereby learning PCB structural priors. In the fine-tuning stage, the pretrained backbone is transferred to the downstream defect detection task. For the task, a spatial continuity regularization term is introduced during fine-tuning. This term constrains dispersed positive predictions assigned to the same defect instance and promotes more compact localization on elongated defect regions. Experiments on the DsPCBSD+ dataset show that the proposed method achieves 85.5% mAP@0.5 and 52.3% mAP@0.5:0.95, outperforming several strong baseline detectors. Ablation studies and qualitative results further confirm the effectiveness of the proposed framework for robust PCB defect detection in industrial AOI scenarios.

65. 【2606.03506】AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

Link: https://arxiv.org/abs/2606.03506

Authors: Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: face distinct challenges, separately model body, transfer methods face, methods face distinct, approaches that lift

Comments: CVPR 2026 Findings. 16 pages, including supplementary material

Click to view abstract

Abstract:Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: this https URL

66. 【2606.03499】Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark

Link: https://arxiv.org/abs/2606.03499

Authors: Quoc-Anh Bui-Huynh, Thanh Duc Ngo, Xue Geng, Kaixin Xu, Wang Zhe, Xulei Yang, Ngai-Man Cheung

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computation cost amplification, illusory object injection, hoc model watermarking, post hoc model, Gaussian Splatting

Comments:

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this expanding threat surface, existing studies focus mainly on attack success, while defense and detection remain underexplored. From a detection perspective, a key challenge and opportunity arise from the multi-stage nature of the 3DGS reconstruction pipeline, which produces heterogeneous intermediate representations. Forensic signals for detecting poisoning are inherently stage dependent: an attack introduced at one stage may produce signals that emerge only at later stages. This motivates a stage-wise view of detectability that goes beyond single-stage evaluation. We introduce Poison-3DGS, a benchmark for stage-wise characterization of poisoning detection in 3DGS. It exposes stage-specific artifacts, including multi-view images, geometry, training dynamics, and Gaussian parameters, across a diverse set of scenes and attacks. Using it, we conduct a systematic study of detectability across pipeline stages. Our analysis reveals several insights. First, detectability varies significantly across stages, and no single stage consistently dominates across attack types. Second, different attacks exhibit distinct stage-specific forensic signals, so detection effectiveness depends critically on where signals are observed. Third, later-stage signals such as training dynamics and Gaussian parameter statistics provide strong cues not observable at earlier stages. Overall, our work provides a principled benchmark and the first systematic characterization of stage-dependent detectability in 3DGS, offering a foundation for future research on robust and reliable 3DGS systems.

67. 【2606.03493】Low-Frequency Shortcuts in Texture-Driven Visual Learning

Link: https://arxiv.org/abs/2606.03493

Authors: Utku Şirin, Cathy Hou, David Alvarez-Melis, Stratos Idreos

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Neural networks suffer, Neural networks, learned features generalize, Neural, networks suffer

Comments:

Click to view abstract

Abstract:Neural networks suffer from shortcut learning, where learned features generalize well to the training set but not to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however, are texture-driven. In this work, we present a shortcut learning analysis for texture-driven domains and compare it with that of a standard benchmark. We show that texture-driven domains suffer from low-frequency shortcuts. Models trained on them make the majority of their decisions based on a few low-frequency components (LFCs) with a skewed spectral behavior, even though the classification information lies in higher-frequency, fine-grained details. Pruning LFCs from training and test sets eliminates the shortcut and provides a more balanced spectral behavior, improving the ID accuracy by up to 8%. We show that low-frequency shortcuts make the models highly vulnerable to OOD corruptions, leading to an accuracy drop of up to 70% compared to the ID accuracy. Pruning LFCs significantly improves robustness to low-frequency corruptions, by up to 40%, and introduces a trade-off for high-frequency corruptions: the balanced spectral behavior provides a better generalization performance, whereas the increased dependence on high-frequency features reduces it. OOD accuracy depends on the interaction between these two factors.
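
The idea of pruning low-frequency components can be illustrated on a 1-D toy signal: transform to the frequency domain, zero the lowest bins, and transform back. This is a sketch under stated assumptions, not the paper's procedure. The paper works with images, the abstract does not specify its pruning rule, and the naive DFT, the `cutoff` parameter, and the test signal below are all illustrative choices.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a toy example)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def prune_low_frequencies(signal, cutoff):
    """Zero every component whose frequency index is below `cutoff` (0 is DC)."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        if min(k, n - k) < cutoff:  # bins k and n - k carry the same frequency
            spectrum[k] = 0
    return idft(spectrum)


# Toy signal: a constant offset, a slow wave (frequency 1), and a fine detail (frequency 6).
n = 16
signal = [
    3.0 + 2.0 * math.cos(2 * math.pi * t / n) + 0.5 * math.cos(2 * math.pi * 6 * t / n)
    for t in range(n)
]
pruned = prune_low_frequencies(signal, cutoff=2)
fine_detail = [0.5 * math.cos(2 * math.pi * 6 * t / n) for t in range(n)]
print(max(abs(p - d) for p, d in zip(pruned, fine_detail)))
```

With `cutoff=2`, the offset and the slow wave are removed and only the small high-frequency detail remains, which is the component a model would then have to rely on.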

68. 【2606.03490】rAction: Action Recognition with Sparse Trajectories

Link: https://arxiv.org/abs/2606.03490

Authors: Jan F. Meier, Felix B. Mueller, Alexander Ecker, Timo Lüddecke

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frequently exploit appearance, RGB video volumes, Modern action recognition, dense RGB video, compute-intensive dense RGB

Comments:

Click to view abstract

Abstract:Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: this https URL

69. 【2606.03479】PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting

Link: https://arxiv.org/abs/2606.03479

Authors: Adrian Ramlal, John S. Zelek

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: synchronized multi-camera video, Gaussian Splatting, methods reconstruct time-varying, reconstruct time-varying scenes, Pattern Recognition

Comments: Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction

Click to view abstract

Abstract:Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness. We propose PersistGS, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by +2.46dB PSNR and comes within 0.19dB of a ground-truth trajectory upper bound.

ACM classes: I.4.8; I.3.7; I.2.9

DOI: https://doi.org/10.48550/arXiv.2606.03479

Journal reference: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696

70. 【2606.03470】Mixed-Modality Dual Face-Hair Retrieval

Link: https://arxiv.org/abs/2606.03470

Authors: Quoc-Anh Bui-Huynh, Mai-Tuyen Lam, Dai-Anh-Tuan Nguyen, Thanh Duc Ngo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: introduce Dual Face-Hair, hairstyle reference expressed, introduce Dual, Dual Face-Hair Retrieval, mixed-modality dual-reference task

Abstract:We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes -- identity and hairstyle -- originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.

71. 【2606.03460】From 3D Perception to Safety Reasoning: A Graph-Based Framework for Real-Time Underground Mine Monitoring

Link: https://arxiv.org/abs/2606.03460

Authors: Pasindu Ranasinghe, Simit Raval, Dibyayan Patra, Bikram Banerjee, Ismet Canbulat

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: equipment proximity violations, coal mining requires, mining requires personnel, poorly illuminated spaces, occluded blind spots

Abstract:Underground coal mining requires personnel and heavy equipment to operate within shared, confined, and poorly illuminated spaces where hazards such as equipment proximity violations, structural instabilities, and occluded blind spots are difficult to anticipate. Conventional monitoring systems, including fixed cameras and rule-based proximity alerts, can detect predefined events but lack the 3D scene understanding and contextual memory needed to identify complex or evolving hazards. This paper presents a continuous monitoring framework that converts colourised 3D point clouds into structured and traceable safety reasoning outputs. The framework combines 3D semantic perception, uncertainty-based anomaly detection, rule-based hazard checks, on-device LLM reasoning, and GraphRAG-based memory analysis to identify immediate hazards and interpret longer-term safety patterns. Scene and temporal graphs serve as the explicit knowledge structure, linking perception outputs across reasoning stages. To overcome the scarcity of labeled underground data, real roadway scans, controlled object placement, and high-fidelity longwall simulation were combined to generate diverse hazard scenarios, while self-supervised pretraining improved segmentation from limited annotations. The perception model achieved 92.7% accuracy at 30 FPS with low memory usage. Across 115 hazard scenarios, rule-based checks achieved 57% coverage, increasing to 76% with contextual LLM reasoning and 93% with memory-based reasoning using historical records. Qualitative results show uncertainty-derived anomaly signals support the interpretation of out-of-distribution hazards beyond predefined classes. Overall, graph-based knowledge representation combined with 3D perception and layered safety reasoning provides a practical foundation for intelligent decision support in underground mine monitoring.

72. 【2606.03444】PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

Link: https://arxiv.org/abs/2606.03444

Authors: Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision Foundation Models, diverse Vision Foundation, single efficient model, negative transfer inherent, Vision Foundation

Comments: Accepted to ICML 2026

Abstract:Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.

73. 【2606.03420】PHAF-Personalized Hand Avatars in a Flash

Link: https://arxiv.org/abs/2606.03420

Authors: Meghana Shankar, Akanxit Upadhyay, Anmol Namdev, Green Rosh KS, Pawan Prasad BH

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unlike slow optimization-based, slow optimization-based techniques, PHAF-Personalized Hand Avatars, photo-realistic hand avatar, high quality multi-view

Abstract:We present PHAF (Personalized Hand Avatars in a Flash), a personalized photo-realistic hand avatar that provides high-quality multi-view renders from just two images (dorsal and palmar views). Unlike slow optimization-based techniques, PHAF quickly generates personalized textures for real-time deployment on edge devices. Our approach combines semantically guided mesh alignment and densified texture extraction to transfer high-frequency details efficiently. A view-based inpainting network refines textures, ensuring a smooth, continuous appearance. PHAF generalizes to novel viewpoints and leverages a parametric hand model for accurate articulations, making it compatible with standard graphics engines. Experiments show it is comparable to existing methods in visual fidelity while cutting texture generation time by a factor of 30, enabling practical AR/VR applications.

74. 【2606.03418】IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection

Link: https://arxiv.org/abs/2606.03418

Authors: Hengyang Zhou, Rongman Hong, Yuxuan Zhou, Jing Wang, Zhaoyan Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal fake, fake news detection, Existing multimodal fake, aims to identify, identify the authenticity

Comments: Accepted by GlobalSouthML@ICML 2026

Abstract:Multimodal fake news detection aims to identify the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. Moreover, misinformation often contains semantic information that is incongruous with the facts. To address these challenges, we propose Incongruity-aware Distribution Optimization (IDO) to improve the performance of fake news detection from the perspectives of factual incongruity and modality incongruity. For factual incongruity, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings and use a Gaussian distribution to model the uncertain correlation caused by factual incongruity. For modality incongruity, we use incongruity contrastive learning to learn cross-modal semantic information. Experiments demonstrate that IDO achieves state-of-the-art performance.

75. 【2606.03417】A unified multi-task framework enables interpretable chest radiograph analysis

Link: https://arxiv.org/abs/2606.03417

Authors: Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal deep learning, Chest X-ray Analysis, Interpretable Multi-task Transformer, existing black-box systems, existing black-box

Abstract:While multimodal deep learning has advanced medical imaging analysis, existing black-box systems may remain confined to isolated tasks, often overlooking the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transformer for Chest X-ray Analysis), a framework that emulates radiologists' diagnostic workflow through three evidence-driven stages: 1) Disease recognition; 2) Attribute characterization (e.g., size, location, severity quantification); 3) Evidence-integrated report generation with traceable decision pathways. The framework employs a unified transformer architecture optimized via medical-domain instruction tuning, sequentially executing four clinical tasks: multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation. Experimental validation demonstrates competitive performance on ten CXR benchmarks under direct inference and fine-tuning settings. In a blinded evaluation of 160 historical reports from four medical centers, three radiologists rated 66% of AI-generated reports as comparable to or surpassing original clinical reports in diagnostic clarity, highlighting the framework's translational potential. By establishing traceable diagnostic pathways from anatomical findings to conclusions, this work bridges the gap between AI technical metrics and clinical utility, advancing trustworthy AI systems in medical imaging.

76. 【2606.03410】Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

Link: https://arxiv.org/abs/2606.03410

Authors: Abhishek Kumar, Isha Motiyani, Tilak Kasturi, Ethan Seefried, Prahitha Movva, Tirthankar Ghosal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unlike natural images, dense spatial layouts, Engineering diagrams pose, unlike natural, spatial layouts

Abstract:Engineering diagrams pose a distinct challenge for vision-language models: unlike natural images or general documents, they encode information through dense spatial layouts, domain-specific symbols, and cross-references between visual callouts and structured parts tables. Despite their centrality to service, repair, and design workflows, there is no public benchmark for measuring VLM capabilities in this domain; existing datasets primarily focus on flowcharts, scientific figures, or business documents. To address this gap, we introduce Enginuity, the first open dataset and benchmark for evaluating VLMs on complex engineering diagrams. We define two benchmark tasks over a corpus of U.S. military service and repair manuals: structured parts-table extraction (Task 1) and free-form visual diagram question answering (VQA; Task 2). We evaluate four frontier VLMs (GPT-5.2 Chat, Claude Opus 4.7, Gemma 4, Qwen3-VL-32B-Instruct) under zero-shot and chain-of-thought prompting. On Task 1, models reach Recall@all of 0.61-0.87 but Token F1pen of only 0.03-0.18, exposing a systematic gap between part identification and description fidelity. Task 2 reveals a consistent factual-reasoning gap across all models. A supporting analysis shows that token-overlap metrics under-report model capability on technical descriptions by 2-6x relative to semantic similarity, motivating LLM-as-judge calibration for domain-specific evaluation. We release the dataset, annotations, evaluation harness, and per-sample model outputs to support a reproducible study of VLM capability on engineering content.
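To make the token-overlap point above concrete, here is a minimal sketch of a plain token-level F1 score between a predicted and a reference part description. It assumes whitespace tokenization and lowercasing, and it is the ordinary overlap F1, not the paper's "Token F1pen" variant, whose definition the abstract does not give. The function name `token_f1` and the example strings are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical parts-table descriptions: same part, different wording.
    print(token_f1("hex bolt m8", "bolt, hexagon head, M8 x 1.25"))
```

In this toy pair only the token `m8` matches exactly ("bolt" differs from "bolt,"), so the score is low even though both strings describe the same part. That is the kind of gap a semantic-similarity or judge-based metric is meant to close.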

77. 【2606.03406】SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

Link: https://arxiv.org/abs/2606.03406

Authors: Xu Pan, Qiyuan Ma, Mingyue Dong, He Chen, Wei Ji, Xianwei Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Structure from Motion, Reliable correspondence estimation, Reliable correspondence, underpinning applications, fundamental problem

Comments: 14 pages

Abstract:Reliable correspondence estimation is a fundamental problem in image processing, underpinning applications such as Structure from Motion, visual localization, and image registration. Existing learning-based methods have significantly improved local feature representations, yet most still operate at the pixel or patch level and lack explicit modeling of regions that are jointly visible across views. We propose SAMatcher, a feature matching framework that formulates correspondence estimation through co-visibility modeling. Instead of directly matching local features, SAMatcher first predicts co-visible region masks and bounding boxes as structured priors for correspondence estimation. Built upon the Segment Anything Model (SAM), it introduces a symmetric cross-view interaction mechanism that enables bidirectional feature exchange and cross-view semantic alignment. We further develop a unified supervision scheme that jointly optimizes mask prediction and box localization through mask learning, box regression, and mask-box consistency constraints. Extensive experiments on challenging benchmarks demonstrate substantial improvements over existing matching pipelines, particularly under large viewpoint and scale variations. Our results show that foundation models originally designed for monocular segmentation can be effectively extended to multi-view correspondence reasoning through explicit co-visibility modeling, offering a new perspective on structured representation learning for image matching. Code and project page: this https URL

78. 【2606.03402】Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation

Link: https://arxiv.org/abs/2606.03402

Authors: Xuan Wei, Jiahui Chen, Kaiheng Li, Mingyu Shao, Qingqi Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: co-speech gesture generation, temporally coherent human, Audio-driven human motion, video generation aims, single static image

Comments: Accepted by ICME 2026

Abstract:Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Moving beyond conventional keypoint-based methods that often struggle to capture subtle motion dynamics, we propose a novel implicit-motion framework for generating realistic and temporally coherent human motion videos from a single static image and audio. Our approach uses a two-stage pipeline that decouples motion prediction from rendering. The first stage integrates appearance priors and hierarchical depth cues into a region-aware attention mechanism to model latent motion features. The second stage employs a Mamba-enhanced diffusion model to directly predict these features from audio and the source image, enabling unsupervised learning of fine-grained motion patterns. This decoupled architecture enhances flexibility and efficiency. Trained on a new 380-hour high-quality dataset, our method outperforms prior work across multiple public benchmarks and our collected data in accuracy, naturalness, and temporal coherence, setting a new state of the art.

79. 【2606.03401】Towards Characterizing Scientific Image Utility and Upgradability

Link: https://arxiv.org/abs/2606.03401

Authors: WenZhe Li, Qihang Yan, Liang Chen, Junying Wang, Farong Wen, Yijin Guo, Chunyi Li, Zicheng Zhang, Guangtao Zhai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrity faces unprecedented, faces unprecedented threats, research communication, function as critical

Abstract:Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the Scientific Image Utility and Upgradability Assessment (SIU²A) framework, which introduces two complementary dimensions for scientific image evaluation. Utility encompasses error detection (identifying scientific inaccuracies) and correction feasibility (assessing whether errors can be reliably repaired). Upgradability measures the quality of correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU²A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the Utility stage evaluates error detection capability and repair instruction generation, while the Upgradability stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.

80. 【2606.03376】P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Link: https://arxiv.org/abs/2606.03376

Authors: Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Vision-Language Models, recently garnered significant, garnered significant research, Large Vision-Language, Direct Preference Optimization

81. 【2606.03348】SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

Link: https://arxiv.org/abs/2606.03348

Authors: Junxiao Yang, Minghao Zhang, Xiaoce Wang, Haoran Liu, Shiyao Cui, Hongning Wang, Minlie Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent generative models, realistic embedded text, Recent generative, produce visual artifacts, text and layouts

Abstract:Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SynCred-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fine-grained circulation styles, together with FP450, a real-image negative set for measuring false positives. Extensive evaluation shows that existing systems remain unreliable: under a 5% false-positive-rate constraint, 15 MLLMs achieve only 10.5% true positive rate (TPR), open-source AIGC detectors achieve less than 5%, and commercial APIs reach 57.6%. Human annotators also struggled to identify synthetic credibility, reaching only 63% TPR. These findings establish synthetic credibility as a severe and underexplored visual misinformation challenge, and provide a benchmark for developing detectors that reason beyond superficial credibility cues.
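The "TPR under a 5% false-positive-rate constraint" figure quoted above can be illustrated with a short sketch. It assumes each detector emits a scalar score where higher means "more likely synthetic", and that an image is flagged when its score is strictly above a threshold chosen on the real-image negatives. The helper name `tpr_at_fpr` and the scores are hypothetical. The benchmark's actual evaluation harness may differ.

```python
import math


def tpr_at_fpr(neg_scores, pos_scores, max_fpr=0.05):
    """Return (tpr, fpr, threshold) for the loosest threshold with FPR <= max_fpr.

    A sample is flagged as synthetic when its score is strictly greater than
    the threshold, so ties with the threshold are never counted as positives.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of real images we may wrongly flag (small epsilon guards against
    # floating-point products such as 2.9999999999999996).
    allowed = math.floor(max_fpr * len(neg_sorted) + 1e-9)
    if allowed >= len(neg_sorted):
        threshold = float("-inf")
    else:
        # The (allowed + 1)-th largest negative score: at most `allowed`
        # negatives lie strictly above it.
        threshold = neg_sorted[allowed]
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr, threshold


if __name__ == "__main__":
    real = [i / 20 for i in range(20)]   # hypothetical scores on real images
    fake = [0.20, 0.50, 0.92, 0.99]      # hypothetical scores on synthetic images
    print(tpr_at_fpr(real, fake))
```

With a 450-image negative set such as FP450, a 5% budget under this convention allows at most 22 real images to be flagged, which is why a detector with overlapping score distributions ends up with a low TPR.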

82. 【2606.03345】Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

Link: https://arxiv.org/abs/2606.03345

Authors: Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: problem for understanding, perceived affectively, perception experiences, Perception Topics, Perception

Comments: 8 pages

83. 【2606.03341】Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

Link: https://arxiv.org/abs/2606.03341

Authors: Zhikang Li, Yan Wu, Xin Hu, Yi Dai, Ming Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Structured State Space, State Space Duality, primary challenge lies, multi-modal image registration, feature

Abstract:In multi-modal image registration, the primary challenge lies in extracting shared structural information. Compared to Transformers, Structured State Space Duality (SSD) offers stronger global structural feature extraction with higher efficiency during training and inference. Inspired by these advantages, we propose a novel algorithm for multi-modal image registration, named RegNetMamba-2. Our algorithm incorporates SSD into a coarse-to-fine matching process to extract local and global structural features effectively. First, SSD is applied at three different scales for multi-modal feature extraction in our network. To strengthen local representation, we pay more attention to foreground edges and structural information through the feature scaling function of SSD. Second, for shared feature extraction from the input images and multi-modal feature fusion at all scales, we propose a cross-modality feature fusion model based on SSD, consisting of a Cross-Modality feature Interaction (CMI) module and a Multi-Scale feature Fusion (MSF) module. The CMI module performs cross-modality feature extraction at each scale by applying SSD in a cross form. The MSF module employs progressive upward fusion at the feature level to obtain fine features that combine multi-modal features from all scales. Following the coarse-to-fine scheme, the 1/8-scale features from CMI and the 1/2-scale features from MSF are collected to calculate matching probability scores. We then establish the matching at each scale through pixel-wise correspondences. Extensive experiments demonstrate that, compared with state-of-the-art deep-learning-based algorithms, RegNetMamba-2 achieves good performance and efficiency for multi-modal image registration on the following datasets: VIS-SAR (OSDataset), VIS-IR (LGHD/RoadScene), and VIS-NIR (RGB-NIR Scene).

84. 【2606.03338】IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension

Link: https://arxiv.org/abs/2606.03338

Authors: Julie Mordacq, Vicky Kalogeiton, Steve Oudot

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Self-supervised learning, learning meaningful representations, unlabeled data, Minimum Spanning Tree, learning meaningful

Comments: ICML 2026

Abstract:Self-supervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight into the geometric structure of the representation space. In this work, motivated by connections between neural network generalization and intrinsic dimension (ID), we propose IdEst, a method for estimating the ID of SSL representations via the Minimum Spanning Tree dimension estimator ($\mathrm{dim}_\mathrm{MST}$). Across diverse datasets, architectures, and SSL pretraining objectives, we show that IdEst strongly correlates with downstream linear probe performance. Furthermore, we demonstrate that IdEst enables efficient hyperparameter selection, significantly reducing the computational cost compared to supervised alternatives. Our results highlight intrinsic dimensionality as a principled geometric proxy for assessing SSL representations, complementing standard supervised probing protocols.
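As a rough illustration of how a minimum-spanning-tree dimension estimate can work, the sketch below uses the classical scaling argument: for n points sampled from a d-dimensional set, the total MST edge length grows roughly like n^((d-1)/d), so the slope a of log-length against log-n gives d ≈ 1/(1-a). This is a generic textbook-style estimator written in pure Python, not the IdEst implementation. The function names, subsample sizes, and toy data are all assumptions.

```python
import math
import random


def mst_length(points):
    """Total Euclidean edge length of the minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def mst_dimension(points, sizes=(50, 100, 200, 400), seed=0):
    """Estimate intrinsic dimension from MST length growth over nested subsamples."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    xs = [math.log(n) for n in sizes]
    ys = [math.log(mst_length(shuffled[:n])) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Least-squares slope of log L(n) against log n.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return math.inf if slope >= 1.0 else 1.0 / (1.0 - slope)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy data embedded in 3D: a line segment (ID 1) and a flat square (ID 2).
    line = [(t, 2 * t, -t) for t in (rng.random() for _ in range(400))]
    square = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
    print(mst_dimension(line), mst_dimension(square))
```

The two toy sets live in the same 3D ambient space, yet the estimates should land near 1 and near 2 respectively. In practice, the points would be representation vectors from an encoder rather than synthetic coordinates.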

85. 【2606.03314】TASE: Truncation-Aware Semantic Embeddings for 3D Scene Understanding and Editing

Link: https://arxiv.org/abs/2606.03314

Authors: Tim-Felix Faasch, Jochen Kall, Lucas Nunes, Jens Behley, Cyrill Stachniss

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including robotics, autonomous driving, High-fidelity semantic, crucial for numerous, High-fidelity

Abstract:High-fidelity semantic 3D scene representations are crucial for numerous applications, including robotics, autonomous driving, and simulation. Beyond this, the ability to edit such representations enables developers to adapt these applications more easily to specific target scenarios. Current approaches provide limited support for controllable editing. We introduce TASE, a method that projects pretrained 2D semantic features into a truncation-aware embedding space to enable flexible 3D scene editing. Our method explicitly optimizes a feature space in which progressively reducing feature channels yields increasingly abstract semantic representations, while retaining more channels preserves fine-grained detail. Additionally, we improve multi-view consistency of the features using a scale- and translation-equivariance loss. The resulting truncation-aware embedding space enables text-driven edits to 3D scenes, providing explicit control over how strongly edits adhere to the original scene content and allowing more substantial modifications than prior methods. Moreover, we propose a finetuning stage for the editing diffusion model to mitigate artifacts caused by geometric changes. Experimental results demonstrate competitive performance in 3D scene editing, substantially outperforming prior methods on edits involving large geometric modifications.
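The idea that dropping trailing feature channels yields a more abstract representation can be sketched with a toy comparison. The example assumes an embedding whose leading channels carry coarse semantics and whose trailing channels carry fine detail, and it compares two vectors using only the first k channels. The vectors and the helper `truncated_cosine` are hypothetical. How TASE actually trains such a space is not shown here.

```python
import math


def truncated_cosine(a, b, k):
    """Cosine similarity using only the first k channels of each embedding."""
    a_k, b_k = a[:k], b[:k]
    dot = sum(x * y for x, y in zip(a_k, b_k))
    norm = math.sqrt(sum(x * x for x in a_k)) * math.sqrt(sum(y * y for y in b_k))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical 4-channel features: the first two channels encode a coarse
    # class ("seat"), the last two encode fine-grained appearance.
    chair = [1.0, 0.0, 0.9, 0.1]
    stool = [1.0, 0.0, -0.2, 0.8]
    for k in (2, 4):
        print(k, round(truncated_cosine(chair, stool, k), 3))
```

With k = 2 the two objects are indistinguishable, while with all four channels they separate. A truncation level chosen at edit time could therefore act as a dial between "same coarse category" and "same fine-grained instance".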

86. 【2606.03301】SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

Link: https://arxiv.org/abs/2606.03301

Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-form video benchmark, full-length TV series, long-form video, reasoning, video reasoning benchmarks

87. 【2606.03287】BA-T: An Iterative Transformer for Two-View Bundle Adjustment

Link: https://arxiv.org/abs/2606.03287

Authors: Ganlin Zhang, Weirong Chen, Daniel Cremers, Xi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved strong performance, Feed-forward models, achieved strong, strong performance, Feed-forward

Abstract:Feed-forward models for 3D reconstruction have achieved strong performance using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, resulting in poor multi-view consistency. We address this by drawing inspiration from classical bundle adjustment (BA), which can be viewed as an iterative information propagation process between poses and local geometry. Building on this view, we propose BA-T, an iterative Transformer that implements BA-style structured updates as a repeatable layer in implicit token space. Instead of relying on deep attention stacks, BA-T refines predictions from a latent residual using a single lightweight layer. Experiments demonstrate that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structural alternative to depth-heavy attention, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at this https URL.

88. 【2606.03273】VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Link: https://arxiv.org/abs/2606.03273

Authors: Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: connecting fine-grained clues, complex visual queries, inspecting image regions, requires multimodal large, repeatedly inspecting image

89. 【2606.03264】PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

Link: https://arxiv.org/abs/2606.03264

Authors: Zelun Zhang, Hongen Liu, Suyin Liang, Yubo Zhang, Yiqing Xiang, Jiaxuan Liu, Ting Sun, Manhui Lin, Yue Zhang, Changda Zhou, Tingquan Gao, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: upgraded compact document, compact document parsing, document parsing model, parsing model built, upgraded compact

Abstract:We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.

90. 【2606.03254】FreeStreamGS: Online Feed-forward 3D Gaussian Splatting from Unposed Streaming Inputs

Link: https://arxiv.org/abs/2606.03254

Authors: Ruiyang Chen, Feiran Li, Chu Zhou, Zonglin Li, Zhanyu Ma, Heng Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recorded image sequence, Gaussian Splatting, view synthesis, offline recorded image, high-fidelity novel view

Abstract:Feed-forward 3D Gaussian Splatting (3DGS) allows efficient and high-fidelity novel view synthesis (NVS) from an offline recorded image sequence. However, achieving online NVS from streaming and unposed image inputs remains challenging. Although online feed-forward geometric estimation methods have been proposed for streaming depth and point cloud recovery, they cannot be adapted to NVS due to severe rendering artifacts. This is because NVS demands stricter multi-view consistency in Gaussian scales and pose-geometry alignment; even minor deviations would accumulate over time and visibly degrade rendering quality. To this end, we propose FreeStreamGS, a robust online feed-forward framework for efficient and high-quality NVS. We introduce two key mechanisms: a Decoupled Intrinsic Recovery Head that removes cumulative camera intrinsic bias and prevents scene scale jitter during long-term streaming, and a Dynamic Point Refinement Offset strategy that relaxes rigid unprojection to correct coupled pose-depth drift. Extensive experiments show that FreeStreamGS achieves rendering quality competitive with state-of-the-art offline feed-forward 3DGS methods, despite operating without access to future frames.

91. 【2606.03251】Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

Link: https://arxiv.org/abs/2606.03251

Authors: Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

Keywords: natural experiments, events that affect, affect some individuals, individuals or groups, constitute an implicit

Abstract:In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

92. 【2606.03246】MariData: One-Step Unpaired Image Translation for Maritime Environments

Link: https://arxiv.org/abs/2606.03246

Authors: Santeri Henriksson,Mehdi Asadi,Amin Majd,Juha Kalliovaara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous Surface Ships, Surface Ships, robust perception systems, Maritime Autonomous Surface, Autonomous Surface

Comments:

Abstract:The development of robust perception systems for Maritime Autonomous Surface Ships (MASS) is heavily constrained by the scarcity of diverse training data, particularly for adverse weather and low-light conditions. Because collecting paired images in dynamic maritime environments is physically impossible, synthetic data generation via unpaired image-to-image translation offers a critical solution. However, existing generative models fail to preserve the fine structural details of small navigational objects due to latent compression bottlenecks. In this paper, we introduce a framework for generating synthetic maritime data using CycleGAN-turbo, a one-step unpaired translation architecture. By incorporating zero-convolution skip connections to bypass the Variational Autoencoder (VAE) bottleneck, our approach explicitly preserves small object details (e.g., distant vessels and sea marks) during translation. We compiled a dataset of 7,000 maritime images to train and evaluate models for Day-to-Foggy, Day-to-Sunset, and Day-to-Night domain translations. Qualitative evaluations and variable-strength inference studies demonstrate that our method effectively synthesizes realistic atmospheric conditions while maintaining the underlying semantic structure of the scene. The Day-to-Foggy and Day-to-Sunset models exhibit great structural retention, whereas the Day-to-Night model highlights the challenge of semantic hallucination, such as generating artificial coastal lights, induced by unbalanced training distributions. Ultimately, this work establishes an efficient, structure-aware data synthesis pipeline that directly addresses the data scarcity bottleneck in autonomous maritime navigation.

93. 【2606.03243】MemoGen: Can Past Experience Improve Future Text-to-Image Generation?

Link: https://arxiv.org/abs/2606.03243

Authors: Wenshuo Chen,Kuimou Yu,Bowen Tian,Jianfei Song,Shaofeng Liang,Haozhe Jia,Kan Cheng,Haosen Li,Kaishen Yuan,Lei Wang,Jiemin Wu,Songning Lai,Yutao Yue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prompts require implicit, require implicit visual, relational reasoning, external knowledge, models have achieved

Comments:

Abstract:Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.

94. 【2606.03216】Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting

Link: https://arxiv.org/abs/2606.03216

Authors: Junkun Yuan,Yutao Shen,Toru Aonishi,Hideki Nakayama,Yue Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reward models, models, reward, preference, image inpainting

Comments: 23 pages, 14 figures. arXiv admin note: substantial text overlap with [arXiv:2509.23082](https://arxiv.org/abs/2509.23082)

Abstract:We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases--particularly in brightness, composition, and color scheme--that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: this https URL.

95. 【2606.03214】Effect of Demographic Bias on Skin Lesion Classification

Link: https://arxiv.org/abs/2606.03214

Authors: Ralf Raumanns,Gerard Schouten,Veronika Cheplygina,Josien P.W. Pluim

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: skin lesion classification, ResNet-based convolutional models, skin lesion, lesion classification, classification using ResNet-based

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures

Abstract:In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

96. 【2606.03201】Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Link: https://arxiv.org/abs/2606.03201

Authors: Zhao Yang,Xinrui Zu,Jacob E. Kooi,Thomas Delliaux,He Liu,Shujian Yu,Kevin Sebastian Luck,Vincent François-Lavet

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Reinforcement learning, Cross-domain Video Prediction, DMC Body Suite, visually distinct domains, expert videos

Comments:

Abstract:Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent's appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: this https URL

97. 【2606.03183】Inference-Time Scaling for Joint Audio-Video Generation

Link: https://arxiv.org/abs/2606.03183

Authors: Jaemin Jung,Kyeongha Rho,Inkyu Shin,Joon Son Chung

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Joint audio-video generation, realistic audio-video pairs, synthesize realistic audio-video, audio-video generation aims, Joint audio-video

Comments: Accepted by Transactions on Machine Learning Research (TMLR). Project page: [this https URL](https://jung-jaemin.github.io/ITS-AVGen-Proj/)

Abstract:Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: this https URL.
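To illustrate why rewards from heterogeneous verifiers need calibrating before they are combined, here is a minimal best-of-N selection sketch. It is not the paper's ARW algorithm, which learns its calibration parameters online; this stand-in uses a fixed per-verifier z-score normalization with equal weights. The verifier names and scores are hypothetical.

```python
from statistics import mean, pstdev


def select_best(scores):
    """Return the index of the candidate with the highest summed z-score.

    `scores` maps a verifier name to a list of raw scores, one per candidate.
    Each verifier's scores are standardized so that a verifier with a large
    numeric range cannot dominate the selection.
    """
    n = len(next(iter(scores.values())))
    totals = [0.0] * n
    for raw in scores.values():
        mu, sigma = mean(raw), pstdev(raw)
        if sigma == 0:
            continue  # this verifier cannot tell the candidates apart
        for i, r in enumerate(raw):
            totals[i] += (r - mu) / sigma
    return max(range(n), key=totals.__getitem__)


# Hypothetical scores for three candidates from two verifiers on different scales.
scores = {
    "text_alignment": [0.2, 0.8, 0.5],  # roughly in [0, 1]
    "av_sync": [100.0, 10.0, 90.0],     # roughly in [0, 100]
}
best = select_best(scores)
```

Summing the raw scores here would pick candidate 0 purely because of the larger `av_sync` scale, whereas the standardized sum prefers candidate 2, which is reasonably good on both objectives.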

98. 【2606.03180】GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

Link: https://arxiv.org/abs/2606.03180

Authors: Jonggwon Park,Seongeun Lee,Junhyun Park,Hannah Yun,Hyunwoong Kim,Sohyun Jeong,Hyewon Kang,Byungmu Yoon,Kyoyun Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: pairs naturally produced, leveraging image-report pairs, image-report pairs naturally, Vision-language models, clinical workflows

Comments:

99. 【2606.03175】Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Link: https://arxiv.org/abs/2606.03175

Authors: Xunyi Zhao,Sihao Lin,Gengze Zhou,Zerui Li,Shijie Li,Wei Tao,Jiajun Liu,Qi Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Instance Goal Navigation, specific object instance, underspecified natural-language description, Instance Goal, object instance

Comments:

Abstract:Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an underspecified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived this http URL, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
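The cost-sensitive idea in this abstract can be sketched in a few lines. The abstract does not give the exact decision rule or the Weighted Success Rate formula, so everything here is an assumption: the gain-to-cost ratio, the "ask only if the gain exceeds the cost" threshold, the success-minus-cost scoring, and the question types and numbers.

```python
from typing import Optional


def pick_question(gains: dict[str, float], costs: dict[str, float]) -> Optional[str]:
    """Return the question type with the best gain-to-cost ratio, or None.

    A question is only considered if its expected uncertainty reduction
    exceeds its cost; otherwise the agent keeps navigating without asking.
    Costs are assumed to be strictly positive.
    """
    best, best_ratio = None, 0.0
    for question, gain in gains.items():
        cost = costs[question]
        if gain <= cost:
            continue  # not worth the interaction cost
        ratio = gain / cost
        if ratio > best_ratio:
            best, best_ratio = question, ratio
    return best


def weighted_success(success: bool, asked: list[str], costs: dict[str, float]) -> float:
    """Score one episode: 1 for success minus the summed query costs, floored at 0."""
    if not success:
        return 0.0
    return max(0.0, 1.0 - sum(costs[q] for q in asked))


# Hypothetical expected uncertainty reductions and per-question costs.
gains = {"color": 0.30, "room": 0.50, "route": 0.90}
costs = {"color": 0.05, "room": 0.10, "route": 0.60}
choice = pick_question(gains, costs)
score = weighted_success(True, ["color", "room"], costs)
```

With these numbers the cheap `color` clarification wins even though `route` guidance carries the most information, which is the behavior a cost-aware metric is meant to reward.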

100. 【2606.03168】JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

Link: https://arxiv.org/abs/2606.03168

Authors: Yinan Chen,Chuming Lin,Zhennan Chen,Yuxiang Zeng,Junwei Zhu,Yali Bi,Xijie Huang,Chengming Xu,Donghao Luo,Zhucun Xue,Xiaobin Hu,Chengjie Wang,Yong Liu,Jiangning Zhang,Shuicheng Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: joint audio-visual editing, instruction-guided joint audio-visual, joint audio-visual, editing remains constrained, audio-visual editing remains

Comments: Equal contributions from first two authors. Project page: [this https URL](https://ryanchenyn.github.io/projects/JAVEdit) Code: [this https URL](https://github.com/RyanChenYN/JAVEdit) Dataset: [this https URL](https://huggingface.co/datasets/Coraxor/JAVEdit-100k)

Abstract:While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.

101. 【2606.03160】SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

Link: https://arxiv.org/abs/2606.03160

Authors: Qiuxia Wu,Jiarui Lan,Wenxiong Kang,Zhiyong Wang,Kun Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recognizing human actions, perception driven applications, Recognizing human, point cloud sequences, perception driven

Comments: 13 pages, 11 figures. Accepted by IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Recognizing human actions from point cloud sequences is critical for 3D perception driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces in line with coarse to detailed motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.

102. 【2606.03159】NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

Link: https://arxiv.org/abs/2606.03159

Authors: NVIDIA: Aarti Basant,Amlan Kar,Despoina Paschalidou,Fangyin Wei,Francesco Ferroni,Guillermo Garcia Cobo,Haithem Turki,Huan Ling,Jaewoo Seo,James Lucas,Jay Zhangjie Wu,Jialiang Wang,Jonathan Lorraine,Jun Gao,Kai He,Katarina Tothova,Kevin Xie,Michał Tyszkiewicz,Qi Wu,Riccardo de Lutio,Ruilong Li,Sanja Fidler,Seung Wook Kim,Tianchang Shen,Tianshi Cao,Tobias Pfaff,William Lew,Xindi Wu,Xuanchi Ren,Yifan Lu,Yuxuan Zhang,Zan Gojcic,Zian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: vehicle capabilities advance, long-tail scenarios remains, capabilities advance, critical bottleneck, autonomous vehicle capabilities

Comments:

Abstract:As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

103. 【2606.03148】$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

Link: https://arxiv.org/abs/2606.03148

Authors: Sreehari Rammohan,Huy Ha,Carl Vondrick

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robust visual classification, ignoring contextual distractors, Robust visual, main foreground objects, contextual distractors

Comments:

Abstract:Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
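The "where to look" half of this pipeline can be sketched as a crop-box computation. This is an assumption-laden illustration rather than the authors' code: it uses a single attention peak, a fixed crop size no larger than the image, and a hypothetical patch-level attention map given as a nested list. Embedding the crop with the larger model is left out.

```python
def crop_box_around_peak(attn, image_size, crop_size):
    """Return (top, left, bottom, right) of a crop centred on the attention peak.

    attn:       2D list of attention weights over a grid of patches
    image_size: (height, width) of the image in pixels
    crop_size:  (height, width) of the crop in pixels, assumed <= image_size
    """
    rows, cols = len(attn), len(attn[0])
    # Locate the patch with the highest attention weight.
    peak_r, peak_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: attn[rc[0]][rc[1]],
    )
    img_h, img_w = image_size
    crop_h, crop_w = crop_size
    # Centre of the peak patch in pixel coordinates.
    centre_y = (peak_r + 0.5) * img_h / rows
    centre_x = (peak_c + 0.5) * img_w / cols
    # Centre the crop on the peak, then clamp it to stay inside the image.
    top = min(max(int(round(centre_y - crop_h / 2)), 0), img_h - crop_h)
    left = min(max(int(round(centre_x - crop_w / 2)), 0), img_w - crop_w)
    return top, left, top + crop_h, left + crop_w


# Hypothetical 4x4 attention map from a small model, with its peak at row 2, column 1.
attn = [
    [0.01, 0.02, 0.01, 0.01],
    [0.02, 0.10, 0.05, 0.01],
    [0.03, 0.50, 0.10, 0.02],
    [0.01, 0.05, 0.04, 0.02],
]
box = crop_box_around_peak(attn, image_size=(224, 224), crop_size=(112, 112))
```

The resulting box would then be cut out of the original image and passed to the larger embedding model, so that localization and representation quality come from different backbones.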

104. 【2606.03142】Disentangling Visual and Factual Correctness in LVLMs' Visualization Literacy

Link: https://arxiv.org/abs/2606.03142

Authors: Soohyun Lee,Jaeyoung Kim,Seokhyeon Park,Sihyeon Lee,Jiwon Song,Bohyoung Kim,Hyunjoo Song,Jinwook Seo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, show strong visualization, responses reflect genuine, strong visualization interpretation

Comments: Under review at IEEE Transactions on Visualization and Computer Graphics (TVCG). 23 pages, 9 figures

Abstract:Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs: (1) several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several near-zero cases warrant caution. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: this https URL

105. 【2606.03120】KC-3DGS: Kurtosis-Constrained Gaussian Splatting for High-Fidelity View Synthesis

Link: https://arxiv.org/abs/2606.03120

Authors: Vivekjyoti Banerjee,Abhay Yadav,Rama Chellappa,Aniket Roy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anisotropic Gaussians optimized, Gaussian Splatting, anisotropic Gaussians, Gaussians optimized, enables real-time

Comments:

Abstract:3D Gaussian Splatting (3DGS) enables real-time novel view synthesis by representing scenes as collections of anisotropic Gaussians optimized via differentiable rasterization. However, standard pixel-space losses (L1, SSIM) constrain only aggregate reconstruction error, permitting the optimization to redistribute error across frequency scales. This leads to oversmoothing and structural artifacts, particularly in sparse-view settings where supervision is limited. We propose KC-3DGS, which augments 3DGS training with wavelet-domain supervision based on natural image statistics. Our method combines three components: (1) a multi-scale wavelet coefficient alignment loss that explicitly penalizes missing high-frequency detail, (2) a supervised kurtosis concentration loss that encourages rendered images to match the heavy-tailed frequency statistics of ground-truth images, and (3) a cross-band covariance penalty that promotes frequency specialization. We provide theoretical analysis showing that pixel-space losses admit a family of indistinguishable perturbations under wavelet redistribution, and that our joint objective excludes degenerate solutions. Experiments across MipNeRF360, TanksTemples, MVImgNet, DeepBlending, and WRIVA-ULTRRA demonstrate consistent improvements in perceptual quality. On the challenging WRIVA-ULTRRA outdoor dataset, KC-3DGS achieves a 9.48% improvement in DreamSim while also improving PSNR, SSIM, and LPIPS. In sparse-view settings with only 12 training images, our method improves PSNR by up to 0.5 dB on MipNeRF360 while maintaining perceptual quality. The approach integrates seamlessly into existing 3DGS pipelines as a plug-and-play regularization strategy.
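As a rough illustration of matching heavy-tailed wavelet statistics, the sketch below compares the kurtosis of high-frequency Haar coefficients between a rendered signal and a reference. It is a simplified stand-in and not the paper's loss: it works on 1D signals with a single Haar level, uses an absolute difference as the penalty, and all function names and sample values are hypothetical.

```python
def haar_detail(signal):
    """One level of the 1D Haar transform: return the detail (high-frequency) coefficients."""
    scale = 2 ** 0.5
    return [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal) - 1, 2)]


def kurtosis(values):
    """Non-excess kurtosis: fourth central moment over squared variance (3 for a Gaussian)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0  # convention chosen here for a flat band, to avoid dividing by zero
    return m4 / (m2 ** 2)


def kurtosis_gap(rendered, target):
    """Penalty that grows as the rendered detail statistics drift from the target's."""
    return abs(kurtosis(haar_detail(rendered)) - kurtosis(haar_detail(target)))


# Hypothetical signals: the target has one sharp edge (sparse, heavy-tailed details),
# while the rendered signal spreads its detail energy more evenly across positions.
target = [0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0]
rendered = [0.0, 1.0, 0.0, 3.0, 0.0, 1.0, 0.0, 3.0]
gap = kurtosis_gap(rendered, target)
```

A sparse set of strong edges gives detail coefficients with higher kurtosis than evenly spread detail, so the gap is nonzero here even though both signals contain high-frequency content.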

106. 【2606.03119】GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

Link: https://arxiv.org/abs/2606.03119

Authors: Zehua Chen,Yucheng Yang,Binjie Yuan,Kaiwen Zheng,Jun S. Liu,Jun Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: prior, generation in diffusion, Guidance, prior exploitation, classifier-free guidance

Comments: ICML 2026

Abstract:Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods that create a quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training, hindering prior exploitation and thereby degrading the denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, fulfilling their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.

107. 【2606.03118】Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

Link: https://arxiv.org/abs/2606.03118

Authors: Jacob Lavoie,Marwan Besrour,William Lemaire,Jean Rouat,Réjean Fontaine,Eric Plourde

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

Keywords: age-related macular degeneration, retinal ganglion cells, photoreceptor layer, age-related macular, macular degeneration

Comments: 18 pages, 6 figures. Published version: Biomed. Phys. Eng. Express 10, 025006 (2024)

Abstract:Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion cells with a microelectrode array such as epiretinal implants. Epiretinal implants are known to generate visible anisotropic shapes elongated along the axon fascicles of neighboring retinal ganglion cells. Recent work has demonstrated that to obtain isotropic pixel-like shapes, it is possible to map axon fascicles and avoid stimulating them by inactivating electrodes or lowering stimulation current levels. Avoiding axon fascicle stimulation aims to remove brushstroke-like shapes in favor of a more reduced set of pixel-like shapes. Approach: In this study, we propose the use of isotropic and anisotropic shapes to render intelligible images on the retina of a virtual patient in a reinforcement learning environment named rlretina. The environment formalizes the task as using brushstrokes in a stroke-based rendering task. Main Results: We train a deep reinforcement learning agent that learns to assemble isotropic and anisotropic shapes to form an image. We investigate which error-based or perception-based metric is adequate to reward the agent. The agent is trained in a model-based data generation fashion using the psychophysically validated axon map model to render images as perceived by different virtual patients. We show that the agent can generate more intelligible images compared to the naive method in different virtual patients. Significance: This work shares a new way to address epiretinal stimulation that constitutes a first step towards improving visual acuity in artificially-restored vision using anisotropic phosphenes.

108. 【2606.03114】FAF-CD: Frequency-Aware Fusion for Change Detection under Imperfect Multimodal Remote Sensing

Link: https://arxiv.org/abs/2606.03114

Authors: Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing change, sensing change detection, imperfect heterogeneous observations, Remote sensing, affected by illumination

Comments: Code will be released at [this https URL](https://github.com/VimsLab/FAF-CD)

Abstract:Remote sensing change detection for real-world monitoring often relies on imperfect heterogeneous observations, where pre- and post-event images may be asynchronous, cross-sensor, or affected by illumination, seasonal, and modality shifts. This setting is especially challenging for EO-SAR disaster mapping, where nuisance variation can resemble structural damage. We propose FAF-CD, a frequency-aware hybrid framework with a DINOv3-pretrained ConvNeXt encoder and a linear-complexity VMamba-based decoder. Its rectification-aware tri-branch fusion module combines deformable spatial alignment with Fourier and Haar-wavelet comparisons, using adaptive gating to aggregate complementary cues across scales. On BRIGHT validation, a matched heterogeneous EO-SAR adaptation improves clean and perturbed tc-mIoU/tc-mAP over NeXt2Former-CD. FAF-CD also generalizes to binary optical CD, achieving 0.924 cF1 on LEVIR-CD and 0.955 cF1 on WHU-CD, and obtains the best average perturbed cIoU/cF1 on both binary datasets among M-CD and NeXt2Former-CD under pseudo-change-aligned stress tests. It further reduces cost by approximately 24 GFLOPs relative to NeXt2Former-CD while maintaining or improving accuracy.

109. 【2606.03111】Inverting the Generation Process of Denoising Diffusion Implicit Models: Empirical Evaluation and a Novel Method

Link: https://arxiv.org/abs/2606.03111

Authors: Yan Zeng, Masanori Suganuma, Takayuki Okatani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inverting the DDIM, DDIM image generation, initial noise map, DDIM image, recover latent variables

Abstract:This paper studies the problem of inverting the DDIM image generation process to recover latent variables, particularly the initial noise map, from a generated image. Existing methods often struggle with accuracy in this task. We propose a novel hybrid approach that combines direct inversion via gradient descent for the first step, followed by a fixed-point method for subsequent steps. Empirical evaluations across three datasets demonstrate that our method significantly improves the prediction of initial latent variables while achieving superior reconstruction accuracy. Additionally, we introduce a new evaluation, called the self-interpolation test, which assesses the quality of images generated from interpolated points between the true and predicted latent maps, offering deeper insights into performance. Our results reveal that while existing methods perform reasonably well in reconstruction, they consistently fail to accurately predict the initial latent variables, resulting in poor performance on the self-interpolation test. In contrast, our method outperforms all others across all metrics, providing valuable insights into diffusion models and enhancing their applications in image generation and editing.
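As an illustrative aside, the fixed-point part of the idea can be sketched on a scalar toy problem. This is not the authors' implementation: the noise predictor `toy_eps`, the cumulative-alpha values, and all function names are assumptions made for the example, and the gradient-descent first step described in the abstract is omitted. The sketch uses only the standard deterministic DDIM update and solves it for the noisier sample by repeated substitution.

```python
import math

def toy_eps(x, t):
    """Hypothetical stand-in for the noise predictor (a real one is a neural network)."""
    return 0.1 * math.tanh(x)

def ddim_step(x_t, t, a_t, a_prev, eps=toy_eps):
    """Deterministic DDIM update x_t -> x_{t-1}; a_t and a_prev are cumulative alphas."""
    e = eps(x_t, t)
    x0_pred = (x_t - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1 - a_prev) * e

def invert_step_fixed_point(x_prev, t, a_t, a_prev, eps=toy_eps, iters=50):
    """Solve ddim_step(x_t) == x_prev for x_t by fixed-point iteration."""
    scale = math.sqrt(a_t / a_prev)
    coef = math.sqrt(1 - a_t) - math.sqrt(a_t * (1 - a_prev) / a_prev)
    x_t = scale * x_prev  # initial guess: drop the noise term
    for _ in range(iters):
        # Rearranged DDIM update with the unknown x_t on both sides.
        x_t = scale * x_prev + coef * eps(x_t, t)
    return x_t

x_prev = 0.8
x_t = invert_step_fixed_point(x_prev, t=10, a_t=0.5, a_prev=0.6)
# Re-running the forward update on the recovered x_t should reproduce x_prev.
residual = abs(ddim_step(x_t, 10, 0.5, 0.6) - x_prev)
```

The iteration converges here because the toy predictor changes slowly with its input; with a real network, convergence is not guaranteed in general.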

110. 【2606.03100】Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

Link: https://arxiv.org/abs/2606.03100

Authors: Dongsheng Wang, Dawei Su, Hui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: spatial reasoning capabilities, gained increasing research, increasing research interest, research interest due, promising spatial reasoning

Comments: 19 pages, 6 figures

Abstract:Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose \texttt{KeyVT}, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.

111. 【2606.03084】Hierarchical Federated Learning with Dynamic Clustering and Adaptive Regularization for Robust Infrastructure Inspection

Link: https://arxiv.org/abs/2606.03084

Authors: Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: data-driven computer vision, data silo dilemma, silo dilemma due, structural health monitoring, computer vision models

Abstract:The deployment of data-driven computer vision models for structural health monitoring (SHM) is heavily constrained by the data silo dilemma due to stringent privacy and security regulations. While federated learning (FL) offers a privacy-preserving collaborative alternative, its application to nationwide infrastructure networks is severely hindered by the challenge of ``double heterogeneity'': macro-level physical divergence across disparate structural types and micro-level statistical imbalances within local datasets. To overcome this challenge, this paper proposes a novel hierarchical federated learning framework. The framework orchestrates a synergistic two-tier optimization strategy. At the macro-level, a dynamic gradient-based clustering mechanism autonomously aggregates distributed clients into specialized expert groups based on their structural degradation trajectories, circumventing the need for prior geographical metadata. Concurrently, at the micro-level, an intra-cluster Dynamic Region-Adaptive Proximal Regularization (DRAPR) module computes a real-time statistical Non-IID Intensity Score for each client. By adaptively modulating a proximal penalty based on local label skewness and gradient divergence, DRAPR effectively calibrates local updates, mitigates client drift, and prevents the catastrophic forgetting of minority damage classes. Comprehensive evaluations on a large-scale, real-world structural inspection dataset demonstrate that the hierarchical integration of macro-clustering and micro-regularization successfully neutralizes dual-level heterogeneity, yielding highly robust and specialized diagnostic models for complex infrastructure inspection.

112. 【2606.03075】TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

Link: https://arxiv.org/abs/2606.03075

Authors: Jizhihui Liu, Ruizi Han, Miao Zhang, Rui Shao, Xuebo Liu, Weili Guan, Yaowei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: auto-regressive generation paradigm, inherit the auto-regressive, context length, auto-regressive generation, generation paradigm

Comments: Accepted by ICML-2026

Abstract:Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality. Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. By systematically analyzing the modality gap in VLMs in this work, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose a Text-Grounded KV Eviction method for VLMs (TGV-KV). TGV-KV comprises three submodules: (1) Text-Vision Budgeting (TVB) assigns budget to each layer based on the mutual information interaction. (2) Text-Weighted Ranking (TWR) assesses the priority of text and ranks vision importance based on weighted text-image attention. (3) Text-Prioritised Retention (TPR) policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% full-KV accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts end-to-end throughput by 52.6% with an extreme retention budget of 5%. Code is available at this https URL.

113. 【2606.03069】ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements

Link: https://arxiv.org/abs/2606.03069

Authors: Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: medical images prevents, Shape Regularization Extractor, Whitening Transform-based Probabilistic, Transform-based Probabilistic Shape, Probabilistic Shape Regularization

Comments: 8 pages, 6 figures; code available at [this https URL](https://github.com/213269/WT-PSE-code-main)

Abstract:Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance-based knowledge distillation to achieve robust cross-domain segmentation. This study systematically examines improvements to the WT-PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per-pixel binary cross-entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain-adaptive augmentation including random erasing, gamma correction, and salt-and-pepper noise; (2) a hybrid BCE and Dice loss function for improved edge-aware segmentation under noisy conditions; (3) a curriculum-based Dice weight scheduling strategy; and (4) command-line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic-disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch-5 Dice score of 0.939. These results indicate that training-level improvements can provide consistent performance gains without modifying the underlying WT-PSE architecture.
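For readers unfamiliar with enhancements (2) and (3), a minimal sketch of a hybrid BCE and Dice loss with a scheduled Dice weight is shown below. It is not taken from the paper's code: the linear ramp, the `warmup_epochs` and `max_weight` values, and the function names are assumptions for illustration, and the inputs are flat lists of per-pixel probabilities rather than image tensors.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy over flat lists of probabilities and labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 minus the smoothed overlap ratio."""
    inter = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def dice_weight(epoch, warmup_epochs=10, max_weight=0.5):
    """Hypothetical curriculum: ramp the Dice weight linearly from 0 to max_weight."""
    return max_weight * min(1.0, epoch / warmup_epochs)

def hybrid_loss(probs, targets, epoch):
    """Blend BCE and Dice; early epochs rely mostly on BCE."""
    w = dice_weight(epoch)
    return (1 - w) * bce_loss(probs, targets) + w * dice_loss(probs, targets)
```

Starting with a small Dice weight is one common way to avoid unstable early training, which is the concern the abstract raises; the actual schedule used in the paper may differ.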

114. 【2606.03050】FCUS-rPPG: A Fast-Converging Unsupervised Framework for Remote Photoplethysmography via Gradient Oscillation Suppression

Link: https://arxiv.org/abs/2606.03050

Authors: Jiajie Li, Yu Liu, Rencheng Song, Xun Chen, Juan Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables non-contact extraction, blood volume pulse, Remote photoplethysmography, enables non-contact, volume pulse

Abstract:Remote photoplethysmography (rPPG) enables non-contact extraction of blood volume pulse (BVP) signals using consumer-grade cameras. Recent unsupervised rPPG methods learn BVP representations without requiring ground-truth physiological annotations, yet their optimization is often hindered by noisy and unstable gradients, resulting in slow convergence and limited cross-domain generalization. In this paper, we propose FCUS-rPPG, a fast-converging unsupervised rPPG framework with strong generalization capability. Motivated by the observation that BVP representations exhibit both multi-spectral covariation and low-dimensional manifold structure, we design a spectrally shared backbone that facilitates BVP feature disentanglement while improving optimization efficiency. To jointly enhance convergence stability and generalization performance, we further develop a unified optimization framework operating at the gradient, loss-landscape, and feature-representation levels. Specifically, a post-verification masking mechanism filters out misleading gradients according to the weak-amplitude physiological prior of BVP signals; a perturbation-based loss landscape smoothing strategy steers optimization toward more generalizable flat minima; and a noise-aware null-space regularization constrains feature updates to the orthogonal complement of the noise subspace, thereby mitigating noise-induced representation drift. Extensive experiments on five datasets demonstrate that FCUS-rPPG requires only one training epoch, whereas existing methods typically require tens to hundreds of epochs. Notably, FCUS-rPPG consistently achieves state-of-the-art (SOTA) performance in cross-dataset evaluations. This study provides an efficient and robust solution to the real-world deployment of unsupervised rPPG. The source code will be publicly available at this https URL.

115. 【2606.03005】MUSE: A Unified Agentic Harness for MLLMs

Link: https://arxiv.org/abs/2606.03005

Authors: Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: humans solve effortlessly, correct puzzle piece, large language models, multimodal large language, rapid progress

Abstract:Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.

116. 【2606.02996】MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry

Link: https://arxiv.org/abs/2606.02996

Authors: Yiquan Li, Taeyoung Yeon, Chenfeng Gao, Vasco Xu, Xuanyou Liu, Karan Ahuja

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Inertial Measurement Units, Measurement Units, Inertial Measurement, augmented reality, wearable devices

Comments: CVPR 2026 Findings

Abstract:Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at this https URL.

117. 【2606.02979】Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

Link: https://arxiv.org/abs/2606.02979

Authors: Oskar Natan, Jun Miura

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: compact deep multi-task, autonomous driving perception, deep multi-task learning, driving perception tasks, forward pass

Comments: This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. [this https URL](https://ieeexplore.ieee.org/document/9712213)

Abstract:We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs multiple views of semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection simultaneously without being supported by other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that occurred due to plenty of given tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves a better performance. Furthermore, a comparative study is also conducted to clarify its performance and effectiveness against the combination of some recent models. As a result, our model maintains better performance even with much fewer parameters. Hence, the model can inference faster with less GPU memory utilization. Moreover, the result tends to be consistent in 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share codes and other files publicly at this https URL.

118. 【2606.02962】Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

Link: https://arxiv.org/abs/2606.02962

Authors: Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)

Keywords: Egocentric Natural Language, Natural Language Query, Egocentric Natural, Natural Language, long first-person video

Comments: Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

Abstract:Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or their immediate [...]. We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly-semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.

119. 【2606.02956】The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

Link: https://arxiv.org/abs/2606.02956

Authors: Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert, Nils Rack, Kaiwen Wang, Fabian Konstantinidis, Julian Truetsch, Carlos Fernandez, Annika Bätz, Kevin Rösch, Marlon Steiner, Willi Poh, Yinzhe Shen, Royden Wagner, Felix Hauser, Dominik Strutz, Jaime Villa, Gleb Stepanov, Holger Caesar, Ömer Şahin Taş, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: enabled major progress, major progress, enabled major, fall short, present KITScenes Multimodal

Comments: 28 pages, 21 figures

Abstract:Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization. Our HD maps are, to our knowledge, the most complete of any sensor dataset, validated through autonomous driving trials on open-source software. For the first time in a public dataset, all driving-relevant traffic elements, such as traffic lights, are mapped in 3D to a reprojection-accurate level with full topological connectivity. Recorded in cities with irregular street layouts and mixed traffic modes, our dataset complements existing datasets by broadening the available geographic diversity. We also introduce four benchmarks, each advancing spatial learning for embodied AI: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. Project page: this https URL

120. 【2606.02951】SCOPE: Real-Time Natural Language Camera Agent at the Edge

Link: https://arxiv.org/abs/2606.02951

Authors: Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: real-world task demands, reflect real-world task, Deploying language-driven agents, robotics requires evaluations, Blender-based simulation environment

Comments: 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16–19, 2026. Code: this https URL

ACM classes: I.2.9; I.2.10; I.2.7; I.2.11

Journal reference: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026

Related DOI: https://doi.org/10.1145/3757279.3785641

121. 【2606.02947】BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

Link: https://arxiv.org/abs/2606.02947

Authors: Ivan Sabolić, Marin Oršić, Josip Šarić, Sven Lončarić

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: adapting autoregressive vision-language, autoregressive vision-language models, Supervised fine-tuning, downstream tasks, predominant approach

Comments: Accepted to ICML 2026

Abstract:Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.

122. 【2606.02935】CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

Link: https://arxiv.org/abs/2606.02935

Authors: Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)

Keywords: Accurate registration, essential for establishing, establishing ground truth, Accurate, CAD

Abstract:Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below $0.1^\circ$ without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
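The axis-estimation step (PCA on fitted ellipse centers) can be illustrated with a small stdlib-only sketch. It is not the authors' code: the ellipse fitting and RANSAC outlier removal are assumed to have already produced the list of 3D centers, power iteration stands in for a full eigendecomposition, and the function names and synthetic data are hypothetical.

```python
import math

def principal_axis(points, iters=200):
    """Return (mean, unit direction) of the dominant PCA axis of 3D points.

    Assumes the points are not all identical and that the start vector is
    not orthogonal to the dominant direction.
    """
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        # Power iteration: repeatedly apply the covariance and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def tilt_degrees(axis):
    """Angle between the estimated axis and the z (slice-stacking) direction."""
    cos_tilt = abs(axis[2]) / math.sqrt(sum(x * x for x in axis))
    return math.degrees(math.acos(min(1.0, cos_tilt)))

# Synthetic ellipse centers along a cylinder axis tilted slightly in x.
centers = [(0.1 * z, 0.0, float(z)) for z in range(20)]
mean, axis = principal_axis(centers)
tilt = tilt_degrees(axis)
```

On this noise-free synthetic input the recovered tilt equals the arctangent of the x-slope; real CT data would need the outlier removal the abstract describes before this step.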

123. 【2606.02927】SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks

Link: https://arxiv.org/abs/2606.02927

Authors: Mourad Zaied (University of Gabes, Tunisia)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: BatchNorm and LayerNorm, LayerNorm have long, long been considered, considered essential, essential for stable

Comments: 34 pages

Abstract:Normalization layers such as BatchNorm and LayerNorm have long been considered essential for stable training in deep networks. This work demonstrates that they can be fully replaced by a single learnable activation mechanism. We identify a plasticity suppression effect induced by standard normalization: learnable activation parameters rapidly lose adaptability when paired with normalization layers. Motivated by this observation, we introduce SALU (Saturated Adaptive Linear Unit), \[ \operatorname{SALU}(x;a,b) = \frac{a x}{\sqrt{1 + a b x^2}},\quad a>0,\; b>0 \] a bounded, learnable activation that provides intrinsic signal stabilization without relying on batch statistics or external affine parameters. Building on SALU, we propose SaluNet, a paradigm grounded in total plasticity: SALU replaces normalization layers, while SWALU and GALU replace standard activations. With ResNet-18, SaluNet-C-18 achieves 97.35\% on CIFAR-10 and 83.25\% on CIFAR-100 without normalization, maintaining 93.44\% and 76.23\% at batch size 1 where normalized architectures fail. For transformers, SaluNet-T improves over LayerNorm-GELU from 90.92\% to 91.01\% on CIFAR-10 and from 66.54\% to 68.10\% on CIFAR-100. SaluNet-C-50 reaches 78.67\% Top-1 on ImageNet-1K at $224\times224$, and $79.23\%$ at $288\times288$. These results suggest normalization layers suppress total plasticity, a property biological neurons inherently possess, enabling deep networks to learn effectively.
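The SALU formula quoted in the abstract is simple enough to evaluate directly. The scalar sketch below assumes the parameter constraints are a > 0 and b > 0, treats `a` and `b` as plain floats rather than learnable tensors, and uses a hypothetical function name; it only illustrates the shape of the activation, not the paper's training setup.

```python
import math

def salu(x, a=1.0, b=1.0):
    """SALU(x; a, b) = a*x / sqrt(1 + a*b*x^2), assuming a > 0 and b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a * x / math.sqrt(1.0 + a * b * x * x)

def salu_bound(a, b):
    """Saturation level: |SALU| approaches sqrt(a / b) as |x| grows."""
    return math.sqrt(a / b)

near_zero_slope = salu(1e-6, a=2.0, b=0.5) / 1e-6   # close to a
large_input = salu(1e6, a=2.0, b=0.5)               # close to sqrt(a / b)
```

Two properties follow from the formula: near zero the function behaves like the linear map a·x, and for large |x| it saturates at ±sqrt(a/b), which is the boundedness the abstract refers to.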

124. 【2606.02924】ATLAS: A Large-Scale Evaluation Benchmark for Adversarial LiDAR Perception

Link: https://arxiv.org/abs/2606.02924

Authors: Mellon M. Zhang, Siddhant Panse, Zimo Fan, Akshal Dhal, Rishit Sarkar, Glen Chou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world deployment requires, Autonomous driving perception, clean benchmark data, Autonomous driving, deployment requires robustness

Comments: preprint

Abstract:Autonomous driving perception is typically evaluated on clean benchmark data, yet real-world deployment requires robustness to rare, structured, and potentially adversarial sensor anomalies. This gap is especially critical for LiDAR, where external actors can physically manipulate the sensing process to induce black-box perception failures without accessing the model. Existing LiDAR benchmarks provide little visibility into this failure mode. Prior adversarial LiDAR studies have largely centered on attack hardware, geometric and algorithmic defenses, and early-generation detectors, leaving the robustness of modern perception systems unexplored. To address this evaluation gap, we introduce ATLAS (Adversarial Temporal LiDAR Attack Suite), the first large-scale, physically grounded evaluation benchmark for LiDAR perception models under black-box sensor attacks, simulating the two primary attack modes -- point injection and point removal -- across real driving sequences. Evaluating a broad cross-section of current state-of-the-art LiDAR perception models, ATLAS reveals a surprising robustness asymmetry: models with stronger performance on standard benchmarks tend to better withstand removal attacks, yet are actually more vulnerable to injection attacks than weaker models. We trace this vulnerability to standard object database sampling augmentations, revealing how current training practices can induce architecture-agnostic robustness failures, and study initial directions for mitigating both attack modes. We release the ATLAS generation code to support extensible, reproducible evaluations as attack capabilities evolve, helping make black-box sensor robustness an explicit consideration in future LiDAR perception development.

125. 【2606.02919】Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

链接：https://arxiv.org/abs/2606.02919

作者：Yufan Zhang,Yu Ji,Ayo Ajiboye,Rundi Wu,Yu Guo,Changxi Zheng,Jinwei Ye

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：dynamic portrait videos, dynamic portrait video, dynamic portrait, relighting dynamic portrait, present a diffusion-based

备注： ACM SIGGRAPH 2026 Journal Track / ACM Transactions on Graphics, 17 pages. Project page: [this https URL](https://yufanzhang82.github.io/PixelCube/)

点击查看摘要

Abstract:We present a diffusion-based method for relighting dynamic portrait videos with photorealism and temporal consistency. Our method is fueled by a hybrid training dataset that consists of real-captured and rendered dynamic portrait videos with diverse subject appearances, facial motions, head poses, and known lighting conditions. Specifically, we construct an LED-based lighting system for realistic lighting emulation and high-speed video relighting data acquisition. By leveraging the image priors embedded in pre-trained video diffusion models, and using per-frame high dynamic range (HDR) environment map as lighting control, we train a high-performance generative model for realistic and identity-preserving dynamic portrait video relighting. In addition to the environment map control, our model uses a synthesized background image to enable control on the camera's exposure level and color tone. Our model can produce temporally consistent relit portrait video that looks realistic and harmonious under a provided new environment and faithfully preserve the subject's expression and fine facial features, including skin tone, wrinkles, and facial hair. Our model generalizes well to unseen data, in terms of the subject appearance, motion, and lighting condition. We perform extensive experiments on relighting in-the-wild videos with various environment maps and demonstrate practical applications on portrait photography. Results show that our method achieves state-of-the-art performance in photorealism, lighting harmony, and temporal consistency.

126. 【2606.02915】Any2Poster: Any-Source Poster Generation Across Modalities and Domains

链接：https://arxiv.org/abs/2606.02915

作者：Amogh Vinaykumar,Aiden Li,Suozhi Huang,Shilong Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：generation remains difficult, surface-level visual similarity, communicating dense information, automatic poster generation, poster generation remains

备注： Project Page: [this https URL](https://github.com/Any2Poster/Any2Poster)

点击查看摘要

Abstract:Visual posters are a compact medium for communicating dense information, yet progress on automatic poster generation remains difficult to measure because existing evaluations are often restricted to paper-only inputs, narrow domains, or surface-level visual similarity. We introduce Any2Poster Bench, a benchmark for any-source poster generation that evaluates systems across eight input modalities--PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, and videos--and five content domains. Any2Poster Bench pairs each source with quiz-based probes of verbatim factual retention and interpretive understanding, together with VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow, enabling reproducible assessment of both information fidelity and visual communication. To instantiate and validate this benchmark, we further present Any2Poster Agent, an end-to-end reference agent that parses heterogeneous sources, organizes salient content, plans poster layouts, renders posters, and iteratively refines them using visual feedback. On Any2Poster Bench, Any2Poster Agent achieves 87.25% average accuracy across input modalities and 87.28% across content domains. On PaperQuiz-style evaluation, where prior paper-to-poster agents are directly comparable, Any2Poster Agent improves over PosterAgent-4o from 51.06-51.33% to 72.58% overall accuracy and from 116-121 to 145.16 in density-augmented score. Together, Any2Poster Bench and Any2Poster Agent provide a reusable evaluation resource and a competitive baseline for studying multimodal, domain-general poster generation.

127. 【2606.02894】ny Collaborative Inference for Occlusion-Robust Object Detection

链接：https://arxiv.org/abs/2606.02894

作者：Chieh-Tung Cheng,Mustafa Aslanov,Eiman Kanjo

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：computer vision locally, IoT surveillance nodes, run computer vision, platforms are increasingly, vision locally

备注：

点击查看摘要

Abstract:Small edge devices such as IoT surveillance nodes and search-and-rescue (SAR) platforms are increasingly expected to run computer vision locally. On ultra-low-end hardware, however, object detection is limited by available memory and compute, by communication costs when several devices cooperate, and by the loss of accuracy caused by occlusion. The work evaluates occlusion-robust object detection on devices with less than 1 MB SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation. We evaluate two collaborative inference strategies: feature-level fusion, which concatenates intermediate feature maps, and decision-level fusion via Weighted Boxes Fusion (WBF). Under the tested occlusion settings, WBF outperforms feature-level fusion and gives gains of up to +0.2736 mAP in asymmetric occlusion scenarios. Extending fusion to three views improves accuracy further (up to +0.3827 mAP) while adding communication overhead (approximately 1.3 KB per exchange). The hardware experiments start with a host-assisted USB-relay baseline and then move to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF runs on-device and communication energy remains small relative to inference. In a representative 301.9 s autonomous session comprising 108 frames, fused output is observed on 61 frames compared with 47 for Board 2 alone, a frame-level coverage gain of +29.8%. We also include a small exploratory decentralised federated learning (DFL) feasibility note, but do not treat it as a main result because performance remains limited under non-iid local data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.
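
As a rough illustration of decision-level fusion, the sketch below pools detections from several views and merges overlapping boxes with confidence-weighted averaging, in the spirit of Weighted Boxes Fusion. It is a simplified sketch, not the paper's implementation: the names `fuse_views`, `merge`, and `iou`, the 0.55 IoU threshold, and the assumption that all views already share one image coordinate frame are illustrative choices that the abstract does not state.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, confidence); all views are assumed to share one image frame.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge(cluster: List[Box]) -> Box:
    """Confidence-weighted mean of coordinates and plain mean of confidences.

    Assumes every confidence is strictly positive.
    """
    total = sum(b[4] for b in cluster)
    x1, y1, x2, y2 = (sum(b[i] * b[4] for b in cluster) / total for i in range(4))
    return (x1, y1, x2, y2, total / len(cluster))


def fuse_views(views: List[List[Box]], iou_thr: float = 0.55) -> List[Box]:
    """Fuse per-view detections into one list of boxes (hypothetical helper)."""
    # Visit detections from most to least confident.
    pooled = sorted((b for v in views for b in v), key=lambda b: b[4], reverse=True)
    clusters: List[List[Box]] = []
    fused: List[Box] = []
    for box in pooled:
        # Find the fused box that overlaps this detection the most.
        best, best_iou = -1, iou_thr
        for i, f in enumerate(fused):
            overlap = iou(box, f)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best < 0:
            clusters.append([box])
            fused.append(box)
        else:
            clusters[best].append(box)
            fused[best] = merge(clusters[best])
    # Down-weight boxes that were confirmed by fewer views.
    n = len(views)
    return [
        (f[0], f[1], f[2], f[3], f[4] * min(len(c), n) / n)
        for f, c in zip(fused, clusters)
    ]
```

A box seen by only one of two views keeps its coordinates but has its confidence halved, which is one way a fused output can still report an object that a single occluded board misses. The reported frame-level coverage gain is consistent with the frame counts in the abstract: (61 - 47) / 47 ≈ 0.298.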

128. 【2606.02877】Pathway-Structured Privileged Distillation for Deployable Computational Pathology

链接：https://arxiv.org/abs/2606.02877

作者：Yongxin Guo,Hao Lu,Onur Koyun,Zhengjie Zhu,Muhammet Demir,Metin Gurcan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：cancer risk modelling, improve cancer risk, Integrating transcriptomics, risk modelling, routine settings

备注：

点击查看摘要

Abstract:Integrating transcriptomics and histopathology can improve cancer risk modelling, yet practical use is constrained by the limited availability of RNA profiling in routine settings. Here we introduce Mixture of Pathway Experts (MoPE), a knowledge-distillation framework that reframes multimodal learning as privileged distillation for histology-only inference. MoPE is motivated by the partial observability between RNA profiles and whole-slide images: histology can capture morphology-linked consequences of certain molecular programmes, but cannot be expected to reconstruct the full transcriptomic state. MoPE encodes RNA-derived pathways and transfers the molecular supervision to pathway-indexed pathology experts through memory-usage alignment. Across diverse public benchmarks and two independent breast cancer cohorts, MoPE consistently improved WSI-only inference performance relative to baseline methods. Pathway-usage analyses and human-audited visual inspection provide bounded inspection of model behaviour and candidate morphology-linked readouts. These results support pathway-structured privileged distillation as a promising route to using molecular information during training while preserving RNA-free inference.

129. 【2606.02831】Principled Reflection Separation via Nonlinear Superposition and Feature Interaction

链接：https://arxiv.org/abs/2606.02831

作者：Qiming Hu,Mingjia Li,Yuntong Li,Xiaojie Guo

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Single-image reflection separation, Single-image reflection, fundamentally challenged, image formation processes, complex image formation

备注： 23 pages

点击查看摘要

Abstract:Single-image reflection separation is fundamentally challenged by the entanglement of transmission and reflection layers under complex image formation processes. Existing approaches largely rely on simplified assumptions or independent modeling, limiting their ability to handle real-world scenarios. In this work, we revisit the problem from a unified perspective and identify a key issue of existing approaches, i.e., the widely adopted linear composition model in the sRGB domain fails to capture the nonlinear coupling introduced by real-world image signal processing pipelines. To address this, we introduce a learnable nonlinear superposition model that more faithfully characterizes layer interactions and improves decomposition fidelity. Building upon this formulation, we propose a generalized dual-stream interactive framework that explicitly models bidirectional dependencies between transmission and reflection through feature exchange. This framework unifies activation-, gating-, and attention-based interaction mechanisms, and is compatible with both CNN and Transformer backbones. Extensive experiments on diverse real-world benchmarks demonstrate that the proposed approach achieves superior performance with strong generalization capability. More importantly, our study reveals that reflection separation is not about undoing a linear mixture, but about learning nonlinear formation and interaction, offering new insights into the design of principled image decomposition models. Code and models are publicly available at this https URL.

130. 【2606.02809】Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

链接：https://arxiv.org/abs/2606.02809

作者：Bo Liu,Hanxue Gu,Xiangru Li,Zheren Zhu,Jacob Ellison,Kang Wang,Janine M. Lupo,Yang Yang,Hui Lin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Evaluating vision-language models, Evaluating vision-language, clinically grounded, Evaluating, evaluation confounds

备注：

点击查看摘要

Abstract:Evaluating vision-language models (VLMs) on medical images requires benchmarks that are clinically grounded, scalable, and controlled for evaluation confounds. Existing public benchmarks are limited in scale, manually annotated, or potentially leaked into VLM pretraining corpora. We present an automated agent-driven pipeline that generates multiple-choice VQA datasets directly from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style questions deterministically derived from clinician-defined reporting schemas, and radiology report-derived questions generated by an LLM from radiologist findings and verified against the source report. Applied to four in-house cancer cohorts, the pipeline yields an instance-contamination-controlled benchmark without per-question human annotation. Zero-shot evaluation of six VLMs reveals no dominant model and substantial headroom across all cells. A blind ablation reveals that visual reliance is highly dataset-specific: liver Report-derived questions genuinely require the image, while Lung CT is essentially solvable without it - the leading closed model exceeds its sighted accuracy on Lung CT when blinded - indicating that even private clinical data does not guarantee a contamination-controlled read of visual capability. The pipeline is released as an open agent skill for in-house redeployment.

131. 【2606.02800】Cosmos 3: Omnimodal World Models for Physical AI

链接：https://arxiv.org/abs/2606.02800

作者：Aditi,Niket Agarwal,Arslan Ali,Jon Allen,Martin Antolini,Adeline Aubame,Alisson Azzolini,Junjie Bai,Maciej Bala,Yogesh Balaji,Josh Bapst,Aarti Basant,Mukesh Beladiya,Mohammad Qazim Bhat,Zaid Pervaiz Bhat,Dan Blick,Vanni Brighella,Han Cai,Tiffany Cai,Eric Cameracci,Jiaxin Cao,Yulong Cao,Mark Carlson,Carlos Casanova,Ting-Yun Chang,Yan Chang,Yu-Wei Chao,Prithvijit Chattopadhyay,Roshan Chaudhari,Chieh-Yun Chen,Junyu Chen,Ke Chen,Qizhi Chen,Wenkai Chen,Xiaotong Chen,Yu Chen,An-Chieh Cheng,Click Cheng,Xiu Chia,Jeana Choi,Chaeyeon Chung,Wenyan Cong,Yin Cui,Magdalena Dadela,Nalin Dadhich,Wenliang Dai,Joyjit Daw,Alperen Degirmenci,Rodrigo Vieira Del Monte,Robert Denomme,Sameer Dharur,Marco Di Lucca,Ke Ding,Wenhao Ding,Yifan Ding,Yuzhu Dong,Nicole Drumheller,Yilun Du,Aigul Dzhumamuratova,Aleksandr Efitorov,Hamid Eghbalzadeh,Naomi Eigbe,Imad El Hanafi,Hassan Eslami,Benedikt Falk,Jiaojiao Fan,Jim Fan,Amol Fasale,Sergiy Fefilatyev,Liang Feng,Francesco Ferroni,Sanja Fidler,Xiao Fu,Vikram Fugro,Prashant Gaikwad,TJ Galda,Katelyn Gao,Yihuai Gao,Wenhang Ge,Sreyan Ghosh,Arushi Goel,Vivek Goel,Akash Gokul,Rama Govindaraju,Jinwei Gu,Miguel Guerrero,Elfie Guo,Aryaman Gupta,Siddharth Gururani,Hugo Hadfield,Song Han,Ankur Handa,Zekun Hao,Mohammad Harrim,Ali Hassani,Nathan Hayes-Roth,Yufan He,Chris Helvig,Cyrus Hogg,Madison Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)

关键词：omnimodal world models, world models designed, introduce Cosmos, generate language

备注：

点击查看摘要

Abstract:We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (this https URL) at this https URL and this https URL. The project website is available at this https URL.

132. 【2606.02789】Diagnosis of Human Object Interaction Detectors for Real World Educational Applications

链接：https://arxiv.org/abs/2606.02789

作者：Divya Mereddy,Ashwin Tudur Sadashiva,Marcos Quinones-Grueiro,Gautam Biswas

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：automatically analyzing student, analyzing student behavior, Human-object interaction, HOI, automatically analyzing

备注：

点击查看摘要

Abstract:Human-object interaction (HOI) recognition is critical for automatically analyzing student behavior in complex educational environments. Although state-of-the-art (SOTA) HOI detectors perform well on benchmark datasets, their performance often degrades when deployed in real-world training environments due to domain-specific objects, occlusions, and complex visual conditions. In this paper, we introduce a diagnosis-driven framework that integrates a triplet-level HOI error taxonomy with error-factor attribution analysis for real-world educational video data. We study this problem in the context of Critical Care Air Transport Team (CCATT) mixed-reality medical training. Based on an analysis of HOI failure modes and their causes, we develop a diagnosis-informed refinement strategy for adapting pretrained HOI models to the target domain. Experiments on the CCATT dataset show that this approach improves the macro-F1 score of a pretrained CDN model from 48.6 to 90.2 through targeted refinement guided by diagnosed error factors. These results highlight the value of detailed diagnostic analysis for informing targeted adaptation of HOI models in real-world educational environments.

133. 【2606.02774】GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

链接：https://arxiv.org/abs/2606.02774

作者：Yingzi Ma,Chaowei Xiao,Ming Jiang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：rules remains underexplored, diverse global settings, traffic rules remains, shown promising performance, Vision-language models

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.

134. 【2606.02764】From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry

链接：https://arxiv.org/abs/2606.02764

作者：Hsiao-Jou Hsu,Joachim Moortgat

类目：Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

关键词：complex coastal environments, optically complex coastal, Satellite-derived bathymetry, coastal environments, Barrier Reef regions

备注： 42 pages, 13 figures, 15 tables. Supplementary Information provided as ancillary file (anc/SI.pdf). Code and pretrained weights at [this https URL](https://github.com/buckai-observatory/DL_bathy)

点击查看摘要

Abstract:Satellite-derived bathymetry (SDB) from multispectral imagery is cost-effective but scales poorly across regions, especially in optically complex coastal environments. We evaluate machine learning and deep learning for transferable SDB over the 0-20 m depth range using Sentinel-2 imagery. A Random Forest baseline and four CNNs (ResNet-50, ResNet-101, EfficientNet-B4, ConvNeXt-Large) are trained on Pratas Island and selected Great Barrier Reef regions, then evaluated on spatially independent intra- and cross-regional test areas. Preserving spatial continuity during training, by keeping contiguous reef blocks rather than random patches, is the single most impactful design choice; we further introduce a Smooth Weight Function (SWF)-weighted RMSE loss that emphasizes near-surface depths. With these choices, intra-regional RMSE ranges from 1.15 to 1.92 m over 0-20 m and is as low as 0.26 m for depths ≤ 3 m. Random Forest degrades sharply under cross-regional transfer (RMSE from 1.53 m to 2.99-3.78 m), while the deep models stay more robust (2.46-2.98 m). On the public MagicBathyNet aerial-RGB benchmark (0-16 m) the proposed networks reach 0.19-0.22 m RMSE, outperforming a U-Net baseline and a task-specific transformer architecture with substantially fewer parameters. We further exploit multi-temporal repeat imagery: training on it broadens diversity, and median-aggregating predictions across passes at inference reduces noise from changing sun angles, atmospheric conditions, water properties, and tides. We release optimized architectures and pretrained weights to enable scalable transfer to new sites.
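
The median aggregation across satellite passes mentioned in the abstract can be sketched as a per-pixel median over repeat predictions. This is a minimal sketch under stated assumptions: the function name `median_aggregate`, the flat list-of-pixels layout, and the use of `None` for masked pixels (for example cloud cover) are illustrative and not taken from the paper or its released code.

```python
from statistics import median
from typing import List, Optional, Sequence


def median_aggregate(
    passes: Sequence[Sequence[Optional[float]]],
) -> List[Optional[float]]:
    """Per-pixel median depth (m) over repeat passes.

    Each pass is a flat sequence of predicted depths for the same pixels.
    None marks a pixel with no valid prediction in that pass (hypothetical
    convention, e.g. a cloud mask).
    """
    n_pixels = len(passes[0])
    if any(len(p) != n_pixels for p in passes):
        raise ValueError("all passes must cover the same pixels")
    aggregated: List[Optional[float]] = []
    for i in range(n_pixels):
        valid = [p[i] for p in passes if p[i] is not None]
        # A pixel never observed stays None instead of getting a made-up depth.
        aggregated.append(median(valid) if valid else None)
    return aggregated
```

The median is used rather than the mean because a single pass with an outlying prediction (say, from sun glint or an unusual tide) shifts a mean but leaves the median of three or more passes largely unchanged.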

135. 【2606.02753】MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

链接：https://arxiv.org/abs/2606.02753

作者：Teng Hu,Mingchun Lu,Yating Wang,Jiangning Zhang,Jinkun Hao,Ye Pan,Ran Yi,Lizhuang Ma,Dacheng Tao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：single agent observing, foundational generative technology, Video world models, single perspective, multi-agent video world

备注：

点击查看摘要

Abstract:Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment, a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well-aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.

136. 【2606.02747】Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

链接：https://arxiv.org/abs/2606.02747

作者：Fabian Degen,Oishi Deb,Jindong Gu,Junchi Yu,Samuele Marro,Philip Torr,Jialin Yu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：records define restrictions, indirect spatial evidence, Planning records define, geographic areas, machine-readable boundaries

备注： Project page: [this https URL](https://odeb1.github.io/Plan2Map_Project_Page/) . Fabian Degen and Oishi Deb Contributed Equally

点击查看摘要

Abstract:Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: this https URL.
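
The three headline numbers (mean IoU, median IoU, and the share of cases at or above 0.8 IoU) are summary statistics over per-case scores. The sketch below shows how such a summary could be computed from rasterised masks. The helpers `mask_iou` and `summarize_iou` and the representation of a mask as a set of pixel coordinates are assumptions for illustration; the benchmark's actual scoring code may differ, for example by scoring polygons directly.

```python
from statistics import mean, median
from typing import Dict, Sequence, Set, Tuple

Pixel = Tuple[int, int]


def mask_iou(pred: Set[Pixel], ref: Set[Pixel]) -> float:
    """IoU of two masks given as sets of (row, col) pixels."""
    union = pred | ref
    return len(pred & ref) / len(union) if union else 0.0


def summarize_iou(ious: Sequence[float], threshold: float = 0.8) -> Dict[str, float]:
    """Mean, median, and share of cases with IoU at or above the threshold."""
    if not ious:
        raise ValueError("need at least one IoU score")
    return {
        "mean": mean(ious),
        "median": median(ious),
        "share_at_or_above": sum(v >= threshold for v in ious) / len(ious),
    }
```

A median well above the mean, as in the reported 0.904 versus 0.736, is what this summary produces when most cases score high and a minority fail badly. With 208 cases, a 67.8% share would correspond to roughly 141 predictions at or above 0.8 IoU.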

137. 【2606.02742】Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

链接：https://arxiv.org/abs/2606.02742

作者：S Divakar Bhat,Toshihiko Yamasaki

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：metric distance queries, modern vision-language models, fundamental to robotics, remain unreliable, distance queries

备注：

点击查看摘要

Abstract:Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2-10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data are available at this https URL.

138. 【2606.02724】AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

链接：https://arxiv.org/abs/2606.02724

作者：Yaoting Wang,Yun Zhou,Zipei Zhang,Henghui Ding

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：speaker tracking aims, track active speakers, Audio-visual speaker tracking, enabling fine-grained, speaker tracking

备注： 19 pages, 10 figures, ICML 2026

点击查看摘要

Abstract:Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: this https URL

139. 【2606.02603】COD10K-C: Benchmarking Robustness of Camouflaged Object Detection Under Natural Image Corruptions

链接：https://arxiv.org/abs/2606.02603

作者：Arafat Hossain Sayem

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Camouflaged object detection, improved substantially, Camouflaged object, standard benchmarks evaluate, object detection

备注： 7 pages, 1 figure

点击查看摘要

Abstract:Camouflaged object detection has improved substantially, but most standard benchmarks evaluate models only on clean images. This is not realistic because real cameras often capture blur, sensor noise, weather effects, and compression artifacts. We present COD10K-C, a corruption robustness benchmark based on COD10K. It includes 8 corruption types and 5 severity levels, giving 40 conditions and 81,040 evaluation pairs in total. We evaluate three popular camouflaged object detection models, SINet-v2, PFNet, and ZoomNet, as well as a lightweight model called RobustCODLite. All models show clear performance drops on corrupted images. Motion blur and Gaussian blur cause the largest drops, with SINet-v2 losing 18.5 Dice points under motion blur. Brightness and fog are less harmful. RobustCODLite uses corruption augmentation, a frequency-prior branch, and an uncertainty-consistency loss. It retains 92.3% of its clean Dice score under corruption, compared with 87.7% for SINet-v2, 84.8% for ZoomNet, and 84.1% for PFNet. On the hardest corruptions, RobustCODLite matches or outperforms models that perform better on clean data. We will release the COD10K-C GitHub repository to support future research in robust camouflaged object detection.

140. 【2606.02602】Graph Mamba Survival Analysis Based on Topology-Aware ordering

链接：https://arxiv.org/abs/2606.02602

作者：Yuanfang Chen,Peiqiang Yan,Yuntao Shou,Qian Zhao,Xiangyong Cao

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：patient prognosis assessment, faces multiple technical, multiple technical challenges, Slide Images, Mamba

备注：

点击查看摘要

Abstract:In computational pathology, Whole Slide Images (WSIs) survival analysis is crucial for patient prognosis assessment, but it faces multiple technical challenges. Although the Transformer captures long-range dependencies through its self-attention mechanism, its $O(N^2)$ time complexity causes a severe computational bottleneck in large-scale WSIs graph structures. The Mamba model breaks through the Transformer's computational bottleneck with linear complexity. But, owing to Mamba's high sensitivity to the order of input data, traditional node sorting methods in Graph Mamba, such as those based on node degree or subgraph size, fail to adequately account for the topological connectivity of graph data. This inadequacy consequently restricts the performance of Mamba's sequential modeling. Moreover, its unidirectional architecture cannot leverage the bidirectional spatial structure of images. To address these challenges, this paper proposes a novel Graph Mamba survival analysis framework based on topology-aware ordering (TopoMamSurv) to adapt to the sequential sensitivity of Mamba. Our visualization experiments further confirmed that the nodes extracted through the topology-aware ordering (TAO) strategy indeed exhibit higher similarity. Furthermore, we designed a bidirectional Mamba module and integrated a Graph Convolutional Network (GCN) to achieve bidirectional spatial context modeling of images, forming a hierarchical feature learning architecture for "local aggregation - global capture." This framework effectively reconciles the contradiction between long-range dependency modeling, computational efficiency, and spatial structure utilization in WSIs analysis through its systematic design of TAO, bidirectional semantic modeling, and hierarchical feature fusion. This framework has been validated for its comprehensive performance advantage on five TCGA datasets.

141. 【2606.00384】VESTA: Visual Exploration with Statistical Tool Agents

Link: https://arxiv.org/abs/2606.00384

Authors: William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO)

Keywords: Fitting quantitative models, central step, step in scientific, Statistical Tool Agents, refine statistical models

Comments:

142. 【2606.00188】PaintBench: Deterministic Evaluation of Precise Visual Editing

Link: https://arxiv.org/abs/2606.00188

Authors: Kai Xu, Ellis Brown, Shrikar Madhu, Rob Fergus, He He, Saining Xie

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: single-answer edits remains, executing precise single-answer, precise single-answer edits, important obstacle, proficient at open-ended

Comments: Project Page: [this https URL](https://paintbench.github.io/)

Abstract: While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
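
As a rough illustration of what deterministic pixel-level scoring can look like, here is a minimal sketch of mean intersection-over-union (mIoU) over binary masks. The abstract does not give the exact PaintBench metric definition, so the mask representation and the helper names `iou` and `mean_iou` are assumptions made for illustration only.

```python
def iou(pred, target):
    """Intersection-over-union of two binary masks given as lists of rows."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so score them as 1.0.
    return 1.0 if union == 0 else inter / union


def mean_iou(pairs):
    """Average IoU over (prediction, target) mask pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


# Hypothetical 2x2 masks: one partially correct edit and one exact edit.
partial = ([[1, 1], [0, 0]], [[1, 0], [0, 0]])
exact = ([[0, 1], [0, 1]], [[0, 1], [0, 1]])
print(mean_iou([partial, exact]))
```

Because the score depends only on pixel values, the same prediction always receives the same score, which is the property the abstract contrasts with judge-model evaluation.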

143. 【2606.03940】SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

Link: https://arxiv.org/abs/2606.03940

Authors: Dan Jacobellis, Neeraja J. Yadwadkar

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: low-power hardware, vast amounts, resolution using low-cost, amounts of visual, visual data

Comments:

Abstract:In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at this https URL .

144. 【2606.02937】BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting

Link: https://arxiv.org/abs/2606.02937

Authors: Yanchen Wang, Lenny Aharon, Wangshu Zhu, Kyle Daruwalla, Linghua Zhang, Jiaru Zou, Selmaan Chettih, Helen Hou, Liam Paninski, Matthew R Whiteway

Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recordings remains challenging, remains challenging, experimental settings, Multi-view video, extracting rich

Comments:

Abstract:Multi-view video recordings are increasingly used to capture the 3D movements of animals in experimental settings, yet extracting rich 3D representations from these recordings remains challenging. Supervised pose estimation requires extensive manual annotation, while general-purpose 3D reconstruction models trained on generic scene datasets fail on the specialized imagery and sparse-view setting of laboratory experiments. We address these limitations with BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled, calibrated multi-view video. BEAST3D uses a vision transformer to predict 3D Gaussian splats that reconstruct held-out views through differentiable rendering, while simultaneously segmenting the animal from the background. BEAST3D reconstructs 3D structure with as few as four views by conditioning directly on known camera parameters--unlike general-purpose models, which must estimate camera geometry from dense overlapping viewpoints that are seldom available in lab settings. Through comprehensive evaluation across four species, we demonstrate that BEAST3D produces rich, viewpoint-invariant features that transfer effectively to three downstream tasks: novel view synthesis, which validates the quality of the learned 3D representations; multi-view pose estimation, which provides the sparse keypoint trajectories widely used in behavioral analysis; and neural encoding, which relates 3D behavioral features to simultaneously recorded neural activity. BEAST3D thus establishes a versatile framework for behavioral analysis that leverages 3D structure in modern multi-view laboratory recordings.

145. 【2606.02906】Depth from Dual Differential Defocus and Stereo Consensus

Link: https://arxiv.org/abs/2606.02906

Authors: Junjie Luo, Wei Xu, Dylan Chu, Emma Alexander, Qi Guo

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dual Differential Defocus, highly accurate depth, achieve highly accurate, closed-form algorithm, algorithm that unifies

Comments:

Abstract:We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depth using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only 4 mm baseline and 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This has surpassed the reported accuracy of certain commercially available stereo cameras with much larger form factors.
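
To give a feel for why a 4 mm baseline is demanding for triangulation on its own, the sketch below evaluates the textbook rectified pinhole stereo relations $d = fB/Z$ and $|\Delta Z| \approx \frac{Z^2}{fB}|\Delta d|$. This is not the D^3S method. The baseline and focal length come from the abstract, while the 1 µm disparity error and the function names are hypothetical values chosen for illustration.

```python
def stereo_disparity_mm(depth_mm, baseline_mm, focal_mm):
    """Sensor-plane disparity d = f * B / Z for a rectified pinhole stereo pair."""
    return focal_mm * baseline_mm / depth_mm


def depth_error_mm(depth_mm, baseline_mm, focal_mm, disparity_error_mm):
    """First-order depth error |dZ| ~= Z^2 / (f * B) * |dd|."""
    return depth_mm ** 2 / (focal_mm * baseline_mm) * disparity_error_mm


# Baseline and focal length from the abstract; the disparity error is an assumption.
B_MM, F_MM = 4.0, 12.0
ASSUMED_DISPARITY_ERROR_MM = 0.001  # 1 micrometre, hypothetical

for z_mm in (300.0, 1000.0, 1640.0):
    d = stereo_disparity_mm(z_mm, B_MM, F_MM)
    err = depth_error_mm(z_mm, B_MM, F_MM, ASSUMED_DISPARITY_ERROR_MM)
    print(f"Z={z_mm:7.1f} mm  disparity={d * 1000:6.1f} um  depth error={err:6.2f} mm")
```

Under these assumptions the depth error grows quadratically with distance, which is one way to read the abstract's motivation for combining stereo with a defocus cue.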

146. 【2606.02642】SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Link: https://arxiv.org/abs/2606.02642

Authors: Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)

Keywords: ungrounded outputs, produce plausible, plausible but ungrounded, audio-visual LLMs, audio-visual large-language models

Comments: Accepted at CVPR 2026

Abstract:Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: this https URL.

147. 【2606.02639】Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

Link: https://arxiv.org/abs/2606.02639

Authors: Spoorthi M, Suja Palaniswamy

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: RGB scene reconstruction, previously unreported failure, unreported failure mode, X-ray attenuation fields, suppresses density gradients

Comments:

Abstract: We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
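
The gradient-suppression effect described above can be illustrated numerically. The sketch below assumes the density is obtained as `softplus(raw + shift)`, which matches the public TensoRF code as far as I know but is not stated in the abstract. The function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Derivative of softplus, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def density_and_grad(raw, shift):
    """Return softplus(raw + shift) and its gradient with respect to raw."""
    return softplus(raw + shift), sigmoid(raw + shift)


# Compare the shift of -10 named in the abstract with a shift of zero,
# at an assumed near-zero initial raw density value.
for shift in (-10.0, 0.0):
    sigma, grad = density_and_grad(0.0, shift)
    print(f"shift={shift:6.1f}  density={sigma:.6f}  d(density)/d(raw)={grad:.6f}")
```

With a shift of -10 the gradient at a raw value of zero is on the order of 1e-5, against 0.5 with no shift, so under this assumption very little learning signal reaches the density field at initialization.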

148. 【2606.02631】Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Link: https://arxiv.org/abs/2606.02631

Authors: Shenghao Ding

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: modality-specific latent grids, separate modality-specific latent, one-level Haar DWT, paper studies, share a common

Comments: 12 pages, 3 figures

Abstract:This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
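
For readers unfamiliar with the frontend named in the abstract, here is a minimal sketch of a one-level orthonormal Haar DWT/IDWT on a 1-D signal, plus a PSNR helper. It illustrates the standard transform only. The paper's token layout, adapters, and energy-selection rule are not reproduced, and the function names and the peak value of 1.0 are assumptions.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficient lists."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for an exact match."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


x = [0.4, 0.2, 0.5, 0.5]
approx, detail = haar_dwt(x)
full = haar_idwt(approx, detail)                  # keeps every coefficient
coarse = haar_idwt(approx, [0.0] * len(detail))   # drops all detail coefficients
print(psnr(x, full), psnr(x, coarse))
```

Dropping the detail coefficients halves the number of retained values at the cost of reconstruction quality, which is the kind of trade-off the abstract's keep-ratio experiments measure.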