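This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
664 papers were updated today, including:
- Natural language processing: 125 papers
- Information retrieval: 19 papers
- Computer vision: 121 papers
Natural Language Processing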
1. 【2606.12411】Context-Driven Incremental Compression for Multi-Turn Dialogue Generation
Link: https://arxiv.org/abs/2606.12411
Authors: Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Modern conversational agents, incurring redundant attention, conversational agents condition, ever-growing dialogue history, Modern conversational
Comments: Accepted at ICML 2026
Abstract:Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.
2. 【2606.12400】Doc-to-Atom: Learning to Compile and Compose Memory Atoms
Link: https://arxiv.org/abs/2606.12400
Authors: Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Long input sequences, Large Language, Long input, attention makes inference
Comments: 20 pages
Abstract:Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.
3. 【2606.12397】Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Link: https://arxiv.org/abs/2606.12397
Authors: Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: cornerstone component, Router, router row, Manifold Power Iteration, Power Iteration
Comments: Preprint
Abstract:The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into a representative vector, such that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction to impose a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE models across scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
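The "Power-then-Retract" step described in this abstract can be illustrated with textbook power iteration. This is a minimal sketch and not the authors' implementation: it assumes a single expert matrix `W`, assumes the router row `r` lives in the expert's input space, uses the update `r <- W^T W r`, and retracts by rescaling to unit norm. The names `power_then_retract`, `W`, and `r` are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def power_then_retract(W, r):
    """One power-iteration step on W^T W, then retraction to unit norm.

    Assumption: this plain update stands in for the paper's MPI step,
    whose exact form the abstract does not give.
    """
    Wr = matvec(W, r)                   # W r
    r_new = matvec(transpose(W), Wr)    # W^T (W r)
    norm = math.sqrt(sum(x * x for x in r_new))
    return [x / norm for x in r_new]    # retract: impose the norm constraint

# Toy expert whose principal right singular direction is [1, 0].
W = [[3.0, 0.0], [0.0, 1.0]]
r = [0.6, 0.8]
for _ in range(50):
    r = power_then_retract(W, r)
```

After the loop, `r` is numerically aligned with `[1, 0]`, the direction of the largest singular value of the toy matrix.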
4. 【2606.12392】System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5
Link: https://arxiv.org/abs/2606.12392
Authors: Haotao Xie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: achieved promising progress, large language models, classical Chinese translation, Classical Chinese Poetry, classical poetry
Comments:
Abstract:Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality and domain-specific datasets are extremely limited. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.
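The 9.7% figure in this abstract is consistent with a relative gain over the baseline score rather than an absolute difference. The arithmetic, using only the two scores reported above:

```latex
\[
\frac{0.757 - 0.690}{0.690} = \frac{0.067}{0.690} \approx 0.097
\]
```

That is roughly a 9.7% relative improvement, which corresponds to an absolute difference of 0.067 on the benchmark's score scale.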
5. 【2606.12385】Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
Link: https://arxiv.org/abs/2606.12385
Authors: Sanjay Adhikesaven, Haoxiang Sun, Sewon Min
Categories: Computation and Language (cs.CL)
Keywords: guide development decisions, Modern LLM training, filter corpora, judge outputs, LLM training pipelines
Comments:
Abstract:Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
6. 【2606.12373】Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
Link: https://arxiv.org/abs/2606.12373
Authors: Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Reinforcement Learning, Language Models, Large Language, capabilities of Large
Comments:
Abstract:Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
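The composition rule in this abstract (fuse two environments when the codomain of one matches the domain of the other) can be sketched as follows. This is a hedged illustration only: the `Env` class, its fields, the `sequential` function, and the three toy environments are hypothetical and are not taken from RACES. Only a sequential operator is shown.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Env:
    """Hypothetical verifiable environment with typed input and output."""
    name: str
    domain: type                    # input type
    codomain: type                  # output type
    solve: Callable[[Any], Any]     # reference solution used for verification

    def verify(self, x, answer) -> bool:
        return self.solve(x) == answer

def sequential(a: Env, b: Env) -> Env:
    """Fuse a then b; allowed only when a's codomain matches b's domain."""
    if a.codomain is not b.domain:
        raise TypeError(f"cannot compose {a.name} -> {b.name}")
    return Env(f"{a.name}>>{b.name}", a.domain, b.codomain,
               lambda x: b.solve(a.solve(x)))

sort_env = Env("sort", list, list, sorted)
sum_env = Env("sum", list, int, sum)
square_env = Env("square", int, int, lambda n: n * n)

# The result of a composition is itself an Env, so it composes again.
pipeline = sequential(sequential(sort_env, sum_env), square_env)
```

Because `sequential` returns another `Env`, composites can be fed back into the same operator, which is the recursive aspect the abstract emphasizes.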
7. 【2606.12370】Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Link: https://arxiv.org/abs/2606.12370
Authors: Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: modern large language, Reinforcement learning, MTP, large language models, rollout stage remains
Comments:
Abstract:Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
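A short sketch of why a total variation (TV) objective relates to acceptance rate. It assumes the standard speculative-sampling rule, which the abstract does not spell out: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution. Under that assumption the expected single-step acceptance rate is the sum over x of min(p(x), q(x)), which equals 1 - TV(p, q). The function names and toy distributions are hypothetical, and the paper's multi-step loss is not reproduced here.

```python
def acceptance_rate(target, draft):
    """Expected acceptance probability: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))

def total_variation(target, draft):
    """TV distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target, draft))

# Toy 3-token vocabulary; both lists sum to 1.
target = [0.5, 0.3, 0.2]
draft = [0.4, 0.4, 0.2]

rate = acceptance_rate(target, draft)
tv = total_variation(target, draft)
```

For these toy distributions `rate` is about 0.9 and `tv` is about 0.1, so lowering TV between draft and target raises the acceptance rate one for one.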
8. 【2606.12344】Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Link: https://arxiv.org/abs/2606.12344
Authors: Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: clean Docker workspace, autonomous tool users, prediction contract required, clean Docker, General-purpose agents
Comments:
Abstract:General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at this https URL and this https URL.
9. 【2606.12342】ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
Link: https://arxiv.org/abs/2606.12342
Authors: Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Keywords: Domain fine-tuning degrades, harmful prompts framed, fine-tuned specialists readily, specialists readily comply, domain language
Comments:
Abstract:Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.
10. 【2606.12332】Measuring Semantic Progress in Multi-turn Dialogue via Information Gain
Link: https://arxiv.org/abs/2606.12332
Authors: Paul He, Shiva Kasiviswanathan, Dominik Janzing
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Evaluating multi-turn dialogue, Evaluating multi-turn, individual responses, challenging because quality, quality emerges
Comments: Preprint. 26 pages
Abstract:Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
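The properties listed in this abstract (monotonicity, additivity across turns, diminishing returns for redundant evidence) can be seen in a toy linear-Gaussian model. This is a sketch under loud assumptions and not the paper's estimator: each turn is treated as a noisy scalar observation along its embedding `v`, the prior covariance is the identity, and question conditioning is ignored. The per-turn gain is then 0.5 * log(1 + v^T Σ v / σ²), and Σ gets a closed-form rank-one update. The function `information_gain_step` and all values are hypothetical.

```python
import math

def information_gain_step(cov, v, noise_var=1.0):
    """Return (gain, updated covariance) after observing direction v.

    Assumption: observation y = v . theta + noise, theta ~ N(0, cov).
    """
    cov_v = [sum(c * x for c, x in zip(row, v)) for row in cov]   # Sigma v
    quad = sum(x * y for x, y in zip(v, cov_v))                   # v^T Sigma v
    gain = 0.5 * math.log(1.0 + quad / noise_var)
    denom = noise_var + quad
    n = len(v)
    new_cov = [[cov[i][j] - cov_v[i] * cov_v[j] / denom for j in range(n)]
               for i in range(n)]
    return gain, new_cov

cov = [[1.0, 0.0], [0.0, 1.0]]
turns = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # turn 2 repeats turn 1
gains = []
for v in turns:
    g, cov = information_gain_step(cov, v)
    gains.append(g)
```

The repeated second turn yields a smaller gain than the first, while the third turn, which points in a new direction, recovers the full gain. Every gain is non-negative, so the running total never decreases.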
11. 【2606.12295】Findings of the MAGMaR 2026 Shared Task
Link: https://arxiv.org/abs/2606.12295
Authors: Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Multimodal Augmented Generation, Multimodal Augmented, overview paper presents, Multimodal Retrieval, Augmented Generation
Comments: Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); resources at https://github.com/rekriz11/MAGMAR_2026
Abstract:This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.
12. 【2606.12291】Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
Link: https://arxiv.org/abs/2606.12291
Authors: Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton
Categories: Computation and Language (cs.CL)
Keywords: Large language models, high scores imply, scores imply safe, Large language, medical licensing exams
Comments:
Abstract:Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
13. 【2606.12273】Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
Link: https://arxiv.org/abs/2606.12273
Authors: Jia Deng, Junyi Li, Wayne Xin Zhao, Jinpeng Wang, Hongyu Lu, Ji-Rong Wen
Categories: Computation and Language (cs.CL)
Keywords: Diffusion large language, large language models, random masking strategies, overlook intrinsic token, Diffusion large
Comments: 13 pages. Accepted to ACL 2026 Main Conference
Abstract:Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.
14. 【2606.12250】Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
Link: https://arxiv.org/abs/2606.12250
Authors: Antoni Lasik, Jakub Pokrywka, Łukasz Grzybowski, Jeremi Ignacy Kaczmarek, Gabriela Korzańska, Janusz Świeczkowski-Feiz, Oskar Pastuszek, Paulina Hoffman, Jakub Tomasz Dąbrowski, Wojciech Kusa
Categories: Computation and Language (cs.CL)
Keywords: Large language models, overestimate real clinical, real clinical ability, clinical ability due, multiple-choice question answering
Comments: 26 pages total with references and appendix, preprint
Abstract:Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
15. 【2606.12247】Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
Link: https://arxiv.org/abs/2606.12247
Authors: Andrés Abeliuk, Cinthia Sanchez Macias, Valentina Alarcón, Álvaro Madariaga, Claudia Lopez
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords: evaluate demographic groups, external subjects, predominantly focused, focused on third-person, evaluate demographic
Comments:
Abstract:Research on bias in large language models (LLMs) has predominantly focused on third-person audits, which study how models represent or evaluate demographic groups as external subjects. However, this paradigm overlooks a structural blind spot because the user is absent from the audit. In practice, LLMs are used in open-ended, personal interactions, during which the model implicitly represents the user and adjusts its responses accordingly. When identical requests yield different responses depending on who is asking, bias manifests not in how the model describes others but in how it treats its interlocutor. We propose Situated Interaction Auditing (SIA), a user-centered framework for studying how user profile signals -- implicit sociodemographic markers, writing style, and stated identity -- systematically shape LLM response quality, content, and tone. We demonstrate the framework through a case study that intersects gender and socioeconomic status signals across multiple task domains and outline a research agenda for SIA as a new mission for natural language processing.
16. 【2606.12243】VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
Link: https://arxiv.org/abs/2606.12243
Authors: Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: lightweight drafters generate, drafters generate candidates, high inference costs, addresses the high, validate in parallel
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: this https URL
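The three-tier handling of draft tokens described in this abstract can be pictured as a threshold router. This is a deliberately simplified sketch: the function name, the tier labels, and both threshold values are hypothetical, and the abstract does not say how confidence is computed or where the cut-offs lie.

```python
def route_draft_token(confidence, accept_threshold=0.9, slim_threshold=0.5):
    """Pick a verification tier for one draft token.

    Assumption: `confidence` is a score in [0, 1]; thresholds are made up.
    """
    if confidence >= accept_threshold:
        return "accept"            # high confidence: accept directly
    if confidence >= slim_threshold:
        return "slim_verifier"     # medium confidence: routed slim submodel
    return "full_verifier"         # uncertain: full-model verification

# Example: count how many tokens would need the expensive full verifier.
confidences = [0.97, 0.72, 0.31, 0.91, 0.55]
tiers = [route_draft_token(c) for c in confidences]
full_calls = tiers.count("full_verifier")
```

In this toy batch only one of the five tokens reaches the full verifier, which is the kind of saving the multi-tier design aims for.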
17. 【2606.12234】On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study
Link: https://arxiv.org/abs/2606.12234
Authors: Iuri Macocco, Pau Rodríguez, Arno Blaas, Luca Zappella, Marco Baroni, Xavier Suau
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, trade-offs remains elusive, Large Language, involved trade-offs remains, Controlling the output
Comments: 8 pages, 2 figures
Abstract:Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the involved trade-offs remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are not as good at concept removal. Finally, cheaply computed textual metrics highly correlate to costly LLM-as-judge scores, and provide insights on the behavior of conditioning methods.
18. 【2606.12210】Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI
Link: https://arxiv.org/abs/2606.12210
Authors: Ali M Karaoglu, Shreyank N Gowda
Categories: Computation and Language (cs.CL)
Keywords: predict short-term stock, short-term stock movements, natural language, zero-shot natural language, reliably predict short-term
Comments:
Abstract:Can financial news reliably predict short-term stock movements? Despite advances in large language models, this question remains unresolved. We revisit this problem using a zero-shot natural language processing framework, investigating whether models can extract actionable signals from financial news without domain-specific training. We design a structured pipeline that combines zero-shot natural language inference with temporal aggregation, explicitly modelling recency and event-dependent impact horizons when integrating information across articles. To address the need for transparency in high-stakes settings, we introduce a multi-layered explainability framework that links predictions to token-level, article-level, and aggregate evidence, and produces grounded natural language rationales. Across multiple models and prediction horizons, we find that zero-shot approaches consistently fail to outperform simple baselines, with particularly weak performance on negative movements, suggesting deeper structural limitations in mapping news sentiment to short-term price dynamics. However, explainability signals reliably distinguish between trustworthy and unreliable predictions, offering practical value even when accuracy is limited. These findings highlight the limits of zero-shot financial NLP and motivate a shift toward decision-support systems that prioritise transparency and uncertainty awareness. Code: this https URL
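One way to picture the temporal aggregation this abstract mentions (recency plus event-dependent impact horizons) is an exponentially decayed weighted average with a separate half-life per event type. This is an assumption made for illustration: the abstract does not state the weighting scheme, and the function name, event types, half-lives, and scores are all hypothetical.

```python
def aggregate_signal(articles, half_life_days):
    """Recency-weighted mean of per-article scores.

    articles: list of (score, age_in_days, event_type) tuples.
    half_life_days: dict mapping event_type to its assumed half-life.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for score, age, event_type in articles:
        weight = 0.5 ** (age / half_life_days[event_type])  # exponential decay
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0

half_lives = {"earnings": 5.0, "rumor": 1.0}
articles = [
    (+1.0, 0.0, "earnings"),   # fresh, long-horizon event: weight 1.0
    (-1.0, 2.0, "rumor"),      # two half-lives old: weight 0.25
]
signal = aggregate_signal(articles, half_lives)
```

With these toy inputs the aggregate is (1.0 - 0.25) / 1.25 = 0.6, so the fresh earnings item outweighs the stale rumor.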
19. 【2606.12203】Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models
Link: https://arxiv.org/abs/2606.12203
Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Runzhong Qiao, Xuancheng Li, Min Zhang, Yiqun Liu
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, tackle complex tasks, tackle complex, language models
Abstract:Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at this https URL.
20. 【2606.12191】Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
Link: https://arxiv.org/abs/2606.12191
Authors: Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language model, language model, serve as interactive, interactive systems, systems for large
Comments: 63 pages, 10 figures
Abstract:Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current research on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation, and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced: symbolic synthesis and neural synthesis. The paper also presents the environment evaluation methods used in each paradigm. Third, the corresponding environment applications are discussed from the perspective of agent-environment co-evolution. In particular, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. Three paradigms of environment evolution are also identified, namely neural-driven, difficulty-driven, and scaling-driven approaches. Finally, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.
21. 【2606.12186】A Resource for Enthymeme Detection in Controversial Political Discourse
Link: https://arxiv.org/abs/2606.12186
Authors: Martial Pastor, Nelleke Oostdijk
Categories: Computation and Language (cs.CL)
Keywords: annotation remains notoriously, remains notoriously subjective, premises or conclusions, unstated premises, pervasive in persuasive
Comments: 43 pages, to be submitted to the Language Resource and Evaluation Journal
Abstract:Enthymemes, arguments with unstated premises or conclusions, are pervasive in persuasive discourse, yet their annotation remains notoriously subjective. We present a resource of 1,482 tweets from politically controversial discourse, annotated by five annotators for the presence of enthymemes and their argument structure, designed to study label variation. We first revisit the definition of enthymemes and propose annotation guidelines anchored in Walton's argumentation schemes, offering a structured and constrained approach that nonetheless preserves room for the interpretive nature of the task. This contrasts with past resources, which tend to eliminate disagreement, obscuring its sources and preventing investigation of its potential benefits for model performance. We further propose a complexity analysis of the task, identifying where annotation imposes high cognitive load and may give rise to inconsistent annotation. Our preliminary experiments show that models trained on annotator disagreement outperform models trained on hard majority-vote labels. We close by reflecting on how structural openness in enthymeme definitions and guidelines enables the study of variation in subjective inferential processes for future resources and downstream NLP applications concerned with human inference.
22. 【2606.12169】OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models
Link: https://arxiv.org/abs/2606.12169
Authors: Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: correct final answers, High-stakes clinical, large vision-language models, final answers, large vision-language
Comments: 42 pages, 9 figures, 24 tables. Dataset and code: [this https URL](https://huggingface.co/datasets/neginb/OpenMedReason)
Abstract:High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.
23. 【2606.12160】A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
Link: https://arxiv.org/abs/2606.12160
Authors: Ao Sun
Categories: Computation and Language (cs.CL)
Keywords: supervised framework, analyzing internal logits, Classifier, introduce CHAIR, CHAIR
Abstract:In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing internal logits from each layer of every token. Our method extracts a compact set of features (such as maximum, minimum, mean, standard deviation, and slope) from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.
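The five summary statistics named in this abstract can be computed as in the rough sketch below. It assumes that each token is represented by one logit value per layer (for example, the logit of the generated token read out at every layer); that choice, the function name `layer_features`, and the least-squares definition of slope are assumptions, and the downstream classifier is omitted.

```python
import statistics


def layer_features(values):
    """Summarize one token's per-layer logit trajectory with five statistics."""
    n = len(values)
    if n < 2:
        raise ValueError("need logits from at least two layers")
    mean = statistics.fmean(values)
    # Least-squares slope of the logit against the layer index 0..n-1.
    x_mean = (n - 1) / 2
    denom = sum((i - x_mean) ** 2 for i in range(n))
    slope = sum((i - x_mean) * (v - mean) for i, v in enumerate(values)) / denom
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "std": statistics.pstdev(values),  # population standard deviation
        "slope": slope,
    }


# A logit that rises by exactly 1.0 per layer has slope 1.0.
feats = layer_features([0.0, 1.0, 2.0, 3.0])
print(feats)
```

Reducing every trajectory to a handful of numbers keeps the classifier input small regardless of model depth, which is consistent with the abstract's point about avoiding overfitting.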
24. 【2606.12138】Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Link: https://arxiv.org/abs/2606.12138
Authors: Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: neural network representations, interpret neural network, Sparse autoencoders, network representations, training runs
Abstract:Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through \emph{feature stability}: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
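One way to picture the per-feature stability estimate is the sketch below. It assumes that "similar" means cosine similarity between feature directions above a fixed threshold; the threshold of 0.8, the function names, and the toy two-dimensional dictionaries are illustrative assumptions, not the paper's stated criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def stability(feature, other_runs, threshold=0.8):
    """Fraction of independently trained dictionaries that contain a
    feature whose cosine similarity to `feature` is at least `threshold`."""
    hits = sum(
        1 for run in other_runs
        if any(cosine(feature, other) >= threshold for other in run)
    )
    return hits / len(other_runs)


runs = [
    [[1.0, 0.0], [0.0, 1.0]],    # this seed contains the same direction
    [[0.6, 0.8], [-0.8, 0.6]],   # this seed learned a rotated basis
]
score = stability([1.0, 0.0], runs)
print(score)
```

The second toy dictionary spans the same plane but with a rotated basis, so the individual feature is not matched even though the subspace is shared, which mirrors the basis-ambiguity explanation in the abstract.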
25. 【2606.12117】Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation
Link: https://arxiv.org/abs/2606.12117
Authors: Selen Erkan, Bastian Boll, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: specific formatting requirements, follow specific formatting, large language model, formatting requirements, misrepresent a large
Comments: 10 pages, 4 figures
Abstract:Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability -- typically introduced in post-training -- to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic model evaluation. By optimizing only 10 soft-prompt vectors (roughly 0.0006% of the parameters for a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models -- trained with various pre-training recipes -- on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples), making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, (c) even post-trained models can benefit from soft-prompts to maximize format compliance, and (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics which disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol of LLM knowledge, and (3) a cost- and memory-effective recipe to identify optimal pre-training strategies early in LLM development.
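The parameter fraction quoted in this abstract can be sanity-checked with a few lines of arithmetic. The hidden size of 4096 is an assumption (a common embedding width for 7B models, not stated in the abstract), and "7B" is taken as exactly 7 billion parameters.

```python
n_vectors = 10        # soft-prompt vectors, from the abstract
hidden_size = 4096    # ASSUMED embedding width of a typical 7B model
total_params = 7e9    # "7B" taken as exactly 7 billion

trainable = n_vectors * hidden_size
fraction_percent = 100 * trainable / total_params

steps, samples = 80, 640          # from the abstract
batch_size = samples // steps     # implied samples per step

print(f"{trainable} trainable parameters = {fraction_percent:.4f}% of the model")
print(f"implied batch size: {batch_size}")
```

Under these assumptions the soft prompt has 40,960 trainable values, which rounds to the 0.0006% quoted above, and the step and sample counts imply 8 samples per step.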
26. 【2606.12114】Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models
Link: https://arxiv.org/abs/2606.12114
Authors: Rei Minamoto, Yusuke Oda, Daisuke Kawahara
Categories: Computation and Language (cs.CL)
Keywords: Sensitive personal information, personal information, Sensitive personal, large-scale pre-training corpora, large-scale pre-training
Abstract:Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and prevent unintended information leakage. However, in contrast to English and other languages, research into sensitive personal information has been limited in the Japanese language. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan's Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. As a result, our SCPI classifier can effectively identify information related to SCPI. This study is the first to explore SCPI detection in Japanese text corpora, highlighting the challenges of accurate detection.
27. 【2606.12113】Augmenting Molecular Language Models with Local $n$-gram Memory
Link: https://arxiv.org/abs/2606.12113
Authors: Xinni Zhang, Zijing Liu, He Cao, Yu Li, Irwin King
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: chemically meaningful motifs, Transformer-based language models, SMILES strings suffer, character-level tokenization fragments, tokenization fragments chemically
Abstract:Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks, including unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
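The hash-lookup idea can be sketched roughly as follows. The use of `zlib.crc32` as the hash, the table size of 1024, the trigram window, and the function name `ngram_slots` are all illustrative assumptions; the abstract does not describe MolGram's actual hashing scheme or how the retrieved embeddings are injected into hidden states.

```python
import zlib


def ngram_slots(smiles, n=3, table_size=1024):
    """Map every character n-gram of a SMILES string to a memory slot index.

    Each slot index would select one row of a learned embedding table;
    the table itself and the injection into hidden states are omitted.
    """
    grams = [smiles[i:i + n] for i in range(len(smiles) - n + 1)]
    # crc32 is a deterministic stand-in for whatever hash a real model uses.
    return [(g, zlib.crc32(g.encode("utf-8")) % table_size) for g in grams]


slots = ngram_slots("CC(=O)O")  # acetic acid
for gram, slot in slots:
    print(gram, slot)
```

Hashing keeps the memory a fixed size no matter how many distinct n-grams occur, at the cost of occasional collisions, and it leaves the base tokenizer untouched because the lookup runs on the raw string.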
28. 【2606.12088】Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles
Link: https://arxiv.org/abs/2606.12088
Authors: Shun Shao, Zheng Zhao, Anna Korhonen, Yftah Ziser, Shay B. Cohen
Categories: Computation and Language (cs.CL)
Keywords: NLP assumes direct, NLP assumes, assumes direct access, direct access, NLP
Comments: 23 pages, 5 figures, 12 tables. The paper is currently under review
Abstract:Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.
29. 【2606.12087】FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
Link: https://arxiv.org/abs/2606.12087
Authors: Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen
Categories: Computation and Language (cs.CL)
Keywords: answers remain unavailable, agents requires verifiable, requires verifiable questions, requires verifiable, remain unavailable
Comments: 30 pages
Abstract:Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at this https URL.
30. 【2606.12068】StanceNakba Shared Task: Actor and Topic-Aware Stance Detection in Public Discourse
Link: https://arxiv.org/abs/2606.12068
Authors: Kholoud K. Aldous, Md Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, Kais Attia, Wajdi Zaghouani
Categories: Computation and Language (cs.CL)
Keywords: media discourse related, stance detection, polarized social media, social media discourse, social media posts
Comments: 11 Pages, 6 Tables
Abstract:We present StanceNakba 2026, a shared task on stance detection in polarized social media discourse related to the Palestinian-Israeli conflict, organized as part of Nakba-NLP 2026 at LREC-COLING 2026. The task introduces two subtasks: Subtask A (Actor-Level Stance Detection), which classifies English social media posts as Pro-Palestine, Pro-Israel, or Neutral; and Subtask B (Cross-Topic Stance Detection), which identifies Favor, Against, or Neither stances in Arabic posts toward two conflict-related topics, normalization with Israel and refugee presence in Jordan. The task is grounded in an annotated dataset of 2,606 social media posts. A total of 7 teams participated in Subtask A and 6 teams in Subtask B. Participating systems primarily fine-tuned Arabic and multilingual transformer-based models, including MARBERT, AraBERT, and DeBERTa-v3 variants, with several teams employing cross-validation, ensemble methods, and topic-conditioned architectures. The best-performing systems achieved a Macro F1 of 0.9620 on Subtask A and 0.8724 on Subtask B, demonstrating that transformer-based approaches are highly effective for conflict-domain stance detection while highlighting persistent challenges in cross-topic generalization and neutral class prediction.
31. 【2606.12032】Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)
Link: https://arxiv.org/abs/2606.12032
Authors: Sam Mao
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: research treats self-preservation, alignment research treats, research treats, instrumental nuisance, treats self-preservation
Comments: 36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request
Abstract:Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation -- Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition -- the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p < 0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.
32. 【2606.12003】Agreement in Representation Space for Open-Ended Self-Consistency
Link: https://arxiv.org/abs/2606.12003
Authors: Paula Ontalvilla, Gorka Azkune, Aitor Ormazabal
Categories: Computation and Language (cs.CL)
Keywords: existing formulations largely, formulations largely rely, Self-consistency improves LLM, sampling multiple outputs, improves LLM reasoning
Abstract:Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs. In this work, we study self-consistency in open-ended generation tasks such as code synthesis and text summarization. We hypothesize that consistency can be understood as a geometric property of the generation space, where semantically compatible generations concentrate in similar regions of representation space. To study this hypothesis, we introduce Embedding-Based Agreement (EBA), a simple training-free operationalization that estimates agreement by clustering sampled generations in embedding space. Through experiments on mathematical reasoning, code generation, and summarization, we show that agreement in representation space provides a robust and scalable signal of self-consistency for open-ended tasks. In particular, EBA consistently outperforms random selection and exhibits more stable scaling behavior than recent selection approaches based on LLM evaluation or uncertainty estimation. We further show that these agreement signals remain stable across model families and embedding spaces, even with native hidden representations. Finally, our analysis shows that the geometric location occupied by sampled generations is strongly correlated with generation quality: generations concentrated near central regions of representation space tend to correspond to more reliable outputs, whereas peripheral generations are substantially less accurate. Overall, our findings support viewing self-consistency as a property of the geometric organization of sampled generations rather than exact symbolic overlap.
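The intuition that central generations tend to be more reliable can be illustrated with a simplified stand-in for EBA. Instead of clustering, this sketch picks the sample with the highest mean cosine similarity to the others (a medoid-style rule); that substitution, the function names, and the toy two-dimensional embeddings are assumptions made for illustration, not the paper's procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_central(embeddings):
    """Return the index of the sample with the highest mean cosine
    similarity to all other samples."""
    if len(embeddings) < 2:
        return 0  # nothing to compare against
    best_idx, best_score = 0, -math.inf
    for i, u in enumerate(embeddings):
        others = [cosine(u, v) for j, v in enumerate(embeddings) if j != i]
        score = sum(others) / len(others)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Three generations that roughly agree and one outlier (the last vector).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
choice = most_central(embeddings)
print(choice)
```

In practice the vectors would come from a sentence-embedding model applied to each sampled generation; the selection rule needs no exact string matching, which is what makes it applicable to open-ended outputs such as code or summaries.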
33. 【2606.11953】Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos
Link: https://arxiv.org/abs/2606.11953
Authors: Junyu Lu, Deyi Ji, Liqun Liu, Xiaokun Zhang, Youlin Wu, Roy Ka-Wei Lee, Peng Shu, Huan Yu, Jie Jiang, Bo Xu, Liang Yang, Hongfei Lin
Categories: Computation and Language (cs.CL)
Keywords: hateful video detection, explainable hateful video, online platforms, highlighting an urgent, provide contextual rationales
Abstract:Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal the implicit meanings behind these judgments, significantly undermining model explainability. To fill this gap, we aim to achieve explainable hateful video detection, enabling models to provide contextual rationales that integrate relevant evidence and logical reasoning alongside decisions. This approach can comprehensively enhance the understanding of video content and the explainability of the decision-making process. We first introduce two datasets, Ex-HateMM and Ex-ImpliHateVid, for explainable hateful video detection. Each dataset provides fine-grained annotations of multimodal harmful elements, along with contextual rationales. We then propose an Information Augmentation and Reasoning Enhancement (IARE) framework designed for explainable detection. The framework employs an information augmentation phase that leverages the multimodal chain-of-thought to integrate harmful elements, thereby enriching rationale evidence. Additionally, IARE incorporates a reasoning enhancement phase, in which Direct Preference Optimization guides the model toward correct reasoning paths and away from incorrect ones, thereby improving the logical coherence of its justifications. We conduct extensive experiments on the two datasets, comparing multiple baselines with our proposed IARE framework. The results demonstrate that IARE achieves state-of-the-art performance while also generating accurate rationales.
34. 【2606.11945】uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking
Link: https://arxiv.org/abs/2606.11945
Authors: Simon Lupart, Kidist Amde Mekonnen, Zahra Abbasiantaeb, Mohammad Aliannejadi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: question answering, report describes, describes our participation, task evaluates conversational, retrieval
Comments: SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables
Abstract:This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.
35. 【2606.11931】Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model
Link: https://arxiv.org/abs/2606.11931
Authors: Meherun Farzana, Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan
Categories: Computation and Language (cs.CL)
Keywords: educational NLP research, NLP research, widely spoken languages, educational NLP, world most widely
Comments: 10 pages, 5 figures, 2 tables. Preprint
Abstract:Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.
36. 【2606.11926】Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Link: https://arxiv.org/abs/2606.11926
Authors: Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Scientific progress depends, Scientific progress, progress depends, Hypothesis Tree Refinement, Scientific
Abstract:Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
37. 【2606.11910】An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination
Link: https://arxiv.org/abs/2606.11910
Authors: Xu Li,Shuqi Tian,Xun Han,Kuncheng Zhao,Xinyi Li
Categories: Computation and Language (cs.CL)
Keywords: assigning legal penalties, requiring the simultaneous, critical for assigning, simultaneous identification, interdependent statutory
Notes: Submitted to ICONIP. 15 pages, 3 figures
Abstract:Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented generation methods suffer from a multi-dimensional retrieval bottleneck: single axis architectures compress complex legal queries into a single pathway, causing interdependent statutory dimensions to be overlooked. To address this, we propose OMAGR, an ontology-guided framework that decomposes queries into ontology-aligned anchors and executes parallel graph retrieval across each dimension, ensuring independent retrieval across dimensions before fusion. To evaluate the proposed method, we created the TrafficLaw-QA dataset, an expert-validated benchmark dataset containing 200 questions and 527 legal provisions. Results show that TrafficOmni-RAG outperforms baselines on Context Precision and Faithfulness metrics. The findings demonstrate that parallel multi-anchor retrieval effectively resolves the multi-dimensional retrieval bottleneck, offering a promising direction for traffic law liability determination research.
38. 【2606.11906】When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models
Link: https://arxiv.org/abs/2606.11906
Authors: Xuan Dong,Zhe Han,Tianhao Niu,Qingfu Zhu,Wanxiang Che
Categories: Computation and Language (cs.CL)
Keywords: language-conditioned robotic manipulation, remains poorly understood, variation remains poorly, robotic manipulation, poorly understood
Notes: Accepted to ACL 2026 Main Conference
Abstract:Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.
39. 【2606.11898】GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs
Link: https://arxiv.org/abs/2606.11898
Authors: Hengyi Feng,Zeang Sheng,Meiyi Qiang,Wentao Zhang
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: gained significant attention, significant attention recently, attention recently due, Research on Text-Attributed, Large Language Models
Notes:
Abstract:Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce the GraspLLM, a framework that combines Graph structural comprehension with semantic understanding prowess of LLMs to enhance the cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.
40. 【2606.11897】Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills
Link: https://arxiv.org/abs/2606.11897
Authors: Shi Liu,Jiayao Chen,Chengwei Qin,Yanqing Hu,Jufan Zhang,Linyi Yang
Categories: Computation and Language (cs.CL)
Keywords: plan follow-up experiments, researchers record observations, Scientific discovery workflows, follow-up experiments, interpret uncertain results
Notes: 28 pages, preprint
Abstract:Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.
41. 【2606.11893】Beyond representational alignment with brain-guided language models for robust reasoning
Link: https://arxiv.org/abs/2606.11893
Authors: Mingqing Xiao,Kai Du,Zhouchen Lin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Keywords: remains insufficiently characterized, higher-order cognition remains, cognition remains insufficiently, mechanisms underlying human, neural mechanisms underlying
Notes:
Abstract:The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.
42. 【2606.11875】I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System
Link: https://arxiv.org/abs/2606.11875
Authors: Zi Haur Pang,Yahui Fu,Koji Inoue,Tatsuya Kawahara
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Keywords: feelings make sense, user feelings make, explicitly acknowledging, make sense, Emotional validation
Notes: This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2026 (SIGDIAL 2026)
Abstract:Emotional validation - explicitly acknowledging that a user's feelings make sense - has proven therapeutic value but has received little computational attention. Emotional validation in dialogue systems can be decomposed into (i) validating response identification, (ii) validation timing detection, and (iii) validating response generation. To support research on all three subtasks, we release M-EDESConv, a 120k English-Japanese multilingual corpus created through hybrid manual and automatic annotation, and M-TESC, a multilingual spoken-dialogue test set. For timing detection, we propose MEGUMI, a Multilingual Emotion-aware Gated Unit for Mutual Integration, that fuses frozen XLM-RoBERTa semantics with language-specific emotion encoders via cross-modal attention and gated fusion. MEGUMI shows superior performance on both the M-EDESConv and M-TESC datasets, both objectively and subjectively. Finally, our EmoValidBench benchmarks of GPT-4.1 Nano and Llama-3.1 8B indicate that current LLMs generate contextually similar and diverse validating responses, but emotional understanding remains a major area for improvement. Project page: this https URL
43. 【2606.11854】Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
Link: https://arxiv.org/abs/2606.11854
Authors: Michal Chudoba,Sergey Alyaev,Petra Galuscakova,Tomasz Wiktorski
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language Model, Soft Prompting introduces, main Parameter-Efficient Fine-Tuning, Multimodal Large Language
Notes:
Abstract:There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
44. 【2606.11817】Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
Link: https://arxiv.org/abs/2606.11817
Authors: Yitong Zhang,Shiteng Lu,Jia Li
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: Large Language Models, Large Language, raising concerns, misused to produce, Large
Notes:
Abstract:Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
45. 【2606.11816】WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning
Link: https://arxiv.org/abs/2606.11816
Authors: Yizhou Chi,Eric Chamoun,Zifeng Ding,Andreas Vlachos
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: time-bounded information, uncertainty from incomplete, reason under uncertainty, requires language-model agents, real-world events requires
Notes:
Abstract:Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
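As a rough illustration of the task setup this abstract describes, the sketch below restricts an evidence pool to items published before a simulated forecast date and then scores a submitted probability against the resolved outcome. The Brier score is a standard scoring rule assumed here for illustration only; the abstract does not say which rule WorldReasoner uses. The names `Evidence`, `visible_evidence`, and `brier_score` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """A time-stamped article (hypothetical structure, for illustration)."""
    title: str
    published: date


def visible_evidence(pool, forecast_date):
    """Keep only evidence published strictly before the simulated forecast date."""
    return [e for e in pool if e.published < forecast_date]


def brier_score(probability, outcome):
    """Squared error between a forecast probability and a 0/1 outcome (lower is better)."""
    return (probability - float(outcome)) ** 2


pool = [
    Evidence("Early report", date(2025, 1, 10)),
    Evidence("Post-resolution recap", date(2025, 3, 2)),
]
usable = visible_evidence(pool, forecast_date=date(2025, 2, 1))
print([e.title for e in usable])
print(brier_score(0.8, outcome=True))
```

The date filter is what keeps post-resolution articles out of the agent's view, so a correct forecast cannot come from evidence that leaked the answer.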
46. 【2606.11806】External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs
Link: https://arxiv.org/abs/2606.11806
Authors: Lin Sun,Heming Zhang,Xiangzheng Zhang
Categories: Computation and Language (cs.CL)
Keywords: LLM systems accumulate, Production LLM systems, systems accumulate reusable, accumulate reusable operational, practical deployment issue
Notes:
Abstract:Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-K, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.
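To make the contrast between global prompt injection and retrieval-based selective injection concrete, here is a minimal sketch. It is not the authors' system: the token-overlap `jaccard` scorer stands in for whatever retriever a production deployment would use, and the function names, the `min_score` floor, and the sample experiences are all hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (a stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def global_injection(query, experiences):
    """Unconditional injection: every stored experience is prepended to every query."""
    return "\n".join(experiences + [query])


def selective_injection(query, experiences, top_k=1, min_score=0.1):
    """Retrieval-based injection: prepend only the top-k experiences above a score floor."""
    ranked = sorted(experiences, key=lambda e: jaccard(query, e), reverse=True)
    chosen = [e for e in ranked[:top_k] if jaccard(query, e) >= min_score]
    return "\n".join(chosen + [query])


experiences = [
    "spam links in user comment should be flagged",
    "medical advice posts need a disclaimer check",
    "refund requests go to the billing queue",
]
query = "user comment contains spam links"

full = global_injection(query, experiences)
picked = selective_injection(query, experiences)
# Whitespace token count as a crude proxy for prompt cost.
print(len(full.split()), len(picked.split()))
```

The score floor matters as much as `top_k`: when nothing relevant is stored, the selective path injects nothing, which is one simple way to read the abstract's point that retrieval quality outweighs a larger Top-K.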
47. 【2606.11792】MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models
Link: https://arxiv.org/abs/2606.11792
Authors: Yuansheng Gao,Wenbin Xing,Jiahao Yuan,Kaiwen Zhou,Han Bao,Zonghui Wang,Wenzhi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Video Large Multimodal, Large Multimodal Models, Large Multimodal, achieved remarkable progress, Video Large
Notes: Preprint
Abstract:Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
48. 【2606.11786】Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay
Link: https://arxiv.org/abs/2606.11786
Authors: Joanito Agili Lopo,Yunita Sari,Guntur Budi Herwanto
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, experience performance degradation, Large Language, handling low-resource languages, Kupang Malay
Notes: This paper is the result of the Master Thesis in Master of Artificial Intelligence at Universitas Gadjah Mada
Abstract:Large Language Models (LLMs) offer new potential for translation tasks but often experience performance degradation when handling low-resource languages. To address this limitation, we propose an approach for fine-tuning LLMs on a low-resource language, Kupang Malay. Our approach involves designing a set of instructions by leveraging explicit lexical and semantic features from a bilingual dictionary, and introducing Continual Instruction Tuning (CIT), a training paradigm that enables iterative instruction-based training. Experimental results demonstrate that our model, named Lius, yields notable improvements over standard instruction-tuned models by outperforming 4-6 points, and surpassing both Neural Machine Translation (NMT) and Multilingual LLM models by 10-13 points on several evaluation metrics. These findings highlight the potential of our approach to mitigate the reliance on large-scale parallel data in low-resource language translation.
49. 【2606.11762】Automated Creativity Evaluation of Language Models Across Open-Ended Tasks
Link: https://arxiv.org/abs/2606.11762
Authors: Min Sen Tan,Zachary Kit Chun Choy,Syed Ali Redha Alsagoff,Nadya Yuki Wangsajaya,Mohor Banerjee,Swaagat Bikash Saikia,Alvin Chan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, sparking growing interest, Large language, achieved remarkable progress, language understanding
Notes: Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: https://github.com/tanminsen/creativity-eval
Abstract:Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
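The abstract names semantic entropy as its divergent-creativity metric. One common reading, assumed here, is Shannon entropy over the share of sampled responses that fall into each meaning cluster. The sketch below takes the cluster labels as given; how the paper clusters responses is not stated in the abstract, and the labels below are invented for illustration.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Shannon entropy (in bits) of the distribution of responses over meaning clusters."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical cluster assignments for eight sampled responses to one prompt.
repetitive = ["tape", "tape", "tape", "tape", "tape", "tape", "tape", "rope"]
varied = ["tape", "rope", "magnet", "lever", "tape", "rope", "magnet", "lever"]

print(semantic_entropy(repetitive))
print(semantic_entropy(varied))
```

Because the score depends only on how responses spread across clusters, it needs no reference answers, which matches the abstract's description of the metric as reference-free.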
50. 【2606.11744】Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
Link: https://arxiv.org/abs/2606.11744
Authors: Sidney Tio,Arunesh Sinha,Pradeep Varakantham
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, typically unstructured chats, Large language, typically unstructured, unstructured chats
Notes: 10 Main Body Pages, with Appendices
Abstract:Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue: on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
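The sequencing decision described in this abstract can be pictured with a small prerequisite graph. The sketch below computes which subtopics are currently teachable and then picks one together with a turn budget. The paper uses a learned PPO policy for that choice; the fixed heuristic in `pick_next` is only a placeholder, and the graph contents and function names are hypothetical.

```python
def teachable_nodes(prerequisites, mastered):
    """Return unmastered subtopics whose prerequisites have all been mastered."""
    return sorted(
        topic
        for topic, needs in prerequisites.items()
        if topic not in mastered and all(n in mastered for n in needs)
    )


def pick_next(prerequisites, mastered, turn_budget=3):
    """Placeholder policy: first teachable node and a fixed number of dialogue turns.

    The paper trains a PPO policy for this decision; this heuristic only
    shows the shape of the decision, not the learned behavior.
    """
    frontier = teachable_nodes(prerequisites, mastered)
    return (frontier[0], turn_budget) if frontier else None


# Hypothetical prerequisite graph: subtopic -> subtopics it depends on.
graph = {
    "fractions": [],
    "ratios": ["fractions"],
    "percentages": ["fractions"],
    "compound interest": ["percentages", "ratios"],
}

print(teachable_nodes(graph, mastered=set()))
print(pick_next(graph, mastered={"fractions"}))
```

Keeping the graph logic separate from the dialogue model mirrors the split of responsibilities the abstract argues for: the policy decides where to go, and the LLM handles the conversation at that node.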
51. 【2606.11740】UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA
Link: https://arxiv.org/abs/2606.11740
Authors: Mengzhuo Chen,Yan Shu,Chi Liu,Hongming Piao,Xidong Wang,Derek Li,Bryan Dai
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: medical VQA, interleaved textual reasoning, reasoning, input types, types are aligned
Notes:
Abstract:We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.
52. 【2606.11722】ICA Lens: Interpreting Language Models Without Training Another Dictionary
Link: https://arxiv.org/abs/2606.11722
Authors: Sida Liu,Feijiang Han
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: controlling model behavior, model behavior, critical for understanding, understanding and controlling, controlling model
Notes: Ongoing Project
Abstract:Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
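The intuition stated in this abstract, that token-selective directions look less Gaussian than random ones, can be checked on synthetic data. The sketch below uses excess kurtosis, a classical non-Gaussianity measure, on two simulated projections. It is not the ICALens pipeline: the data is generated rather than taken from model activations, and the 2% activation rate is an arbitrary illustrative choice.

```python
import random


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; roughly 0 for Gaussian data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2) - 3.0


rng = random.Random(0)
n_tokens = 5000

# Projection onto a "random" direction: modeled here as Gaussian noise.
dense = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]

# Projection onto a token-selective direction: near zero on most tokens,
# large on a small fraction (2% here, an arbitrary illustrative rate).
selective = [
    rng.gauss(5.0, 1.0) if rng.random() < 0.02 else rng.gauss(0.0, 0.1)
    for _ in range(n_tokens)
]

print(excess_kurtosis(dense))
print(excess_kurtosis(selective))
```

The selective projection scores far above the Gaussian one, which is the kind of contrast that ICA algorithms exploit when they search for maximally non-Gaussian directions.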
53. 【2606.11712】Substrate Asymmetry in User-Side Memory: A Diagnostic Framework
Link: https://arxiv.org/abs/2606.11712
Authors: Youwang Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: User-side memory, output more user-aware, LLMs is typically, typically scored, user history
Notes: Preprint. Code: https://github.com/EpistemicaLab/substrate-asymmetry-memory
Abstract:User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to =0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.
54. 【2606.11709】RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation
Link: https://arxiv.org/abs/2606.11709
Authors: Leyi Pan,Shuchang Tao,Yunpeng Zhai,Lingzhe Zhang,Zhaoyang Liu,Bolin Ding,Aiwei Liu,Lijie Wen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: token-level supervision, privileged context, typically a verified, verified solution, On-policy self-distillation
Comments: 20 pages, 9 figures, 9 tables
Click to view abstract
Abstract:On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.
55. 【2606.11702】MedCTA: A Benchmark for Clinical Tool Agents
Link: https://arxiv.org/abs/2606.11702
Authors: Tajamul Ashraf,Hyewon Jeong,Fida Mohammad Thoker,Bernard Ghanem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: clinically grounded decisions, make clinically grounded, evidence acquisition, make clinically, simple recognition
Comments: Project Page: [this https URL](https://ivul-kaust.github.io/MedCTA/) Code: [this https URL](https://github.com/IVUL-KAUST/MedCTA) Data: [this https URL](https://huggingface.co/datasets/IVUL-KAUST/MedCTA)
Click to view abstract
Abstract:To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are publicly available via the code and data links listed in the comments above.
56. 【2606.11688】Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents
Link: https://arxiv.org/abs/2606.11688
Authors: Youwang Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Long-horizon LLM agents, Long-horizon LLM, confidently report success, LLM agents, human watching
Comments: Preprint. Code: [this https URL](https://github.com/EpistemicaLab/goal-compiled-autopilot)
Click to view abstract
Abstract:Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
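The "hard floor" idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Step`, `Machine`, `tick`, and `run` are hypothetical, and the gates are plain Python callables standing in for falsifiable checks. The point is only that a terminal "done" is reachable solely through gates that actually executed and passed, and that any failure degrades to a stall.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One plan step with a falsifiable gate (hypothetical structure)."""
    name: str
    action: Callable[[Dict], None]
    gate: Callable[[Dict], bool]


@dataclass
class Machine:
    """All working state lives here, so each tick needs no other memory."""
    steps: List[Step]
    cursor: int = 0
    passed: List[str] = field(default_factory=list)
    status: str = "running"  # "running", "done", or "stalled"


def tick(machine: Machine, world: Dict) -> Machine:
    """Advance by at most one step; the gate must run and pass to move on."""
    if machine.status != "running":
        return machine
    if machine.cursor >= len(machine.steps):
        # Empty plan: no gate ever ran, so success may not be claimed.
        machine.status = "stalled"
        return machine
    step = machine.steps[machine.cursor]
    step.action(world)
    if not step.gate(world):
        machine.status = "stalled"  # honest stall instead of a success claim
        return machine
    machine.passed.append(step.name)
    machine.cursor += 1
    if machine.cursor == len(machine.steps):
        # Hard floor: "done" only if every step's gate is on record as passed.
        all_passed = machine.passed == [s.name for s in machine.steps]
        machine.status = "done" if all_passed else "stalled"
    return machine


def run(machine: Machine, world: Dict, max_ticks: int = 100) -> str:
    """Drive the machine one tick at a time, as a scheduler would."""
    for _ in range(max_ticks):
        if machine.status != "running":
            break
        tick(machine, world)
    return machine.status
```

In this sketch a step whose gate fails leaves the machine in `"stalled"`, which matches the abstract's claim that the worst case is an honest stall rather than a fabricated success. Gate soundness and plan coverage are assumed here, not checked.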
57. 【2606.11686】Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness
Link: https://arxiv.org/abs/2606.11686
Authors: Sawyer Zhang,Alexander Wang,Sophie Lei
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: evaluate LLM agents, evaluate LLM, LLM agents, agent regressed, LLM
Comments: 12 pages, 2 figures, 5 tables
Click to view abstract
Abstract:End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.
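The masking effect described in this abstract (a small aggregate drop hiding a large per-slice drop) is easy to reproduce with a toy gate. The sketch below is illustrative only: the slice names, the result format, and the functions `aggregate_rate` and `gate` are assumptions, not the paper's harness.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical result format: slice name -> pass/fail outcome per test case.
Results = Dict[str, List[bool]]


def pass_rate(outcomes: List[bool]) -> float:
    return sum(outcomes) / len(outcomes)


def aggregate_rate(results: Results) -> float:
    """Single pooled pass-rate over every case in every slice."""
    flat = [ok for outcomes in results.values() for ok in outcomes]
    return pass_rate(flat)


def gate(results: Results,
         baseline: Dict[str, float]) -> List[Tuple[str, Optional[float]]]:
    """Return (slice, drop in percentage points) for slices below baseline.

    A slice with no exercised cases is reported with a drop of None rather
    than scored, loosely mirroring the coverage-honesty idea in the abstract.
    """
    failures: List[Tuple[str, Optional[float]]] = []
    for name, locked in baseline.items():
        outcomes = results.get(name, [])
        if not outcomes:
            failures.append((name, None))
            continue
        drop_pp = (locked - pass_rate(outcomes)) * 100
        if drop_pp > 0:
            failures.append((name, round(drop_pp, 1)))
    return failures


# Toy data: ten slices of ten cases each, all passing at baseline.
baseline = {f"slice_{i}": 1.0 for i in range(9)}
baseline["routing"] = 1.0
results = {name: [True] * 10 for name in baseline}
results["routing"] = [True] * 5 + [False] * 5  # injected regression
```

With this toy data the pooled pass-rate falls by only 5 pp (from 100% to 95%), while the `routing` slice falls by 50 pp, so a per-slice gate localizes a regression that the aggregate number barely registers.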
58. 【2606.11681】UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction
Link: https://arxiv.org/abs/2606.11681
Authors: Sangmin Lee,Eekgyun Ahn,Woongjib Choi,Hong-Goo Kang
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Keywords: massively multilingual TTS, Romanized transcription-based, multilingual TTS systems, massively multilingual, multilingual TTS
Comments: Accepted to Interspeech 2026
Click to view abstract
Abstract:We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.
59. 【2606.11680】Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents
Link: https://arxiv.org/abs/2606.11680
Authors: Hao-Lun Hsu,Nikki Lijing Kuang,Boyi Liu,Zhewei Yao,Yuxiong He
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language model, Large language, long-horizon tasks due, growing input contexts, language model
Comments:
Click to view abstract
Abstract:Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.
60. 【2606.11678】Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment
Link: https://arxiv.org/abs/2606.11678
Authors: Yijie Deng,He Zhu,Wen Wang,Junyou Su,Minxin Chen,Wenjia Zhang
Categories: Computation and Language (cs.CL)
Keywords: Research Strategy, Urban Planning Bench, professional planning knowledge, raises a key, rise of large
Comments:
Click to view abstract
Abstract:Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.
61. 【2606.11654】The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience
Link: https://arxiv.org/abs/2606.11654
Authors: Kazuki Nakayashiki,Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Keywords: social highlighter, lead, aggregate crowd salience, edge, model beats lead
Comments: 10 pages, 3 figures, 4 tables
Click to view abstract
Abstract:A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
62. 【2606.11648】Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs
Link: https://arxiv.org/abs/2606.11648
Authors: Kazuki Iwahana,Masaru Matsubayashi,Takuma Koyama,Toshiki Shibahara,Kenichiro Omintato,Akira Ito
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, reliability of Large, Backdoor, Language Models
Comments:
Click to view abstract
Abstract:Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs), as they cause models to behave normally on clean inputs while producing attacker-specified responses when hidden triggers are present. Removing such unknown backdoors is particularly challenging when the defender does not know the backdoor attack types or the internal mechanisms formed through backdoor training. In this work, we propose a simple but effective backdoor removal method based on shared internal mechanisms across different backdoors. First, we show that different backdoors with the same task (attack objective) induce similar trigger-activated changes in the internal activations. Motivated by this observation, our method intentionally embeds a backdoor with a known trigger (dummy backdoor) and then removes it through further fine-tuning on dummy-triggered inputs paired with clean responses. Since the dummy backdoor and the unknown backdoor can rely on shared internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. We evaluate our method on three backdoor attack types across multiple model families. Experimental results show that our method substantially reduces the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defense methods in both backdoor removal effectiveness and utility preservation. These findings suggest that a defender-controllable backdoor can serve as a helpful proxy for mitigating unknown backdoors in generative LLMs.
63. 【2606.11643】Improving Cross-Format Robustness in Language Models with Multi-Format Training
Link: https://arxiv.org/abs/2606.11643
Authors: June M. Liu,Shaomian Zheng,He Cao,Dingnan Jin,Qing Cui,Jun Zhou
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, semantically equivalent form, question solved correctly, solved correctly
Comments:
Click to view abstract
Abstract:Large language models often remain sensitive to answer format: a question solved correctly in one form may fail in another semantically equivalent form. To study this gap, we define cross-format robustness as the extent to which a model answers the same underlying question consistently across formats. We then compare full-format training with FormatMix, which expands only a subset of training items into multiple equivalent formats using either random or targeted selection. Across GLM4 and Llama-3.1, multi-format supervision consistently improves both task performance and cross-format robustness, whereas multiple-choice question (MCQ)-only supervision brings little benefit and can even reduce robustness. We further find that expanding only about 30% of the training set into multiple formats often recovers most of the gain from full-format training, and this effect appears across the model families and sizes we study. These results suggest that format diversity, rather than additional supervision alone, is the key driver of robustness, and that lightweight multi-format augmentation is a practical way to make LLMs less sensitive to answer format without changing the base model.
64. 【2606.11642】3-Key-Input: Exploring the Theoretical Minimum Keys for Text Entry
Link: https://arxiv.org/abs/2606.11642
Authors: Naoki Kimura
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Keywords: modern language models, language models, reduce the number, endow an ambiguous, ambiguous keyboard
Comments: 6 pages, 1 figure, 7 tables. Published in ICASSP 2026
Click to view abstract
Abstract:How far can we reduce the number of physical keys if we endow an ambiguous keyboard with modern language models? Fewer keys increase hardware design freedom in constrained settings such as assistive devices and mobile form factors. This paper systematically evaluates text entry systems using 2-5 physical keys combined with language-model-based disambiguation. On a 300-sentence English corpus (100 sentences each for Business / Conversational / Technical), we compare key counts (2-5), letter-to-key mappings (layout-based / frequency-based / intentionally worst-case), and decoders (Trie-only, GPT-2 beam search, GPT-4o selection). We find that 3 keys + GPT-4o achieves character error rate (CER) 9.46% and word error rate (WER) 12.20%, reducing CER by 59% relative to 2 keys (CER 23.3%). At 3 keys, the key-stream entropy is 1.54 bits/char; while increasing to 5 keys improves accuracy (CER 5.4%), the marginal gains diminish. Mapping choice has a small impact under standard designs (ΔCER 0.5 pp), and even an intentionally worst mapping degrades CER by only +0.5 pp, whereas Technical sentences yield roughly twice the error rate of Business. These results suggest that, in our evaluated offline setting under a strong LM prior, 3 keys are a practical minimum for general English.
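To make the ambiguity problem concrete, here is a minimal sketch of dictionary-based disambiguation on a three-key layout. Everything in it is an assumption for illustration: the alphabetical split in `KEYS`, the toy vocabulary, and the use of raw word frequency as a stand-in for the paper's Trie, GPT-2, and GPT-4o decoders.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical 3-key layout: the alphabet split into three contiguous groups.
# The paper compares several mappings; this one is for illustration only.
KEYS = {"1": "abcdefghi", "2": "jklmnopqr", "3": "stuvwxyz"}
LETTER_TO_KEY = {ch: key for key, letters in KEYS.items() for ch in letters}


def encode(word: str) -> str:
    """Turn a word into its key sequence (assumes letters a-z only)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(vocab: Dict[str, int]) -> Dict[str, List[str]]:
    """Map each key sequence to its candidate words, most frequent first."""
    index: Dict[str, List[str]] = defaultdict(list)
    for word in vocab:
        index[encode(word)].append(word)
    for code in index:
        index[code].sort(key=lambda w: -vocab[w])
    return dict(index)


def decode(code: str, index: Dict[str, List[str]]) -> str:
    """Pick the most frequent candidate; '?' marks an unknown sequence."""
    candidates = index.get(code, [])
    return candidates[0] if candidates else "?"


# Toy vocabulary with made-up frequencies.
vocab = {"cat": 50, "bat": 20, "dog": 40}
index = build_index(vocab)
```

Under this layout "cat" and "bat" collide on the same key sequence `113`, and the frequency prior resolves the collision in favour of "cat". A context-aware language model would replace that unigram prior. As a sanity check on the reported entropy, three equiprobable keys carry at most log2(3) ≈ 1.585 bits per keystroke, so the abstract's 1.54 bits/char sits close to that ceiling.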
65. 【2606.11639】Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models
Link: https://arxiv.org/abs/2606.11639
Authors: Catherine Bao,Maneesha Rani Saha,Neal Patwari
Categories: Computation and Language (cs.CL)
Keywords: International Phonetic Alphabet, ASR systems, automatic speech recognition, demographic biases related, imbalanced training data
Comments:
Click to view abstract
Abstract:The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies focused on standard grapheme-based ASR systems with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
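The abstract does not define Soft PER precisely, so the following is only one plausible reading: a Levenshtein distance in which substitutions between phonemes judged similar cost less than a full error. The similarity pairs in `SIMILAR` and the `similar_cost` parameter are hypothetical choices, not taken from the paper.

```python
from typing import FrozenSet, Sequence, Set

# Hypothetical similarity pairs; the paper's actual classes are not given
# in the abstract.
SIMILAR: Set[FrozenSet[str]] = {
    frozenset({"ɪ", "i"}),
    frozenset({"t", "ɾ"}),
}


def sub_cost(a: str, b: str, similar_cost: float) -> float:
    """Cost of substituting phoneme a with phoneme b."""
    if a == b:
        return 0.0
    if frozenset({a, b}) in SIMILAR:
        return similar_cost
    return 1.0


def edit_distance(ref: Sequence[str], hyp: Sequence[str],
                  similar_cost: float = 1.0) -> float:
    """Levenshtein distance with a tunable cost for similar substitutions."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, 1):
        curr = [float(i)]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1.0,                            # deletion
                curr[j - 1] + 1.0,                        # insertion
                prev[j - 1] + sub_cost(r, h, similar_cost),  # substitution
            ))
        prev = curr
    return prev[-1]


def per(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Standard phoneme error rate: every substitution costs 1."""
    return edit_distance(ref, hyp, similar_cost=1.0) / len(ref)


def soft_per(ref: Sequence[str], hyp: Sequence[str],
             similar_cost: float = 0.0) -> float:
    """Soft variant: similar substitutions cost `similar_cost` instead of 1."""
    return edit_distance(ref, hyp, similar_cost) / len(ref)
```

For a reference `["s", "ɪ", "t", "i"]` and a hypothesis `["s", "i", "ɾ", "i"]`, standard PER counts two substitutions out of four phonemes, while the soft variant with `similar_cost=0.0` counts none, since both substitutions fall inside the assumed similarity pairs.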
66. 【2606.11613】Factions Within, Uncertain Across: Within-Document Reader Sub-Groups in Social Highlighting
Link: https://arxiv.org/abs/2606.11613
Authors: Kazuki Nakayashiki,Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Keywords: single consensus, people highlight, internally structured, stable reader trait, document
Comments: 11 pages, 3 figures, 3 tables
Click to view abstract
Abstract:When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different things -- and is that structure a stable property of a reader or of the document? Building on prior work showing an individual's within-document highlighting signal is a whisper while individuality lives in selection, we ask the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups -- pairs agree far beyond what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document accounts for about 40% of this excess; the majority survives as finer reader-specific agreement (z=+3.6, 77% significant). So the within-document crowd is, in a descriptive sense, factional. Experiment 2: is that grouping a stable reader trait? Here we are honest about power. The cross-document split-half reproducibility of a pair's agreement is near zero pooled (+0.078 and 0.000 in two separately drawn samples), and a power calibration shows the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k=4), point estimates are positive but small-sample, imprecise across the separately drawn samples, never significant, and attenuate under the region-preserving null. We therefore leave cross-document stability unresolved: the data is consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether its factions follow the reader across documents is, honestly, beyond our reach.
67. 【2606.11609】Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection
Link: https://arxiv.org/abs/2606.11609
Authors: Meysam Sabbaghan,Arman Zareian Jahromi,Doina Caragea
Categories: Computation and Language (cs.CL)
Keywords: detection requires identifying, rhetorically framed, requires identifying, identifying an author, author position
Comments:
Click to view abstract
Abstract:Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single-pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self-consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi-agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label-level voting to reasoning-level synthesis. The framework employs a Manager-Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning-only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval-2016, P-Stance, and COVID-19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context-dependent stance cases, achieving 86.07 Macro-F1 on COVID-19 and 82.90 on SemEval-2016, while remaining competitive on more explicit stance datasets such as P-Stance. These findings suggest that adaptive reasoning-level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.
68. 【2606.11599】When is Your LLM Steerable?
Link: https://arxiv.org/abs/2606.11599
Authors: Chenrui Fan,Yize Cheng,Ming Li,Soheil Feizi,Tianyi Zhou
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Activation steering offers, control language models', language models' behavior, fails heavily depends, Activation steering
Comments:
Click to view abstract
Abstract:Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost.
69. 【2606.11585】Kuramoto Attention: Synchronizing Self-Attention on the Torus
Link: https://arxiv.org/abs/2606.11585
Authors: Joshua Nunley
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO)
Keywords: introduce Kuramoto attention, introduce Kuramoto, layer scores tokens, hidden coordinate, Kuramoto coupling term
Comments: 13 pages, 2 figures, 3 tables
Click to view abstract
Abstract:We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
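The update rule quoted in this abstract is compact enough to sketch directly. The code below is a plain-Python illustration under stated assumptions, not the author's layer: it uses a single head, a fixed per-coordinate `gate` vector, causal attention over positions u ≤ t, and a hypothetical `step` size. Rotary position and any learned projections are omitted.

```python
import math
from typing import List


def kuramoto_attention(theta: List[List[float]], gate: List[float],
                       step: float = 1.0) -> List[List[float]]:
    """One phase update for a sequence of angle vectors.

    theta[t][d] is the angle of coordinate d at token t. Scores are gated
    cosine similarities, attention is a causal softmax, and each coordinate
    moves by step * sum_u A[t][u] * sin(theta[u][d] - theta[t][d]).
    """
    out = []
    for t in range(len(theta)):
        # Gated cosine similarity between token t and each earlier token u.
        scores = [
            sum(g * math.cos(a - b)
                for g, a, b in zip(gate, theta[t], theta[u]))
            for u in range(t + 1)
        ]
        # Numerically stable softmax over the causal window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attn = [w / total for w in weights]
        # Kuramoto coupling term, applied per coordinate.
        new_row = []
        for d, angle in enumerate(theta[t]):
            coupling = sum(attn[u] * math.sin(theta[u][d] - angle)
                           for u in range(t + 1))
            new_row.append(angle + step * coupling)
        out.append(new_row)
    return out
```

The coupling sum is the tangent component of the attention-weighted circular mean: writing the mean as m = Σ_u A[t][u]·exp(iθ_u), its component tangent to the circle at θ_t is Im(m·exp(−iθ_t)) = Σ_u A[t][u]·sin(θ_u − θ_t). In this sketch the first token never moves (its only neighbour is itself), and a later token is pulled toward the phases it attends to.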
70. 【2606.11562】GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs
Link: https://arxiv.org/abs/2606.11562
Authors: Zhuoyi Peng, Jingzhou Jiang, Hanlin Gu, Lixin Fan, Yi Yang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Graph analysis underlies, laundering rings, drug repurposing, user preference, analysis underlies
Comments: Code: [this https URL](https://github.com/graphinfer/GraphInfer-Bench); Dataset: [this https URL](https://huggingface.co/datasets/graphinfer/graphinfer)
Abstract:Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.
71. 【2606.11552】Teaching Diffusion to Speculate Left-to-Right
Link: https://arxiv.org/abs/2606.11552
Authors: Lexington Whalen, Yuki Ito, Ryo Sakamoto
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: achieve remarkable performance, process incurs substantial, Large language models, decoding process incurs, incurs substantial inference
Comments: 13 pages, technical report
Abstract:Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
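To make the "accepted prefix" and "expected accepted length" in this abstract concrete, here is a minimal sketch. It is not the paper's training loss: it assumes greedy exact-match verification (the paper keeps rejection-sampling exactness) and independent per-position acceptance probabilities, and both function names are hypothetical.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count leading draft tokens that match the target's tokens.

    Verification is left-to-right, so the first mismatch ends the accepted
    prefix even if later draft tokens happen to be correct.
    """
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

def expected_accepted_length(accept_probs):
    """Expected accepted prefix length under independent acceptance (assumption).

    E[L] = sum_k prod_{i<=k} p_i, because position k is accepted only if
    every earlier position was accepted as well.
    """
    total, running = 0.0, 1.0
    for p in accept_probs:
        running *= p      # probability that positions 1..k are all accepted
        total += running
    return total

# A correct fourth token is wasted once the third one is rejected.
print(accepted_prefix_length([5, 7, 9, 2], [5, 7, 1, 2]))  # 2
print(expected_accepted_length([1.0, 0.5, 0.5]))           # 1.75
```

The product structure shows why early positions matter most: an error at position k zeroes the contribution of every later position, which is the asymmetry that position weighting and a first-error loss are meant to address.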
72. 【2606.11542】Pretrained self-supervised speech models can recognize unseen consonants
Link: https://arxiv.org/abs/2606.11542
Authors: Chihiro Taguchi, Éric Le Ferrand, Hirosi Nakagawa, Hitomi Ono, Kanji Kato, Emily Prud'hommeaux, David Chiang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Modern pretrained self-supervised, large-scale audio data, automatic speech recognition, Modern pretrained, contextualized representations
Comments: 6 pages, 3 figures, 3 tables, accepted at Interspeech 2026
Abstract:Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.
73. 【2606.11531】Measuring language complexity from hierarchical reuse of recurring patterns
Link: https://arxiv.org/abs/2606.11531
Authors: Junyi Zhou, Rui Liu, Pengyu Liu, Yu Liu
Categories: Computation and Language (cs.CL); Information Theory (cs.IT)
Keywords: algorithmic information theory, language complexity grounded, ladderpath approach, information theory, Parallel Universal Dependencies
Comments: 17 pages, 4 figures
Abstract:We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. The ladderpath index is approximately invariant across the languages, and varies much less than the corpus length. This is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. The reusable substructures identified by the ladderpath approach, without any linguistic input, overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, where the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach provides a new interpretation for the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.
74. 【2606.11520】ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories
Link: https://arxiv.org/abs/2606.11520
Authors: Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Training capable, simultaneously captures structured, agents requires data, properties absent, requires data
Comments: 13 pages, 6 figures. Dataset and code: [this https URL](https://github.com/Valiere01/ISE-Trace)
Abstract: Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution, properties absent from existing datasets. We propose ISE (Intent-Simulate-Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50,000 structured intents via a 4D framework (Persona × Domain × Task × Complexity); after deduplication the pool contains 43,956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23,132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning Qwen3-8B on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 on agent tool-use tasks under a standard protocol. This result outperforms zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and dataset at this https URL.
75. 【2606.11512】SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment
Link: https://arxiv.org/abs/2606.11512
Authors: Kaiwen Shi, Zheyuan Zhang, Yanfang Ye
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, language models increasingly, models increasingly express, increasingly express uncertainty
Abstract:Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional calibration problem: the appropriate uncertainty target for a prompt should be estimated from repeated model outputs rather than from an isolated response. However, group rollouts alone are insufficient, since the resulting target must provide a useful training signal. Existing targets only partially satisfy this requirement. We propose SAGE, Semantic-Answer Guided Entropy, a group-level uncertainty target that constructs an answer-conditioned uncertainty geometry over sampled responses. SAGE preserves categorical, numeric, and symbolic answer distinctions while maintaining a smooth and scale-preserving calibration signal. We further apply this target through Group-Uncertainty Preference Optimization, or GUPO, an uncertainty-channel training framework that supervises verbal uncertainty expressions rather than the full response. Experiments across factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.
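The idea of estimating an uncertainty target from repeated model outputs, rather than from one response, can be sketched with a simple baseline. The code below is not SAGE: it treats sampled answers as exact-match strings and returns their normalized Shannon entropy, which is the kind of plain group-level target the abstract says is only partially adequate. The function name and the normalization by `log(n)` are assumptions for illustration.

```python
import math
from collections import Counter

def normalized_answer_entropy(answers):
    """Normalized Shannon entropy of sampled final answers, in [0, 1].

    answers: final answers extracted from n sampled responses to one prompt.
    0.0 means every sample agrees; 1.0 means every sample differs.
    """
    n = len(answers)
    if n < 2:
        return 0.0  # a single sample carries no disagreement signal
    counts = Counter(answers)
    # Sum p * log(1/p) over the distinct answers.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    # Divide by the maximum entropy, reached when all n samples differ.
    return entropy / math.log(n)
```

Exact string matching is the weak point of this baseline: numerically close or semantically equivalent answers count as fully different, which is the distinction the answer-conditioned geometry in the abstract is designed to handle.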
76. 【2606.11502】When Roleplaying, Do Models Believe What They Say?
Link: https://arxiv.org/abs/2606.11502
Authors: Benjamin Sturgeon, David Africa, Sid Black
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: orbits the Sun, Earth orbits, role-playing Aristotle, assert the opposite, Emergent Misalignment
Abstract:Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.
77. 【2606.11499】Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
Link: https://arxiv.org/abs/2606.11499
Authors: Vedant Badoni, Danqi Chen, Xinyi Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: language models depends, models depends critically, modern language models, modern language, depends critically
Comments: 10 pages
Abstract: The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
78. 【2606.11482】Building Social World Models with Large Language Models
Link: https://arxiv.org/abs/2606.11482
Authors: Haofei Yu, Yining Zhao, Guanyu Lin, Jiaxuan You
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Keywords: Understanding and predicting, social beliefs evolve, social beliefs, social, scientific breakthroughs
Comments: 9 pages. ICML 2026
Abstract:Understanding and predicting how social beliefs evolve in response to events -- from policy changes to scientific breakthroughs -- remains a fundamental challenge in social science. Given LLMs' commonsense knowledge and social intelligence, we ask: Can LLMs model the dynamics of social beliefs following social events? In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns state-transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, without the need for explicit human annotations linking events to belief shifts, or for expensive census data. To evaluate SWM, we introduce a benchmark, SWM-bench, derived from real-world prediction markets, specifically Kalshi and Polymarket. SWM-bench includes over 12k data points for social belief prediction tasks spanning diverse domains such as politics, finance, and cryptocurrency. Our experimental results show that SWM significantly outperforms time-series foundation models, achieving state-of-the-art results on Kalshi data and demonstrating competitive performance on Polymarket data, while offering interpretable insights into the underlying mechanisms of social belief dynamics.
79. 【2606.11470】The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
Link: https://arxiv.org/abs/2606.11470
Authors: Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, achieved strong performance, natural language processing, reasoning, Large Language
Abstract:Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.
80. 【2606.11459】APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection
Link: https://arxiv.org/abs/2606.11459
Authors: Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, Models are highly, necessitating automatic prompt
Abstract:Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
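The Easy / Hard / Mixed stratification mentioned in this abstract can be sketched as follows. The abstract does not give the exact rule, so this assumes the simplest reading: an example is Easy if every prompt candidate in the lineage solved it, Hard if none did, and Mixed otherwise. The function name and the input layout are hypothetical.

```python
def stratify_examples(results):
    """Split development examples into easy / hard / mixed tiers.

    results: dict mapping example_id -> list of bool outcomes, one per
    prompt candidate evaluated so far (an assumed data layout).
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, outcomes in results.items():
        if not outcomes:
            continue  # never evaluated yet, so no tier can be assigned
        if all(outcomes):
            tiers["easy"].append(example_id)    # every candidate succeeds
        elif not any(outcomes):
            tiers["hard"].append(example_id)    # every candidate fails
        else:
            tiers["mixed"].append(example_id)   # candidates disagree
    return tiers

history = {"q1": [True, True], "q2": [False, False], "q3": [True, False]}
print(stratify_examples(history))
# {'easy': ['q1'], 'hard': ['q2'], 'mixed': ['q3']}
```

Only the mixed tier separates candidates from one another, which is why spending a fixed evaluation budget there is more informative than re-scoring examples that every prompt passes or fails.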
81. 【2606.11456】AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable
Link: https://arxiv.org/abs/2606.11456
Authors: Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: raises opposing concerns, reach motivated conclusions, scientific analysis raises, analysis raises opposing, Claude Code
Abstract:The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.
82. 【2606.11447】AI Coding Agents Can Reproduce Social Science Findings
Link: https://arxiv.org/abs/2606.11447
Authors: Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi, Atoosa Kasirzadeh, Joshua Tucker
Categories: Computation and Language (cs.CL)
Keywords: Recent anecdotal evidence, Recent anecdotal, sciences remains limited, anecdotal evidence suggests, remains limited
Abstract:Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient, either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.
83. 【2606.11435】Agent Skill Evaluation and Evolution: Frameworks and Benchmarks
Link: https://arxiv.org/abs/2606.11435
Authors: Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas
Categories: Computation and Language (cs.CL)
Keywords: systems are built, growth of agent, transformed how agentic, agentic systems, skill
Abstract:The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is this https URL
84. 【2606.11424】SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing
Link: https://arxiv.org/abs/2606.11424
Authors: Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth
Categories: Computation and Language (cs.CL)
Keywords: Natural language interfaces, Natural language, translate user questions, SQL generation, executable SQL
Comments: 34 pages, 1 figure, 7 tables. Preprint
Abstract: Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguities across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query logs and ambiguity-driven probing. SOMA-SQL constructs a synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
85. 【2606.11420】Context-Aware Multimodal Claim Verification in Spoken Dialogues
Link: https://arxiv.org/abs/2606.11420
Authors: Chaewan Chun, Delvin Ce Zhang, Dongwon Lee
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Keywords: millions absorb claims, millions absorb, podcasts and streams, audio, absorb claims
Abstract:Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.
86. 【2606.11399】Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version
Link: https://arxiv.org/abs/2606.11399
Authors: Trung Duc Anh Dang, Tung Kieu, Sarah Masud
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, reflect homogenized, homogenized values inherited
Comments: 18 pages
Abstract:Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart--Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.
87. 【2606.11387】Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
Link: https://arxiv.org/abs/2606.11387
Authors: Felipe Chavarro Polania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Short pretraining runs, reduce experimental cost, Short pretraining, experimental cost, pretraining runs
Comments: 14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts
Abstract:Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
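The GPU-hour figures in this abstract are consistent with a simple product of candidates, host-seed cells, and hours per run. The sketch below reproduces them under two assumptions that the abstract does not state outright: each run occupies one GPU, and the executed 12-hour branch covers three conditions (the bridge reference, the greedy comparator, and the d8/ar48 sentinel) in four host-seed cells each. The function name is hypothetical.

```python
def continuation_gpu_hours(n_candidates: int, n_cells: int = 4, hours_per_run: float = 12.0) -> float:
    """GPU-hours to continue every candidate in every host-seed cell (assumes 1 GPU per run)."""
    return n_candidates * n_cells * hours_per_run

executed = continuation_gpu_hours(3)    # assumed: bridge, greedy comparator, d8/ar48 sentinel
all_four = continuation_gpu_hours(4)    # counterfactual: all four 60-minute candidates
all_nine = continuation_gpu_hours(9)    # counterfactual: all nine replicated 10-minute candidates
screening = round(169.2 - executed, 1)  # GPU-hours attributable to the screening stages

print(executed, all_four, all_nine, screening)
```

Under these assumptions the three totals match the 144, 192, and 432 GPU-hours quoted above, and the remainder of the 169.2 recorded hours falls to the shorter screening stages.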
88. 【2606.11386】Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
Link: https://arxiv.org/abs/2606.11386
Authors: Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Keywords: Full-duplex spoken language, Full-duplex spoken, spoken language models, enable seamless speech, seamless speech interaction
Comments:
Abstract:Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
89. 【2606.11375】When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis
Link: https://arxiv.org/abs/2606.11375
Authors: Orion Reblitz-Richardson
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Standard linear probing, hidden states achieves, states achieves high, Standard linear, achieves high accuracy
Comments: 22 pages, 5 figures. Code and datasets at [this https URL](https://github.com/deepsteer/deepsteer)
Abstract:Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
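The fragility idea, the noise level at which a probe's accuracy collapses, can be illustrated with a toy sketch. Everything concrete here is an assumption for illustration rather than the paper's protocol: isotropic Gaussian activation noise, a fixed grid of noise levels, a collapse threshold of 0.75, and a synthetic one-direction "layer" built by the hypothetical helper `toy_layer`.

```python
import numpy as np

def probe_accuracy(X, y, w, b):
    """Accuracy of a fixed linear probe (w, b) on activations X with binary labels y."""
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))

def fragility(X, y, w, b, sigmas, collapse_acc=0.75, n_trials=20, seed=0):
    """Smallest noise level in `sigmas` at which mean probe accuracy drops below `collapse_acc`."""
    rng = np.random.default_rng(seed)
    for sigma in sigmas:
        accs = [probe_accuracy(X + rng.normal(0.0, sigma, X.shape), y, w, b)
                for _ in range(n_trials)]
        if np.mean(accs) < collapse_acc:
            return sigma
    return None  # accuracy never collapsed on this grid

def toy_layer(margin, n=200, dim=8):
    """Synthetic activations: two classes separated by 2 * margin along the first axis."""
    y = np.repeat([0, 1], n // 2)
    X = np.zeros((n, dim))
    X[:, 0] = np.where(y == 1, margin, -margin)
    return X, y

w, b = np.eye(8)[0], 0.0                      # probe reads the first coordinate
sigmas = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]      # assumed noise grid
narrow = fragility(*toy_layer(0.5), w, b, sigmas)
wide = fragility(*toy_layer(2.0), w, b, sigmas)
print(narrow, wide)
```

Both toy layers give perfect clean probe accuracy, yet the wide-margin layer tolerates more noise before collapsing, which is the kind of distinction accuracy alone cannot show.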
90. 【2606.11371】The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales
Link: https://arxiv.org/abs/2606.11371
Authors: Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Keywords: large language models, language models, large language, varying semantic content, unfolds over time
Comments: 45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language
Abstract:Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
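As a rough illustration of an autocorrelation-window measure, the sketch below computes ACW-0 for a one-dimensional series, assuming the common definition of ACW-0 as the first lag at which the autocorrelation function reaches zero or below. The abstract does not spell out the definition, so that reading, the function name `acw0`, and the two toy series are assumptions.

```python
import numpy as np

def acw0(series):
    """First lag (in samples) where the autocorrelation is <= 0, or None if it never is."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                                    # remove the mean before correlating
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # keep non-negative lags only
    if acf[0] == 0:
        return None                                     # constant series: ACF undefined
    acf = acf / acf[0]                                  # normalize so that acf[0] == 1
    crossings = np.nonzero(acf <= 0)[0]
    return int(crossings[0]) if crossings.size else None

t = np.arange(200)
slow = acw0(np.sin(2 * np.pi * t / 40))        # slowly varying series
fast = acw0(np.where(t % 2 == 0, 1.0, -1.0))   # series that flips sign every sample
print(slow, fast)
```

A slowly varying series yields a longer window than a rapidly alternating one, which is the contrast the ACW-based comparison of generic and specific segments relies on.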
91. 【2606.11361】A PubMed-Scale Dataset of Structured Biomedical Abstracts
Link: https://arxiv.org/abs/2606.11361
Authors: Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: biomedical literature processing, facilitating information retrieval, text mining, literature processing, knowledge synthesis
Comments: Data and code for this work are available at [this https URL](https://doi.org/10.5281/zenodo.20336717) and [this https URL](https://github.com/BIDS-Xu-Lab/StructuredPubMed), respectively
Abstract:Structured abstracts are important for biomedical literature processing, by facilitating information retrieval, text mining, and knowledge synthesis. However, a vast portion of abstracts indexed in PubMed remain unstructured, presenting a significant bottleneck for downstream text-processing workflows and applications. To resolve this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two distinct subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. This dataset can be utilized to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at an unprecedented PubMed-wide scale.
92. 【2606.11350】When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval
Link: https://arxiv.org/abs/2606.11350
Authors: Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Retrieval-augmented generation degrades, loses discriminative power, similarity loses discriminative, increasingly returns semantically, returns semantically similar
Comments: 24 pages, 8 figures, 30 tables. Preprint under review
Abstract:Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power, and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 ($p < 0.05$). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence in the results, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.
93. 【2606.11337】Can AI Agents Synthesize Scientific Conclusions?
Link: https://arxiv.org/abs/2606.11337
Authors: Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: increasingly retrieve evidence, agents increasingly retrieve, reason across sources, retrieve evidence, consequential decisions
Comments: 79 pages, 34 figures, 17 tables. Under Submission
Abstract:Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
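The scoring described in this abstract, factual precision and recall over atomic facts, reduces to a small set computation once facts have been extracted and matched. The sketch below is a simplified illustration that assumes facts are already normalized strings and that matching is exact string equality. The benchmark's actual decomposition and matching pipeline is not described in the abstract, and the example facts are made up.

```python
def factual_prf(predicted_facts, reference_facts):
    """Precision, recall, and F1 over atomic facts (exact-match assumption)."""
    pred, ref = set(predicted_facts), set(reference_facts)
    supported = pred & ref                       # predicted facts also in the reference
    precision = len(supported) / len(pred) if pred else 0.0
    recall = len(supported) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical atomic facts for one question
predicted = ["drug A lowers blood pressure", "effect holds in adults",
             "no serious adverse events", "evidence quality is high"]
reference = ["drug A lowers blood pressure", "effect holds in adults",
             "evidence quality is low"]
precision, recall, f1 = factual_prf(predicted, reference)
print(precision, recall, f1)
```

Precision penalizes unsupported statements in the generated conclusion, while recall penalizes reference facts that the conclusion omits. This is how the abstract separates correctness from comprehensiveness.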
94. 【2606.11316】Schützen: Evaluating LLM Safety in Bulgarian and German Contexts
Link: https://arxiv.org/abs/2606.11316
Authors: Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, professional domains, including the generation, disrespectful content, Large language
Comments: 19 pages, 13 tables, 12 figures
Abstract:Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts. To address this gap, we introduce Schützen: a German--Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria. Datasets and code are available at this https URL. Warning: this paper contains examples that may be offensive, harmful, or biased.
95. 【2606.11290】FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse
Link: https://arxiv.org/abs/2606.11290
Authors: Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Model, Large Language, Language Model, based multi-agent systems, based multi-agent
Comments:
Abstract:Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.
96. 【2606.11270】Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
Link: https://arxiv.org/abs/2606.11270
Authors: Uwe Konig, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: language model intended, transfer undesirable characteristics, undesirable characteristics, Distillation, subliminal learning
Comments:
Abstract:Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics, if they are present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts with GPT-4.1, serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).
97. 【2606.11257】Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
Link: https://arxiv.org/abs/2606.11257
Authors: Zhiyuan Cheng, Longying Lai
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Keywords: large language model, language model, Qualcomm Hexagon NPU, Retrieval-Augmented Generation, large language
Comments: 9 pages, 2 figures, 6 tables
Abstract:Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.
98. 【2606.11243】ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation
Link: https://arxiv.org/abs/2606.11243
Authors: Chuanzhen Wang, Meade Cleti, Pete Jano
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: novo protein generation, synthetic biology, novo protein, transformative potential, potential in therapeutic
Comments: 23 pages
Abstract:De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4 fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves 58.9% success rate compared to 41.2% for RFDiffusion.
99. 【2606.11232】Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs
Link: https://arxiv.org/abs/2606.11232
Authors: Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Existing LLM moral, Existing LLM, moral, Existing, Foundations Theory foundations
Comments:
Abstract:Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
100. 【2606.11222】A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries
Link: https://arxiv.org/abs/2606.11222
Authors: Dmitriy Kompaneets
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT)
Keywords: text carry, Shannon theory measures, text sentence embeddings, theory measures uncertainty, text
Comments: 19 pages. Code and data: [this https URL](https://github.com/dkompaneets/geometric_profile_semantic_information)
Abstract:How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.
MSC classes: 94A17 (Primary), 68T50 (Secondary)
101. 【2606.11220】LifeSentence: Language models can encode human life course trajectories from longitudinal panel data
Link: https://arxiv.org/abs/2606.11220
Authors: Samuel Liu, Muchen Xi, William Yeoh, Joshua J. Jackson
Subjects: Computation and Language (cs.CL)
Keywords: Forecasting human life, individuals attain long, healthy lives, outcomes is important, important to gain
Comments:
Abstract:Forecasting human life outcomes is important to gain insights into how individuals attain long and healthy lives. Conventional statistical approaches yield limited accuracy, potentially due to discarding the sequential structure of the life course. Modern methods such as transformer architectures require large scale training data that most longitudinal panel studies lack. Here we introduce LifeSentence, a model for life-course reasoning that bridges large language models with longitudinal panel data. By representing each life event as a structured natural-language record and instruction-tuning a pretrained 24-billion-parameter language model across an 18-task evaluation taxonomy spanning prediction, robustness and reasoning, LifeSentence supplements panel data with distributional knowledge already encoded during pretraining. Trained on approximately 65,000 individuals from the German Socio-Economic Panel - roughly 45 times fewer than prior transformer-based approaches - LifeSentence outperforms classical and deep learning baselines across all task families, achieving a threefold improvement in joint event-and-timing prediction from best baselines and 91.2% Kendall's tau when reconstructing chronological order from timestamp-stripped event sets. Without explicit supervision, the model recovers documented patterns of social stratification, including the education premium, the gender wage gap and the motherhood penalty, from discrete event sequences alone. A natural-language interface further enables qualitatively new research queries, such as connecting an early-life history to a specified late-life endpoint, establishing LifeSentence as both a predictive tool and a probe for counterfactual exploration of human biographies.
102. 【2606.11219】Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Link: https://arxiv.org/abs/2606.11219
Authors: Chibuzor Okocha, Christan Grant
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Keywords: remains insufficiently benchmarked, accuracy remains insufficiently, Question-Answering accuracy remains, Audio language models, perform semantic reasoning
Comments: Accepted to ACL
Abstract:Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, text-to-audio retrieval, captioning, and question-answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations and are intended to guide more robust and equitable ALM design and assessment.
103. 【2606.11213】Beyond Compaction: Structured Context Eviction for Long-Horizon Agents
Link: https://arxiv.org/abs/2606.11213
Authors: Andrew Semenov, Svyatoslav Dorofeev
Subjects: Computation and Language (cs.CL)
Keywords: Context Window Lifecycle, present Context Window, unbounded working horizon, Window Lifecycle, effectively unbounded working
Comments:
Abstract:We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically-aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.
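A deterministic, LLM-free eviction pass of the kind this abstract describes might look like the following sketch. It is a deliberately reduced illustration, not the paper's policy: the `Episode` fields, the three episode kinds, and the single eviction tier (persisted action episodes, oldest first) are assumptions, and the dependency graph is omitted entirely.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    idx: int                 # position in the session (lower = older)
    kind: str                # assumed kinds: "user", "explore", "action"
    tokens: int
    persisted: bool = False  # True if the episode's effects already live in the environment

def evict(episodes, budget):
    """Drop persisted action episodes, oldest first, until the token total fits the budget."""
    kept = list(episodes)
    total = sum(e.tokens for e in kept)
    candidates = sorted((e for e in kept if e.kind == "action" and e.persisted),
                        key=lambda e: e.idx)
    for episode in candidates:
        if total <= budget:
            break
        kept.remove(episode)
        total -= episode.tokens
    return kept              # user turns and exploratory context are never touched

session = [
    Episode(0, "user", 100),
    Episode(1, "action", 500, persisted=True),
    Episode(2, "explore", 300),
    Episode(3, "action", 400, persisted=True),
    Episode(4, "action", 200),
]
kept = evict(session, budget=1100)
print([e.idx for e in kept])
```

Because the pass only sorts and filters annotated episodes, its cost and outcome are predictable, in contrast to summarization-based compaction that calls a model.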
104. 【2606.11212】EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA
Link:https://arxiv.org/abs/2606.11212
Authors:Jaspreet Singh Nahal
Categories:Computation and Language (cs.CL)
Keywords:Standard Retrieval-Augmented Generation, incurring unnecessary computation, propagating low-quality context, Standard Retrieval-Augmented, generation unconditionally
Comments: 12 pages, 10 figures, 6 tables. Code and evaluation scripts available at: [this https URL](https://github.com/merciless-admiral-3083/EverydayGPT) . This paper studies routing strategies for hybrid GPT-RAG systems under resource constraints, focusing on efficiency-safety tradeoffs rather than state-of-the-art accuracy
Click to view abstract
Abstract:Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
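A confidence-gated router of the kind described can be sketched as follows. The threshold values, the parameter names, and the two-way `"rag"`/`"gpt"` decision are assumptions for illustration; the abstract does not give the actual policy. The latency helper only uses the two per-path latencies quoted in the abstract.

```python
def route(retrieval_distance, extraction_score, max_distance=0.35, min_score=0.6):
    """Pick the fast RAG path only when both signals look good.

    max_distance and min_score are hypothetical thresholds, not values
    taken from the paper.
    """
    if retrieval_distance <= max_distance and extraction_score >= min_score:
        return "rag"
    return "gpt"


def mean_latency(rag_fraction, rag_latency_s=0.045, gpt_latency_s=5.9):
    """Expected latency if each query costs exactly one of the two path latencies."""
    return rag_fraction * rag_latency_s + (1 - rag_fraction) * gpt_latency_s
```

Under this simplified cost model, 5.9 s / 45 ms is about 131, in line with the "over 120x" figure, and routing 85 percent of queries to the fast path gives a mean of about 0.92 s, roughly 6.4x below GPT-only, which is close to the reported 6.3x.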
105. 【2606.11211】Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models
Link:https://arxiv.org/abs/2606.11211
Authors:Prakul Sunil Hiremath,Harshit R. Hiremath
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:express calibrated uncertainty, safe deployment, ability of large, express calibrated, calibrated uncertainty
Comments: 31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available
Click to view abstract
Abstract:The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
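For readers unfamiliar with the metric, Expected Calibration Error is commonly estimated by binning predictions by confidence and averaging the gap between confidence and accuracy in each bin. The sketch below uses equal-width bins, which is a common default; the abstract does not say which binning the authors use, so treat the bin count and scheme as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |mean confidence - accuracy|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

Computing this value separately for each reasoning budget B would give the ECE(B) curve whose non-monotonic shape the abstract describes.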
106. 【2606.11210】T2MM: An LLM Supported Architecture For Inquiry-Based Modeling
Link:https://arxiv.org/abs/2606.11210
Authors:John Kos,Rudra Singh,Ashok Goel
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Keywords:Experimental Research Assistant, foundational practice, practice in science, relies on visualization, Virtual Experimental Research
Comments: 16 pages, 4 figures
Click to view abstract
Abstract:Model Construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated in education contexts to support learning. However, these tools lack visual interactivity that is required by some learning contexts. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM supported architecture that assists in model construction within the open inquiry ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.
107. 【2606.11209】ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
Link:https://arxiv.org/abs/2606.11209
Authors:Jingpei Wu,Xiao Han,Weixiang Shen,Boer Zhang,Zifeng Ding,Volker Tresp
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:Visual question answering, question answering increasingly, Relative Policy Optimization, Group Relative Policy, Visual question
Comments: Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure
Click to view abstract
Abstract:Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct
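The rollout-based step reward described above (sample several continuations from each intermediate step and use the fraction that reach a verified answer) can be sketched as below. The callables `sample_continuation` and `verify` are hypothetical stand-ins for the policy model and the final-answer checker, and the default rollout count is an assumption.

```python
import random


def step_rewards(steps, sample_continuation, verify, n_rollouts=8, rng=None):
    """Return one reward per step: the empirical success rate of rollouts from that step.

    sample_continuation(prefix, rng) -> final answer produced after the given prefix
    verify(answer) -> True if the final answer is correct
    Both callables are placeholders supplied by the caller.
    """
    rng = rng or random.Random(0)
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        successes = sum(
            1 for _ in range(n_rollouts) if verify(sample_continuation(prefix, rng))
        )
        rewards.append(successes / n_rollouts)
    return rewards
```

The cost grows with the number of steps times `n_rollouts`, which is the price paid for avoiding a separately trained process reward model.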
108. 【2606.11208】BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts
Link:https://arxiv.org/abs/2606.11208
Authors:Elias Hossain,Sanjeda Sara Jennifer,Sabera Akter Bushra,Niloofar Yousefi
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:Biomedical findings, true contradictions, Abstract, divergence, claims locally valid
Comments:
Click to view abstract
Abstract:Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.
109. 【2606.11207】From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference
Link:https://arxiv.org/abs/2606.11207
Authors:Liu hung ming
Categories:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:including purchase intent, e-commerce session data, extracting structured semantic, shared element library, driving pluggable inference
Comments: 20 pages, 9 tables
Click to view abstract
Abstract:We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start this http URL report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.
110. 【2606.11206】Compatibility-Aware Dynamic Fine-Tuning for Large Language Models
Link:https://arxiv.org/abs/2606.11206
Authors:Yucheng Zhou,Junwei Sheng,Qianning Wang,Jianbing Shen
Categories:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:aligning large language, large language models, Dynamic Fine-Tuning, predominant paradigm, paradigm for aligning
Comments: ACL 2026
Click to view abstract
Abstract:Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.
111. 【2606.11205】Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention
Link:https://arxiv.org/abs/2606.11205
Authors:Matthew James Buchan
Categories:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:shift LLM behaviour, LLM behaviour, shift LLM, factually correct statements, correct statements
Comments: 18 pages, 9 figures, accepted to TAIS 2026
Click to view abstract
Abstract:Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.
112. 【2606.11204】Benchmarking Large Language Models for Safety Data Extraction
Link:https://arxiv.org/abs/2606.11204
Authors:Jonas Grill,Thomas Bayer,Sören Berlinger
Categories:Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords:Safety Data Sheets, traditional rule-based methods, heterogeneous document formats, industrial safety due, SDS data extraction
Comments: 18 pages, 8 figures, submitted to Applied Intelligence
Click to view abstract
Abstract:Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.
113. 【2606.11203】LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis
Link:https://arxiv.org/abs/2606.11203
Authors:Faruk Alpay,Bugra Kilictas
Categories:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:Structured sequence generation, Structured sequence, single output, sequence generation, generation often requires
Comments: 19 pages. Code and benchmark files available at [this https URL](https://github.com/farukalpay/latticebridge)
Click to view abstract
Abstract:Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. Standard decoding methods may assign high probability to fluent continuations while placing low mass on continuations that realize all required anchors jointly. We study this regime as a rare-event sequential inference problem. LatticeBridge combines a compact prefix language model, instance-compiled surface automata, and a twisted sequential Monte Carlo (SMC) decoder with resampling, multilevel splitting, and a source-support proposal term derived from instance-provided phrases. The constraint representation is compiled from each input instance and does not rely on manually curated lexical classes. On 2,610 attainable validation tasks spanning CommonGen, E2E NLG, and WikiBio, the particle decoder improves exact anchor satisfaction and mean anchor coverage over greedy, beam-filtered, and best-of-k ancestral baselines under a shared proposal model. Since exact anchor satisfaction alone does not rule out unsupported attribute substitutions, the evaluation reports required-anchor coverage, source coverage, source-intrusion diagnostics, overlap, runtime, and particle statistics jointly. The benchmark characterizes the faithfulness-overlap-latency frontier under a fixed proposal model.
114. 【2606.11202】One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection
Link:https://arxiv.org/abs/2606.11202
Authors:Shuyu Jiang,Kaiyu Xu,Xingshu Chen,Hao Ren,Rui Tang,Yi Zhang,Tianwei Zhang,Hongwei Li
Categories:Computation and Language (cs.CL)
Keywords:Large language models, creating exploitable gaps, safety training remains, training remains concentrated, global multilingual users
Comments:
Click to view abstract
Abstract:Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating exploitable gaps for jailbreak attacks. Current jailbreak defenses are largely developed and evaluated in dominant languages, and their effectiveness is limited by the scarcity of aligned multilingual supervision and the representation dispersion caused by language variation. To address this issue, we propose MLJailDe, a multilingual jailbreak detection framework designed to improve both multilingual robustness and cross-lingual generalization. MLJailDe first introduces a multilingual back-translation data augmentation algorithm to construct a semantically consistent and functionally effective dataset spanning 11 languages, consisting of 2,232 benign and 1,239 jailbreak samples. On this basis, MLJailDe employs relative-distance constraints to reduce cross-lingual representation dispersion and encourage jailbreak prompts with similar intent to form consistent clusters across languages, while an imbalance-aware classification objective is further used to alleviate class imbalance and learn more reliable multilingual decision boundaries. Experimental results show that MLJailDe outperforms state-of-the-art baselines across multiple languages, achieving an F1 score of 98.5%, and obtains an average F1 score of 97.1% on unseen languages, demonstrating strong effectiveness and cross-lingual generalization.
115. 【2606.11201】To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending
Link:https://arxiv.org/abs/2606.11201
Authors:Jin Gan,Xin Li,Jun Luo
Categories:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:make newly trained, newly trained models, trained models safely, user instructions, wide deployment
Comments: Accepted by ACL 2026
Click to view abstract
Abstract:The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidances) only during output generation. Existing proposals apply guidances extracted from certain aligned models without properly assessing their reliability. Nonetheless, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidances lead to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing works, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent and up to 50% performance improvement on challenging model pairs. Our code is available at: this https URL.
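The shift from a binary intervene-or-not decision to a hybrid distribution can be illustrated with a simple linear mixture of two next-token distributions. This is a generic sketch, not BlendIn itself: the function name, the dictionary representation, and the use of a single scalar `guide_reliability` as the mixing weight are all assumptions, and the abstract does not specify how reliability is estimated.

```python
def blend(base_probs, guide_probs, guide_reliability):
    """Mix two next-token distributions, weighting the guide by its reliability.

    base_probs and guide_probs map tokens to probabilities. guide_reliability
    is a hypothetical score clamped to [0, 1]: 0 ignores the guide entirely,
    1 follows it entirely.
    """
    w = min(max(guide_reliability, 0.0), 1.0)
    tokens = set(base_probs) | set(guide_probs)
    mixed = {
        t: (1 - w) * base_probs.get(t, 0.0) + w * guide_probs.get(t, 0.0)
        for t in tokens
    }
    total = sum(mixed.values())
    # Renormalise in case the inputs did not sum exactly to one.
    return {t: p / total for t, p in mixed.items()}
```

A binary scheme corresponds to restricting the weight to 0 or 1; allowing intermediate weights is what lets unreliable guidance be downweighted instead of either fully applied or fully discarded.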
116. 【2606.11200】Detecting AI-Generated Content on Social Media with Multi-modal Language Models
Link:https://arxiv.org/abs/2606.11200
Authors:Chenyang Yang,Shen Yan,Yibo Yang,Litao Hu,Yuchen Liu,Yuan Zeng,Hanchao Yu,Yinan Zhu,Sumedha Singla,Brian Vanover,Huijun Qian,Zihao Wang,Fujun Liu,Aashu Singh,Jianyu Wang,Xuewen Zhang
Categories:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords:social media, enabled the creation, creation of photorealistic, photorealistic images, images and videos
Comments:
Click to view abstract
Abstract:Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.
117. 【2606.11199】NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
Link:https://arxiv.org/abs/2606.11199
Authors:Quentin Fever,Naziha Aslam
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords:multi-agent retrieval-augmented generation, structured multi-agent retrieval-augmented, awarded Best Dynamic, retrieval-augmented generation, Dynamic Evaluation
Comments: 5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025
Click to view abstract
Abstract:We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.
118. 【2606.11198】The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content
Link:https://arxiv.org/abs/2606.11198
Authors:Yuqi Zhang,Di Zhang
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:improve LLM outputs, systems inject external, LLM outputs, inject external knowledge, improve LLM
Comments: 10 pages, 5 figures
Click to view abstract
Abstract:Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}$(KG) $\approx$ 0.70 vs. $\hat{o}$(neutral) $\approx$ 0.25), compressing demonstration attention by up to 42% -- regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet's 25-27%, a 30 pp gap that dwarfs all gating strategies ($\leq$2 pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.
119. 【2606.11196】PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference
Link:https://arxiv.org/abs/2606.11196
Authors:Arther Tian,Alex Ding,Frank Chen,Simon Wu,Aaron Chan
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords:Decentralized LLM inference, LLM inference networks, Decentralized LLM, LLM inference, networks need lightweight
Comments:
Click to view abstract
Abstract:Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
120. 【2606.07537】From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data
Link:https://arxiv.org/abs/2606.07537
Authors:Md. Rejaul Korim Sadi,Toufiqur Rahman Tasin,Golam Mostofa Naeem
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:Large language models, language models hallucinate, Large language, producing fluent, models hallucinate
Comments: 11 pages, 7 figures, 15 references
Click to view abstract
Abstract:Large language models hallucinate--producing fluent, confident, factually wrong outputs--with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention's co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding's permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies--long-tail deficiencies, training bias, and synthetic pollution--amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.
121. 【2606.12199】Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
Link: https://arxiv.org/abs/2606.12199
Authors: Zhen Ye,Xu Tan,Yiming Li,Guangyan Zhang,Chimin Chan,Haohe Liu,Zhengxi Liu,Hongzhan Lin,Zheqi Dai,Xinshen Zhang,Peiwen Sun,Qiuqiang Kong,Wei Xue
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: Spoken dialogue models, dialogue models typically, models typically start, Spoken dialogue, text LLM backbones
Comments: Accepted by Interspeech 2026 long paper
Abstract:Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
122. 【2606.11766】Fast Speech Foundation Model Distillation Using Interleaved Stacking
Link: https://arxiv.org/abs/2606.11766
Authors: Eungbeom Kim,Kyogu Lee
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: large speech foundation, Distilling a large, speech foundation model, low-resource environments, efficient student model
Comments: Accepted by Interspeech 2026
Abstract:Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires an additional student model training. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.
123. 【2606.11429】Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains
Link: https://arxiv.org/abs/2606.11429
Authors: Zilai Wang,Natarajan Balaji Shankar,Mohan Shi,Kaiyuan Zhang,Abeer Alwan
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: Speech foundation models, foundation models, models often struggle, low-resource domains due, automates Whisper encoder
Comments: Accepted by Interspeech 2026
Abstract:Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
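The hard Gumbel-Softmax selection step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gumbel_softmax_select`, the temperature `tau`, and the example logits are assumptions made for illustration, and only the forward pass is shown (during training, a straight-through estimator would typically pass gradients through the soft probabilities).

```python
import math
import random


def gumbel_softmax_select(logits, tau=1.0, rng=random):
    """Sample a one-hot choice over layers with the hard Gumbel-Softmax trick.

    Returns (index, hard one-hot vector, soft probabilities). Forward pass only.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    # "Hard" variant: keep only the arg-max entry as a one-hot vector.
    idx = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == idx else 0.0 for i in range(len(soft))]
    return idx, hard, soft


# Hypothetical learnable scores, one per encoder layer.
layer_logits = [0.1, 0.3, 2.0, 0.5]
idx, hard, soft = gumbel_softmax_select(layer_logits, tau=0.5, rng=random.Random(0))
print(idx, hard)
```

A lower `tau` makes the soft probabilities closer to the one-hot choice, while a higher `tau` spreads probability mass across layers.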
124. 【2606.11279】Massive Open-Vocabulary Keyword Spotting
Link: https://arxiv.org/abs/2606.11279
Authors: Leonor Barreiros,Raul Monteiro,Afonso Mendes,Gonçalo M. Correia
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Keywords: transcribing words rarely, Automatic speech recognition, specialized terminology, Automatic speech, transcribing words
Comments: Accepted to Interspeech 2026
Abstract:Automatic speech recognition systems have been shown to under-perform when it comes to transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves a comparable entity recall as uncompressed solutions, even in languages not seen during training.
125. 【2606.11197】MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation
Link: https://arxiv.org/abs/2606.11197
Authors: Xuzhi Wang,Xinran Wu,Ziping Zhao,Jianhua Tao,Björn W. Schuller
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: Speech-based automatic estimation, enabling early detection, mental health settings, Speech-based automatic, resource-constrained mental health
Comments: Accepted at IEEE TAC
Abstract:Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.
Information Retrieval
1. 【2606.12400】Doc-to-Atom: Learning to Compile and Compose Memory Atoms
Link: https://arxiv.org/abs/2606.12400
Authors: Xingjian Diao,Wenbo Li,Yashas Malur Saidutta,Avinash Amballa,Lazar Valkov,Srinivas Chappidi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Long input sequences, Large Language, Long input, attention makes inference
Comments: 20 pages
Abstract:Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.
2. 【2606.12295】Findings of the MAGMaR 2026 Shared Task
Link: https://arxiv.org/abs/2606.12295
Authors: Alexander Martin,Dengjia Zhang,Joel Brogan,Francis Ferraro,Jeremy Gwinnup,Reno Kriz,Teng Long,Kenton Murray,Andrew Yates,Xiang Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Multimodal Augmented Generation, Multimodal Augmented, overview paper presents, Multimodal Retrieval, Augmented Generation
Comments: Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources: https://github.com/rekriz11/MAGMAR_2026
Abstract:This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.
3. 【2606.12246】Efficient and Robust Online Learning to Rank in Decentralized Systems
Link: https://arxiv.org/abs/2606.12246
Authors: Marcel Gregoriadis,Martijn de Vos,Sayan Biswas,Anne-Marie Kermarrec,Johan Pouwelse
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Keywords: existing systems rely, Online Learning, trusted central server, Learning to Rank, existing systems
Comments:
Abstract:In Online Learning to Rank (OLTR), ranking models are trained directly from live user interactions, but existing systems rely on a trusted central server to collect and process these interactions. This leaves operators free to introduce biases that conflict with user interests. Decentralized learning offers an attractive alternative, allowing users to collaboratively train a shared ranking model by exchanging model updates directly with one another, without any central authority. In such settings, however, malicious nodes can send poisoned model updates that degrade the ranking quality of honest nodes. We introduce RankGuard, a decentralized OLTR framework in which users collaboratively train ranking models and exchange model updates directly with other nodes. RankGuard defends against poisoning attacks by carefully evaluating incoming models against the user's own private click history, corrected for position bias. An incoming model is only aggregated if it better explains the user's past interactions than the current local model, making it fundamentally hard for malicious nodes to craft updates that pass this test without also genuinely helping the user. We derive a theoretical convergence guarantee of RankGuard. To the best of our knowledge, this is the first formal convergence analysis of a decentralized OLTR algorithm. We evaluate RankGuard against four poisoning attacks, including a powerful adaptive attack, using four standard benchmarks and three click models. RankGuard outperforms all baselines in most settings while being up to 62x more efficient than its closest competitors.
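The acceptance rule described in the abstract above (aggregate an incoming model only if it explains the user's own click history better than the local model) can be sketched as follows. This is a simplified illustration, not RankGuard itself: the linear scorer, the position-based click model, the `1 / (1 + rank)` examination propensity, and the plain averaging step are all assumptions chosen to keep the example small.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def click_log_likelihood(weights, click_log, propensity):
    """Log-likelihood of logged clicks under a position-based click model.

    Each log entry is (features, rank, clicked); rank is 0-based.
    """
    total = 0.0
    for features, rank, clicked in click_log:
        relevance = sigmoid(sum(w * f for w, f in zip(weights, features)))
        # Position-bias correction: a click needs examination AND attraction.
        p_click = propensity(rank) * relevance
        p_click = min(max(p_click, 1e-9), 1.0 - 1e-9)
        total += math.log(p_click) if clicked else math.log(1.0 - p_click)
    return total


def maybe_aggregate(local, incoming, click_log,
                    propensity=lambda rank: 1.0 / (1 + rank)):
    """Average the incoming model into the local one only if it fits the clicks better."""
    incoming_fit = click_log_likelihood(incoming, click_log, propensity)
    local_fit = click_log_likelihood(local, click_log, propensity)
    if incoming_fit > local_fit:
        return [(a + b) / 2.0 for a, b in zip(local, incoming)], True
    return list(local), False


# Hypothetical private click history: (features, rank, clicked).
history = [([1.0], 0, True), ([1.0], 1, True), ([-1.0], 0, False), ([-1.0], 2, False)]
print(maybe_aggregate([0.0], [2.0], history))   # helpful update
print(maybe_aggregate([0.0], [-2.0], history))  # poisoned update
```

Because the test uses only data the receiving user already holds, a sender cannot pass it without improving the fit on that user's own interactions, which is the intuition the abstract gives for robustness.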
4. 【2606.12245】DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation
Link: https://arxiv.org/abs/2606.12245
Authors: Kangning Zhang,Yingjie Qin,Weinan Zhang,Yong Yu,Jianghao Lin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Cold-start item recommendation, real-world systems due, item recommendation remains, Cold-start item
Comments: Accepted by ECML-PKDD 2026
Abstract:Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex "behavioral manifold" shaped by rich interaction signals, whereas cold item embeddings are constrained to a "semantic manifold" derived solely from auxiliary content. Existing methods often force a rigid mapping between these inconsistent spaces, causing the model to sacrifice the precision of warm representations to accommodate cold ones. To address this, we propose DiffCold, a diffusion-based generative model that unifies warm and cold representations. Unlike GANs or VAEs, DiffCold leverages conditional diffusion to reconstruct warm item embeddings from content, preserving the underlying manifold structure without degradation. We further tailor this paradigm with two specific designs: a Retrieval-enhanced Aggregator that initializes generation using semantically similar warm items to bypass inefficient noise, and a Simulation-based Representation Alignment module that enforces distribution consistency between generated and real embeddings via contrastive learning. Experiments on three benchmarks confirm that DiffCold resolves the seesaw dilemma, consistently outperforming state-of-the-art methods across all metrics.
5. 【2606.12215】MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching
Link: https://arxiv.org/abs/2606.12215
Authors: David Yuchen Wang,Haoying Li,Hailun Xu,Wei Chee Yew,Zirui Zhu,Sanjay Saha,Hao Hei,Kanchan Sarkar,Kun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: user-generated video content, numerous near-duplicate videos, partial edits, explosive growth, growth of user-generated
Comments: Accepted by KDD-2026 ADS track
Abstract:The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
6. 【2606.12198】LLM-Based User Personas for Recommendations at Scale
Link: https://arxiv.org/abs/2606.12198
Authors: Haoting Wang,Haokai Lu,Zheyun Feng,Jenny Huang,Yifat Amir,Gregory Hinkson,Ben Most,Zelong Zhao,Yixin Kelly Cui,Rein Zhang,Fabio Soldo,Yu Xia,Nihar Bhupalam,Minmin Chen,Konstantina Christakopoulou,Lichan Hong,Ed H. Chi
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Language Models, offer unprecedented potential, enhancing recommendation systems
Comments:
Abstract:Large Language Models (LLMs) offer unprecedented potential for enhancing recommendation systems through their world knowledge and reasoning capabilities. However, existing approaches often rely on structured IDs or offline processing, limiting semantic richness, real-time adaptability, and user-facing interpretability. In this paper, we introduce a novel framework that enables real-time generation of LLM-based user interest personas for a large-scale commercial video recommendation platform. Our method generates natural-language user interest personas that address the exploitation-exploration trade-off by combining the summarization of existing interests with novel topics, directly during serving. To overcome the computational challenges of online LLM inference at a billion-user scale, we design a cost-efficient architecture leveraging knowledge distillation, asynchronous inference, and input optimization via semantically clustered video representations. Extensive offline evaluations, user studies, and live A/B tests demonstrate significant improvements in viewer value. This work bridges the gap between high-level semantic understanding and industrial-scale recommendation, paving the way for more dynamic, explainable, and satisfying personalized experiences.
7. 【2606.11945】uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking
Link: https://arxiv.org/abs/2606.11945
Authors: Simon Lupart,Kidist Amde Mekonnen,Zahra Abbasiantaeb,Mohammad Aliannejadi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: question answering, report describes, describes our participation, task evaluates conversational, retrieval
Comments: SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables
Abstract:This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.
8. 【2606.11907】Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2606.11907
Authors: Ziyu Song,Jiaming Fang,Kuangyu Li,Tuo Xia,Chuanpeng Wang
Categories: Information Retrieval (cs.IR)
Keywords: Adaptive context selection, heavy-tailed similarity distributions, Top-K retrieval fails, fixed Top-K retrieval, retrieval-augmented generation
Comments: First two authors contributed equally. Accepted at ECML PKDD 2026
Abstract:Adaptive context selection is critical for retrieval-augmented generation (RAG) systems, as fixed Top-K retrieval fails under query-dependent and heavy-tailed similarity distributions. While Extreme Value Theory (EVT) offers a principled framework for adaptive truncation, existing approaches apply EVT globally across the entire ranked list, incurring prohibitive computational costs and statistical instability. We propose Tail-Aware Adaptive-k (TAA-k), a training-free framework that operationalizes EVT through a localized validation strategy. The key insight is that ranked similarity curves exhibit a characteristic steep-flat-steep pattern reflecting a transition from relevance-dominated to noise-dominated regimes. TAA-k exploits this geometric structure via knee detection to identify a compact candidate region, then applies EVT-based goodness-of-fit testing within this window to validate the onset of tail behavior. This coarse-to-fine design reduces computational complexity from $O(N^2 M)$ to $O(\sqrt{N \log N} \cdot M)$ while maintaining statistical rigor. Under mild monotone likelihood ratio assumptions, TAA-k yields a stable, query-adaptive cutoff corresponding to the earliest noise-dominated position. Experiments on WebQuestions, 2WikiMultiHopQA, and MuSiQue demonstrate that TAA-k achieves near-oracle retrieval quality (F1 within 2-3% of oracle) with orders-of-magnitude efficiency gains over global EVT methods, while maintaining robustness across embedding models and compression dimensions.
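The knee-detection stage mentioned in the abstract above can be illustrated with a generic sketch. It covers only the coarse step (finding where the ranked similarity curve bends) using a standard maximum-distance-to-chord heuristic; the EVT goodness-of-fit validation is omitted, and the function name `knee_cutoff` and the example scores are assumptions rather than anything taken from the paper.

```python
def knee_cutoff(scores):
    """Return a query-adaptive k: how many top-ranked items to keep.

    The knee is the point of the normalised, descending score curve that lies
    furthest below the straight chord joining its first and last points.
    """
    n = len(scores)
    if n < 3:
        return n
    ranked = sorted(scores, reverse=True)
    span = ranked[0] - ranked[-1]
    if span == 0:
        return n  # all scores equal: no knee to find, keep everything

    best_idx, best_gap = 0, -1.0
    for i, score in enumerate(ranked):
        x = i / (n - 1)                   # rank position scaled to [0, 1]
        y = (score - ranked[-1]) / span   # similarity scaled to [0, 1]
        gap = (1.0 - x) - y               # chord is y = 1 - x
        if gap > best_gap:
            best_idx, best_gap = i, gap

    # Items ranked before the knee are kept; always keep at least one.
    return max(1, best_idx)


# Hypothetical similarity scores for one query: three strong hits, then noise.
similarities = [0.90, 0.85, 0.80, 0.30, 0.28, 0.27, 0.26, 0.25]
k = knee_cutoff(similarities)
print(k, sorted(similarities, reverse=True)[:k])
```

A fixed Top-K would return the same number of passages for every query, whereas this cutoff moves with the shape of each query's score curve.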
9. 【2606.11864】CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding
Link: https://arxiv.org/abs/2606.11864
Authors: Fuwei Zhang,Yanzhao Zhang,Mingxin Li,Dingkun Long,Lexiang Hu,Pengjun Xie,Zhao Zhang,Fuzhen Zhuang
Categories: Information Retrieval (cs.IR)
Keywords: isolated snippet, Code retrieval, natural-language query, agentic coding requires, Code
Comments:
Abstract:Code retrieval is becoming central to coding agents, but agentic coding requires more than matching a natural-language query to an isolated snippet. Given a user request, a coding agent needs to navigate a concrete repository state, locate relevant files and functions, gather supporting context, and filter similar in-repository distractors. Existing code retrieval benchmarks mainly evaluate docstring-to-function or snippet-level matching, thereby missing this requirement-driven repository search problem. To address this gap, we introduce CORE-Bench, a comprehensive benchmark for code retrieval in the era of agentic coding. CORE-Bench evaluates code retrieval ability at three levels: code understanding, issue-to-edit localization, and broader context retrieval. Built from curated code-search tasks and SWE-bench-series instances, CORE-Bench contains over 180K queries and 106K broader-context relevance labels. Experiments with representative embedding models show a sharp drop from traditional code search to code retrieval in agentic coding settings. Simple supervised fine-tuning of existing embedding models significantly improves performance in this setting, suggesting substantial room for further progress.
10. 【2606.11780】What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study
Link: https://arxiv.org/abs/2606.11780
Authors: Koki Okajima,Tsukasa Yoshida
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Keywords: establish conditions, dimensional vectors, query vector, subset
Comments: 9 pages, 2 figures
Abstract:We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, independently of $N$. We theoretically prove that this corpus-independent bound is specific to infinite precision. With $B$ bits per coordinate, perfect top-$k$ retrieval requires $Bd = \Omega(k \ln N)$; thus, at any fixed precision, the dimension must grow at least logarithmically with $N$. Specializing to a $\ell_2$-normalized $B$-bit uniform scalar quantization model, we also identify a threshold on the precision $B^{*} = O(\ln \ln N)$ below which no dimension suffices, together with two further regimes that bound the feasible $(B, d)$ pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.
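To get a feel for the scaling $Bd = \Omega(k \ln N)$ stated in the abstract above, the short sketch below evaluates $d \ge c \, k \ln N / B$ for a few corpus sizes. The hidden constant is not given in the abstract, so `c = 1` is purely an assumption; the numbers show the growth trend, not the paper's actual thresholds.

```python
import math


def min_dimension(num_docs, k, bits_per_coord, c=1.0):
    """Smallest integer d with B * d >= c * k * ln(N).

    c stands in for the unspecified constant inside the Omega bound.
    """
    return math.ceil(c * k * math.log(num_docs) / bits_per_coord)


# Illustrative setting: top-10 retrieval with 8-bit coordinates.
for num_docs in (10**6, 10**9, 10**12):
    print(num_docs, min_dimension(num_docs, k=10, bits_per_coord=8))
```

At fixed precision the required dimension grows with the logarithm of the corpus size, which is the qualitative point the abstract makes about quantized vector databases.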
11. 【2606.11749】FAST-MEL: A Fast, Accurate, and Storage Efficient Solution for Multimodal Entity Linking
Link: https://arxiv.org/abs/2606.11749
Authors: Derrien Thomas,Laurent Amsaleg,Pascale Sébillot
Categories: Information Retrieval (cs.IR)
Keywords: Multimodal entity linking, entities in unstructured, Multimodal entity, knowledge base, task that consists
Comments:
Abstract:Multimodal entity linking (MEL) is the task that consists of matching textual and visual mentions of entities in unstructured data to their corresponding entities in a knowledge base (KB). To be effective in large-scale practical settings, MEL systems must meet three objectives: high linking accuracy, computational efficiency, and storage efficiency, i.e., a compact yet efficient index of the KB. In this paper, we highlight that state-of-the-art systems fail to simultaneously satisfy these 3 requirements. To meet this three-fold objective, we propose FAST-MEL, a lightweight encoder-based MEL solution that relies on a novel and compact fixed-size vectorized representation of both the textual and visual information of each entity or mention. It matches the accuracy of the best systems but performs three orders of magnitude faster. It also consumes one order of magnitude less storage than the fastest systems.
12. 【2606.11700】CompRank: Efficient LLM Reranking via Token-Level Compression and Decoding-Free Scoring
Link: https://arxiv.org/abs/2606.11700
Authors: Xuan Lu,Haohang Huang,Yingqi Fan,Junlong Tong,Yuxuan Zhang,Ping Nie,Rui Meng,Xiaoyu Shen
Categories: Information Retrieval (cs.IR)
Keywords: Large language model, retrieval-augmented generation pipelines, Large language, computational cost limits, high computational cost
Comments:
Abstract:Large language model (LLM) rerankers have become an important component of modern retrieval and retrieval-augmented generation pipelines, but their high computational cost limits their applicability to long candidate lists. In this paper, we propose CompRank, a token-efficient reranking framework that reduces redundant computation by aligning reranker design with the sparsity of ranking signals. CompRank decouples document representations from candidate order and query context, enabling reusable document-side states; applies segment-wise token compression to reduce query-document interaction cost; and introduces a CopyNet-style objective that directly aligns attention-based document scoring with training supervision. Experiments on seven BEIR datasets show that CompRank achieves strong reranking performance while retaining only 10.2% of document tokens, reaching an average NDCG@10 of 39.2 compared with 39.7 under full-token attention. Further scaling experiments on TREC-COVID show that CompRank remains stable when evaluated on candidate lists of up to 500 documents after training on 30-document lists, while achieving 4.9x to 9.5x end-to-end speedup over generation-based listwise reranking and approximately 1.3x speedup over the full-token CompRank variant. These results suggest that token-level compression and decoding-free attention scoring provide an effective path toward scalable LLM-based reranking.
13. 【2606.11654】The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience
Link: https://arxiv.org/abs/2606.11654
Authors: Kazuki Nakayashiki,Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Keywords: social highlighter, lead, aggregate crowd salience, edge, model beats lead
Comments: 10 pages, 3 figures, 4 tables
Abstract:A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
14. 【2606.11616】DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors
Link: https://arxiv.org/abs/2606.11616
Authors: Jiale Deng,Yanyan Shen,Xiaogang Shi,Chai Junjun
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: High-quality training data, High-quality training, data, success of machine, High-quality
Comments:
Abstract:High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: this https URL.
15. 【2606.11613】Factions Within, Uncertain Across: Within-Document Reader Sub-Groups in Social Highlighting
Link: https://arxiv.org/abs/2606.11613
Authors: Kazuki Nakayashiki, Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Keywords: single consensus, people highlight, internally structured, stable reader trait, document
Comments: 11 pages, 3 figures, 3 tables
Abstract:When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different things -- and is that structure a stable property of a reader or of the document? Building on prior work showing an individual's within-document highlighting signal is a whisper while individuality lives in selection, we ask the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups -- pairs agree far beyond what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document accounts for about 40% of this excess; the majority survives as finer reader-specific agreement (z=+3.6, 77% significant). So the within-document crowd is, in a descriptive sense, factional. Experiment 2: is that grouping a stable reader trait? Here we are honest about power. The cross-document split-half reproducibility of a pair's agreement is near zero pooled (+0.078 and 0.000 in two separately drawn samples), and a power calibration shows the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k=4), point estimates are positive but small-sample, imprecise across the separately drawn samples, never significant, and attenuate under the region-preserving null. We therefore leave cross-document stability unresolved: the data is consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether its factions follow the reader across documents is, honestly, beyond our reach.
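A margin-preserving null of the kind mentioned above can be sketched with the generic curveball trade: two readers swap the sentences that only one of them marked, so every reader keeps the same number of marks and every sentence keeps the same number of markers. This is a sketch of the general technique only; the paper's exact procedure, including its eight-block region-preserving variant, is not described in the abstract, and the data below is invented.

```python
import random
from collections import Counter


def curveball(rows, n_trades, seed=0):
    """Shuffle a binary reader-by-sentence matrix while preserving both margins.

    rows[r] is the set of sentence ids that reader r highlighted.
    """
    rng = random.Random(seed)
    rows = [set(r) for r in rows]  # copy, so the input is left untouched
    for _ in range(n_trades):
        a, b = rng.sample(range(len(rows)), 2)
        only_a = rows[a] - rows[b]
        only_b = rows[b] - rows[a]
        shared = rows[a] & rows[b]
        # Pool the marks unique to either reader and deal them back out,
        # giving each reader as many as they contributed.
        pool = sorted(only_a | only_b)
        rng.shuffle(pool)
        rows[a] = shared | set(pool[:len(only_a)])
        rows[b] = shared | set(pool[len(only_a):])
    return rows


# Hypothetical highlights from four readers over five sentences.
observed = [{0, 1, 2}, {2, 3}, {0, 4}, {1, 3, 4}]
shuffled = curveball(observed, n_trades=200, seed=1)

# Both margins survive the shuffle.
print([len(r) for r in shuffled])
print(Counter(s for r in shuffled for s in r))
```

Pairwise agreement measured on many such shuffles gives a reference distribution against which the observed agreement can be expressed as a z-score.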
16. 【2606.11361】A PubMed-Scale Dataset of Structured Biomedical Abstracts
Link: https://arxiv.org/abs/2606.11361
Authors: Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: biomedical literature processing, facilitating information retrieval, text mining, literature processing, knowledge synthesis
Comments: Data and code for this work are available at [this https URL](https://doi.org/10.5281/zenodo.20336717) and [this https URL](https://github.com/BIDS-Xu-Lab/StructuredPubMed), respectively
Abstract:Structured abstracts are important for biomedical literature processing, by facilitating information retrieval, text mining, and knowledge synthesis. However, a vast portion of abstracts indexed in PubMed remain unstructured, presenting a significant bottleneck for downstream text-processing workflows and applications. To resolve this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two distinct subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. This dataset can be utilized to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at an unprecedented PubMed-wide scale.
17. 【2606.11350】When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval
Link: https://arxiv.org/abs/2606.11350
Authors: Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Retrieval-augmented generation degrades, loses discriminative power, similarity loses discriminative, increasingly returns semantically, returns semantically similar
Comments: 24 pages, 8 figures, 30 tables. Preprint under review
Abstract:Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. We observed it firsthand, even with hybrid dense+sparse retrieval, in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 ($p < 0.05$). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence in its results, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.
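The "scope first" recommendation can be illustrated with a toy retriever that filters chunks by a metadata field before ranking by cosine similarity. Everything here is an assumption made for illustration: the `domain` field, the two-dimensional vectors, and the function names are hypothetical and do not come from MASDR-RAG.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, k, domain=None):
    """Return the top-k chunks by cosine similarity.

    If `domain` is given, candidates are restricted to that domain first.
    """
    pool = [c for c in chunks if domain is None or c["domain"] == domain]
    return sorted(pool, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)[:k]


def precision_at_k(retrieved, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c["id"] in relevant_ids) / k


# Hypothetical corpus: the payroll chunks sit closer to the query in embedding
# space, but only the bridge chunks are contextually correct.
chunks = [
    {"id": "b1", "domain": "bridges", "vec": [0.9, 0.1]},
    {"id": "b2", "domain": "bridges", "vec": [0.8, 0.3]},
    {"id": "p1", "domain": "payroll", "vec": [1.0, 0.0]},
    {"id": "p2", "domain": "payroll", "vec": [0.95, 0.05]},
]
query = [1.0, 0.0]
relevant = {"b1", "b2"}

unscoped = retrieve(query, chunks, k=2)
scoped = retrieve(query, chunks, k=2, domain="bridges")
print(precision_at_k(unscoped, relevant, 2), precision_at_k(scoped, relevant, 2))
```

In this toy case the unscoped search returns only the near-duplicate distractors, while scoping to the right domain recovers both relevant chunks.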
18. 【2606.11204】Benchmarking Large Language Models for Safety Data Extraction
Link: https://arxiv.org/abs/2606.11204
Authors: Jonas Grill, Thomas Bayer, Sören Berlinger
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Safety Data Sheets, traditional rule-based methods, heterogeneous document formats, industrial safety due, SDS data extraction
Comments: 18 pages, 8 figures, submitted to Applied Intelligence
Abstract:Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.
19. 【2606.11199】NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
Link: https://arxiv.org/abs/2606.11199
Authors: Quentin Fever, Naziha Aslam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: multi-agent retrieval-augmented generation, structured multi-agent retrieval-augmented, awarded Best Dynamic, retrieval-augmented generation, Dynamic Evaluation
Comments: 5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025
Abstract:We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.
Computer Vision
1. 【2606.12412】Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Link: https://arxiv.org/abs/2606.12412
Authors: Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Vision-language models, making decoder inference, decoder inference expensive, project images, images into hundreds
Comments: Code: [this https URL](https://github.com/elmma/mllm-reroute/)
Abstract:Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: this https URL
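The difference between rank-and-remove and recoverable routing can be shown with a toy bookkeeping sketch. It tracks only which token ids are processed at each stage; the per-stage scores and budgets are invented, and the function names are hypothetical. It is not the Reroute implementation, which operates on attention scores inside the decoder.

```python
def rank_and_remove(scores_per_stage, keep):
    """Pruning: tokens dropped at one stage never come back."""
    pool = list(range(len(scores_per_stage[0])))
    processed = []
    for scores, k in zip(scores_per_stage, keep):
        pool = sorted(pool, key=lambda t: scores[t], reverse=True)[:k]
        processed.append(sorted(pool))
    return processed


def reroute(scores_per_stage, keep):
    """Routing: deferred tokens bypass the stage but stay in the candidate pool."""
    pool = list(range(len(scores_per_stage[0])))
    processed = []
    for scores, k in zip(scores_per_stage, keep):
        selected = sorted(pool, key=lambda t: scores[t], reverse=True)[:k]
        processed.append(sorted(selected))
        # `pool` is left unchanged, so every token is a candidate again next stage.
    return processed


# Hypothetical importance scores for four visual tokens at two stages.
# Token 2 looks unimportant early and becomes the most important one later.
scores_per_stage = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.7, 0.9, 0.2],
]
keep = [2, 1]  # tokens processed per stage, the same budget for both schemes

print(rank_and_remove(scores_per_stage, keep))
print(reroute(scores_per_stage, keep))
```

Both schemes process the same number of tokens per stage, but only the routing variant can pick up token 2 once its score rises.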
2. 【2606.12407】How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology
Link: https://arxiv.org/abs/2606.12407
Authors: Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: whole-slide images, LLM baselines routinely, General-purpose large language, evaluating specialized pathology, patch size
Abstract:General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
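The independent-patch baseline described above aggregates per-patch answers by majority voting. A minimal sketch of that aggregation step follows; the patch predictions are invented and the labels are only illustrative.

```python
from collections import Counter


def majority_vote(patch_predictions):
    """Return the most frequent label across independently processed patches.

    Ties go to the label that was seen first, which is how Counter.most_common
    orders elements with equal counts.
    """
    label, _count = Counter(patch_predictions).most_common(1)[0]
    return label


# Hypothetical per-patch answers for one whole-slide image.
patch_predictions = ["LUAD", "LUSC", "LUAD", "LUAD", "LUSC"]
slide_prediction = majority_vote(patch_predictions)
print(slide_prediction)
```

The joint configuration studied in the paper instead passes the patches to the model together, so no voting step is needed.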
3. 【2606.12402】DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
Link: https://arxiv.org/abs/2606.12402
Authors: Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: test-time compute, high-level planners, emerging strategy, embodied agents, scaling test-time compute
Abstract:Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success--cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.
4. 【2606.12396】VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving
Link: https://arxiv.org/abs/2606.12396
Authors: Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: describe scenes, scenes and reason, struggle to ground, VLGA, VLA
Comments: Project page: [this https URL](https://yaojin17.github.io/VLGA/)
Abstract:Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments on the challenging nuScenes and Bench2Drive datasets, for open-loop and closed-loop evaluation respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
5. 【2606.12378】Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots
Link: https://arxiv.org/abs/2606.12378
Authors: Zhi Wei Xu, Torbjörn E. M. Nordling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Physiological awareness, important for service, everyday environments, awareness is important, assistive robots
Comments: 8 pages, 4 figures
Abstract:Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\beta$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\beta=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6%, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.
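One plausible reading of the training objective described above is a weighted sum of the two loss terms. The additive form, the subscripts, and the direction of the KL term are assumptions made for illustration; the abstract only says that $\beta$ controls the contribution of the frequency-domain term.

```latex
% Assumed form: waveform loss plus a beta-weighted spectral term.
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{SSP}} \;+\; \beta\,\mathcal{L}_{\mathrm{KL}},
\qquad
\mathcal{L}_{\mathrm{KL}} \;=\; \sum_{f} p(f)\,\log\frac{p(f)}{q(f)}
```

Here $\mathcal{L}_{\mathrm{SSP}}$ stands for the Soft-Shifted Pearson waveform loss, $p(f)$ for a target distribution over heart-rate frequencies, and $q(f)$ for the normalized spectrum of the predicted signal. Under this reading, the reported best setting corresponds to $\beta = 5$.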
6. 【2606.12374】Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration
Link: https://arxiv.org/abs/2606.12374
Authors: Sadman Sakib Enan, Junaed Sattar
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: expanding human-led operations, collaboration is essential, essential for expanding, expanding human-led, human-led operations
Abstract:Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.
7. 【2606.12371】A Turbo-Inference Strategy for Object Detection and Instance Segmentation
Link: https://arxiv.org/abs/2606.12371
Authors: Zhen Zhao, Gang Zhang, Xiaolin Hu, Liang Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: instance segmentation, closely related, Object detection, segmentation, instance segmentation tasks
Comments: Preprint version of an article published in Computer Vision and Image Understanding
Abstract:Object detection and instance segmentation tasks are closely related. Existing top-down instance segmentation methods usually follow a detect-then-segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo-inference strategy for top-down methods that iteratively leverages the complementary information between the detection and segmentation tasks. Specifically, we design two modules, a turbo-detection head and a turbo-segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracy at some additional computational cost. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at this https URL.
8. 【2606.12368】DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images
Link: https://arxiv.org/abs/2606.12368
Authors: Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved significant progress, achieving generalized metric, monocular depth estimation, metric depth estimation, generalized metric depth
Abstract:While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.
9. 【2606.12346】Atlas HE-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy
Link: https://arxiv.org/abs/2606.12346
Authors: Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Hematoxylin and eosin, Atlas HE-TME, cornerstone of histopathology, whole-slide images, remains a central
Abstract:Hematoxylin and eosin (HE) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of HE whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas HE-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to HE-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional HE-only annotation. This yields a molecularly grounded reference against which we compare Atlas HE-TME and pathologists working from HE alone. For breadth, we benchmark Atlas HE-TME on over 200,000 high-confidence HE-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering 90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas HE-TME matches or exceeds pathologist HE-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas HE-TME turns the HE slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.
10. 【2606.12340】Echoes of the Prior: A Computational Phenomenology of Forgetting
Link: https://arxiv.org/abs/2606.12340
Authors: Gege Gao, Bernhard Schölkopf, Andreas Geiger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: storage of data, scaffolding of reality, biological memory fades, memory fades, simply turn black
Abstract:Memory is not merely the storage of data; it is the scaffolding of reality. When biological memory fades, the world does not simply turn black; it regresses into an unrecognizable chaos. Echoes of the Prior is an interactive installation that attempts to visualize this subjective phenomenology of forgetting. By inducing controlled synaptic decay within a Feed-Forward 3D Reconstruction model, we create an artistic analogy for the erosion of the brain's predictive priors. We position the Neural Network not as a tool for engineering, but as a cognitive proxy - a silicon brain whose structural degeneration evokes the disorienting, poetic, and terrifying experience of losing one's grip on the world. Ultimately, we offer this framework as a catalyst, inviting the wider community to explore the uncharted potential of neuromorphic aesthetics in visualizing the fragility of intelligence. For the interactive demo, see this https URL.
11. 【2606.12319】Anatomically Conditioned Recurrent Refinement for Topology-Aware Circle of Willis Segmentation
Link: https://arxiv.org/abs/2606.12319
Authors: Juraj Perić, Marija Habijan, Dario Mužević, Irena Galić, Danilo Babin, Aleksandra Pižurica
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Magnetic Resonance Angiography, Segmenting the Circle, Circle of Willis, Resonance Angiography, Magnetic Resonance
Comments: 9 pages, 4 figures, 1 table. Accepted at EUSIPCO 2026
Abstract:Segmenting the Circle of Willis (CoW) from Magnetic Resonance Angiography (MRA) is challenging due to complex topology and thin vascular structures that are prone to fragmentation. Standard Convolutional Neural Networks (CNNs) often fail to capture these topological constraints, resulting in "broken vessel" artifacts. To address this, we propose the Anatomically Conditioned Recurrent Refinement U-Net (AC2RUNet). Our architecture decouples segmentation into two streams: a Static Stream that extracts invariant anatomical features and a lightweight Dynamic Stream that iteratively refines topological errors over time. We further introduce a dynamic curriculum learning strategy that transitions from high-recall geometric supervision to topology-aware constraints. Validated on the TopCoW dataset, AC2RUNet substantially reduces Hausdorff Distance (4.72 mm vs 9.17 mm) and Betti number errors (0.19 vs 0.40), improving topological connectivity over the nnU-Net baseline while maintaining comparable volumetric Dice.
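The Hausdorff Distance reported above measures the worst-case mismatch between two point sets, which is why stray or missing vessel fragments affect it strongly. Below is a minimal sketch of the standard symmetric definition on small point lists; whether the paper uses this maximum form or a percentile variant is not stated in the abstract, and the coordinates are invented.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from any point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical surface points in millimetres. The prediction matches the
# reference except for one stray fragment 3 mm away from the vessel.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
prediction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]

print(hausdorff(reference, prediction))
```

A single outlying fragment sets the whole score, whereas an overlap measure such as Dice would barely change.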
12. 【2606.12316】Slots, Transitions, Loops: Learning Composable World Models for ARC
Link: https://arxiv.org/abs/2606.12316
Authors: Gege Gao, Bernhard Schölkopf, Andreas Geiger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: ARC tests in-context, in-context rule induction, tests in-context rule, input-output demonstrations, ARC tests
Abstract:ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.
13. 【2606.12303】From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion
Link: https://arxiv.org/abs/2606.12303
Authors: Yuchen Xian, Yunqiu Xu, Yang He, Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: integrate complementary information, Multimodal image fusion, preserves rich local, rich local details, maintaining globally consistent
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: this https URL
14. 【2606.12300】Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition
Link: https://arxiv.org/abs/2606.12300
Authors: Sukmin Seo,Geewook Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: grounding remain underexplored, natural-language grounding remain, returning the interval, Temporal grounding, remain underexplored
Comments: 10 pages, 6 figures, Code and benchmark: [this https URL](https://github.com/naver-ai/ExtremeWhenBench)
Abstract:Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.
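The task and the retrieve-then-ground idea described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed-length windowing, and the use of temporal IoU as the score are hypothetical and are not taken from the paper or from ExtremeWhenBench.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two intervals (start, end), in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def retrieve_then_ground(window_scores, window_len, ground_fn):
    """Two-stage search (hypothetical): pick the best coarse window, then localize in it.

    window_scores: one retrieval score per fixed-length window of the video.
    ground_fn:     callable that takes a window index and returns an interval
                   (start, end) relative to the start of that window.
    """
    best = max(range(len(window_scores)), key=window_scores.__getitem__)
    offset = best * window_len
    local_start, local_end = ground_fn(best)
    # Convert the window-relative interval back to video time.
    return (offset + local_start, offset + local_end)
```

For example, with 60-second windows and scores `[0.1, 0.9, 0.3]`, the second window is selected, so a local prediction of `(5.0, 12.0)` maps to `(65.0, 72.0)` in video time.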
15. 【2606.12295】Findings of the MAGMaR 2026 Shared Task
Link: https://arxiv.org/abs/2606.12295
Authors: Alexander Martin,Dengjia Zhang,Joel Brogan,Francis Ferraro,Jeremy Gwinnup,Reno Kriz,Teng Long,Kenton Murray,Andrew Yates,Xiang Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Multimodal Augmented Generation, Multimodal Augmented, overview paper presents, Multimodal Retrieval, Augmented Generation
Comments: Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: [this https URL](https://github.com/rekriz11/MAGMAR_2026)
Abstract:This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.
16. 【2606.12294】Bridging the Modality Gap in Forensic Image Retrieval
Link: https://arxiv.org/abs/2606.12294
Authors: Ricardo González-Gazapo,Annette Morales-González,Yoanna Martínez-Díaz,Heydi Méndez-Vázquez,Milton García-Borroto
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: increasingly critical role, Automated image retrieval, Automated image, retrieval, plays an increasingly
Comments: 23 pages, 5 figures, paper submitted to Elsevier journal
Abstract:Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
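The fusion of text-based and image-based similarity scores mentioned in this abstract could look roughly like the following. The abstract does not give the fusion rule, so the min-max normalization, the weighted sum, and the weight `alpha` are assumptions made purely for illustration.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(image_scores, text_scores, alpha=0.5):
    """Late fusion (assumed form): weighted sum of normalized per-gallery scores.

    Returns the gallery indices sorted from best to worst match, and the fused scores.
    """
    img = min_max(image_scores)
    txt = min_max(text_scores)
    fused = [alpha * i + (1.0 - alpha) * t for i, t in zip(img, txt)]
    order = sorted(range(len(fused)), key=lambda k: fused[k], reverse=True)
    return order, fused
```

Normalizing each modality before mixing keeps one score range from dominating the other; with `alpha=1.0` the ranking falls back to the visual-only scores and with `alpha=0.0` to the text-only scores.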
17. 【2606.12286】CellNet -- Localizing Cells using Sparse and Noisy Point Annotations
Link: https://arxiv.org/abs/2606.12286
Authors: Benjamin Eckhardt,Dmytro Fishman,Stuart Fawke,Andrew Curtis,Bo Fussing,Constantin Pape
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Wellcome Sanger Institute, biological research workflows, Sanger Institute study, important step, Wellcome Sanger
Comments: Conference poster at Biology at Scale: From Variants to Cellular Programs and Functions
Abstract:Counting living cells is an important step in many biological research workflows. Our collaborators at the Wellcome Sanger Institute study vital genes in humans via large-scale saturation genome editing screening, which requires counting cells a great number of times. Computer-vision-based automation is crucial for high throughput and resource efficiency. In this work, we develop a regression-based deep learning computer vision algorithm to detect and count cells in phase-contrast microscopy images. To reduce annotation effort, which in practice often becomes a bottleneck, we focus on counting cells using only sparse point annotations, which are fast and easy to acquire. By comparison to state-of-the-art zero-shot methods, we show that regression-based counting is a promising alternative in low-data regimes. Through developing methods to automatically count living cells in microscopy images, we contribute to valuable research on the human genome. The code is available at this https URL.
18. 【2606.12278】Finding Sparse Subnetworks in One Training Cycle via Progressive Magnitude-Based Pruning
Link: https://arxiv.org/abs/2606.12278
Authors: Romana Qureshi,Hafida Benhidour,Said Kerrache,Nahlah Aljeraisy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: preserve predictive performance, Lottery Ticket Hypothesis, reduces model size, pruning reduces model, predictive performance
Comments:
Abstract:Neural network pruning reduces model size by removing less important parameters while aiming to preserve predictive performance. Although the Lottery Ticket Hypothesis (LTH) shows that sparse subnetworks can match dense networks when trained from suitable initializations, its iterative pruning procedure requires multiple complete training cycles. This work evaluates progressive magnitude-based pruning as a single-cycle alternative. The method gradually increases sparsity during training using a linear schedule and updates pruning masks based on active weight magnitudes. We conduct systematic experiments on CIFAR-10 and MNIST across ResNet, VGG-style, and LeNet architectures, comparing the proposed method with representative iterative and initialization-based pruning baselines, including LTH, SNIP, and GraSP. On CIFAR-10, the method achieves 95.12% accuracy on ResNet-18 at 72.9% sparsity, compared with 90.5% reported for LTH. At extreme sparsity, it achieves 93.13% accuracy on a VGG-like architecture at 97% sparsity, compared with approximately 92.0% for SNIP, and 93.44% accuracy on VGG-19 at 97.97% sparsity, compared with 92.19% for GraSP at 98% sparsity. A sparsity-accuracy analysis on ResNet-18 further shows that accuracy remains within 0.1 percentage points of the dense baseline across 70-85% sparsity. These results indicate that progressive magnitude-based pruning provides an effective single-cycle approach for neural network sparsification under the evaluated settings.
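A linear sparsity schedule with magnitude-based mask updates, as summarized in this abstract, might be sketched as follows. This is a toy version on a flat list of weights; the function names are hypothetical, and the choice that pruned weights never regrow is an assumption the abstract does not confirm.

```python
def linear_sparsity(step, total_steps, final_sparsity):
    """Target sparsity at a given step, rising linearly from 0 to final_sparsity."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * frac


def update_mask(weights, mask, sparsity):
    """Prune the smallest-magnitude active weights until the target sparsity is met.

    Assumption: weights that are already pruned (mask == 0) stay pruned.
    """
    n_prune = int(round(sparsity * len(weights)))
    active = [i for i, m in enumerate(mask) if m]
    already_pruned = len(weights) - len(active)
    extra = max(0, n_prune - already_pruned)
    # Rank only the still-active weights by absolute value.
    active.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in active[:extra]:
        new_mask[i] = 0
    return new_mask
```

In a training loop, one would call `update_mask` every few steps with `linear_sparsity(step, total_steps, final_sparsity)` and multiply the weights by the mask, so the network reaches its final sparsity within a single training run.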
19. 【2606.12263】VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models
Link: https://arxiv.org/abs/2606.12263
Authors: Chunlin Qiu,Ang Li,Tianxiao Huang,Ruilin Gan,Yunjie Ge,Shenyi Zhang,Huayi Duan,Lingchen Zhao,Chao Shen,Qian Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: revolutionized visual synthesis, increasingly exploited, Latent Diffusion Models, USENIX Security Symposium, USENIX Security
Comments: To appear in the 35th USENIX Security Symposium (USENIX Security 2026)
Abstract:While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.
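The 223% figure quoted in this abstract is consistent with a relative increase computed from the two FID values, assuming 113 is the average FID under the strongest prior defense and 365 is the average FID under VOID:

```latex
\[
\frac{365 - 113}{113} = \frac{252}{113} \approx 2.23
\quad\Longrightarrow\quad \text{a relative increase of about } 223\%.
\]
```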
20. 【2606.12258】Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning
Link: https://arxiv.org/abs/2606.12258
Authors: Jiyang Xu,Rui Liu,Hang Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: substantial visual appearance, visual appearance discrepancies, Cross-domain day-night re-identification, nighttime scenes, fundamentally challenged
Comments:
Abstract:Cross-domain day-night re-identification (ReID) is fundamentally challenged by the substantial visual appearance discrepancies between daytime and nighttime scenes. Existing fully supervised methods rely heavily on labor-intensive annotations, which are costly and exhibit limited generalization across domains. In this work, we investigate unsupervised day-night ReID and propose a novel framework that synergistically combines prompt learning and prototype-based representation learning to associate identities across domains without requiring manual labels. Our approach follows a progressive two-stage training strategy. In the first stage, we exploit the vision-language model to generate instance-specific textual prompts in an annotation-free manner. We employ an instance-level alignment mechanism to embed visual features and textual prompts into a unified semantic space, aligning unlabeled day/night images with learnable prompts via instance-aware dynamic-bias adaptation. In the second stage, we construct domain-specific prototype memory banks and introduce two complementary modules: i) an intra-domain identity association module to enhance feature discriminability within each domain, and ii) a cross-domain prototype matching module to reliably identify positive and negative prototype pairs, thereby establishing robust identity correspondences across day and night. Extensive experiments on public benchmarks validate the effectiveness of our method. Under the unsupervised setting, our framework attains Rank-1 accuracy comparable to state-of-the-art fully supervised methods.
21. 【2606.12248】Damage-TriageFormer: A Foundation-Model Framework for Typology-Based Building Damage Assessment from Mono-Temporal Imagery
Link: https://arxiv.org/abs/2606.12248
Authors: Yiming Xiao,Yu-Hsuan Ho,Sanjay Thasma,Junwei Ma,Ali Mostafavi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: require paired pre, single severity scale, Decision-relevant building damage, severity scale, Simple Feature Pyramid
Comments:
Abstract:Decision-relevant building damage assessment is critical for prioritizing resources and recovery after a disaster, yet most automated methods either flatten damage into a single severity scale (no damage, minor, major, destroyed) or require paired pre- and post-event imagery that is often unavailable for emerging hazards. This paper presents Damage-TriageFormer, a single-image, post-event, footprint-conditioned model that produces a damage typology rather than a severity scale. We contribute: (1) DamageTriage-Bench, a new benchmark built from NOAA Emergency Response Imagery across Hurricane Michael (2018), Hurricane Helene (2024), and the 2025 Los Angeles wildfire complex, with five typology classes that distinguish roof damage from structural damage and, within each, partial from total extent; and (2) Damage-TriageFormer, which extends a DINOv3 ViT-L backbone with a Simple Feature Pyramid for higher-resolution instance pooling, a two-stage gated damage head, and an auxiliary severity-regression objective. Our model achieves macro F1 of 0.624 on validation and 0.619 on a held-out stratified test set, performing strongest where operational triage needs it most, with per-class F1 of 0.91 and 0.84 on undamaged buildings and total structural collapse, respectively. While the rare Total Roof Damage class remains difficult due to its limited examples and an inherently ambiguous label boundary, our results show that single-image post-event imagery can support actionable building damage typing, enabling targeted emergency response and resource allocation without a pre-event reference.
22. 【2606.12236】DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems
Link: https://arxiv.org/abs/2606.12236
Authors: Zhongyu Xia,Wenhao Chen,Yongtao Wang,Ming-Hsuan Yang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: handle long-tail scenarios, increasingly incorporating foundation, incorporating foundation models, Large Language Model, long-tail scenarios
Comments:
Abstract:Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.
23. 【2606.12226】An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography
Link: https://arxiv.org/abs/2606.12226
Authors: Xinqi Zhang,Qiming Ma,Lihui Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Electrical Capacitance Tomography, methods map directly, data-driven methods map, Capacitance Tomography, Electrical Capacitance
Comments:
Abstract:While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field -- the fundamental physical link governing the nonlinear and ill-posed "soft-field" effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.
24. 【2606.12218】Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model
Link: https://arxiv.org/abs/2606.12218
Authors: Sk Muhammad Asif,Orhun Aydin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Understanding spatial distribution, Cropland Data Layer, USDA Cropland Data, optimizing the food-water, water conservation
Comments: 10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026
Abstract:Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale that are ill-suited for the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids through scaling of single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves a mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter-free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.
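The loss named in this abstract is read here as the Distance-IoU (DIoU) loss, which adds a normalized center-distance penalty to the usual IoU term; that reading fits the "center-aware localization" remark but is an assumption. A minimal sketch for two axis-aligned boxes in `(x1, y1, x2, y2)` form, not the authors' implementation:

```python
def diou_loss(box_a, box_b):
    """DIoU loss: 1 - IoU + (squared center distance) / (squared enclosing diagonal)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU of the two boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2

    return 1.0 - iou + (d2 / c2 if c2 > 0 else 0.0)
```

Unlike plain IoU loss, the distance term still gives a useful signal when the boxes do not overlap at all, which is one plausible reason it helps with irregularly shaped fields.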
25. 【2606.12217】Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
Link: https://arxiv.org/abs/2606.12217
Authors: Lu Qiu,Yizhuo Li,Yi Chen,Yuying Ge,Yixiao Ge,Xihui Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: future scene evolution, offer a promising, promising route, route for robot, scene evolution
Comments:
Abstract:World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
26. 【2606.12215】MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching
Link: https://arxiv.org/abs/2606.12215
Authors: David Yuchen Wang,Haoying Li,Hailun Xu,Wei Chee Yew,Zirui Zhu,Sanjay Saha,Hao Hei,Kanchan Sarkar,Kun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: user-generated video content, numerous near-duplicate videos, partial edits, explosive growth, growth of user-generated
Comments: Accepted by KDD-2026 ADS track
Abstract:The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
27. 【2606.12213】SHERPA: Seam-aware Harmonized ERP Adaptation for Open-Domain 360$^\circ$ Panorama Generation
Link: https://arxiv.org/abs/2606.12213
Authors: Jungwoon Kang,Jaehun Kim,Yiwon Yu,Hyungyum Jang,Sanghoon Lee,Jongyoo Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Panoramic imagery, Circular Latent Encoding, non-photorealistic environments, imagery is increasingly, Dual-Path Training Scheme
Comments: 29 pages, 23 figures, 5 tables. Preprint version
Abstract:Panoramic imagery is increasingly used in world-generation, games, and simulation, where users may need not only photorealistic scenes but also stylized and non-photorealistic environments. Large-scale text-to-image diffusion and flow models provide broad style and semantic priors for this goal, but planar image training misaligns them with the wrap-around topology and polar regions of $360^\circ$ panoramas represented in equirectangular projection (ERP). We present SHERPA, a lightweight adaptation framework that combines frequency-selective Circular RoPE, Circular Latent Encoding/Decoding, image-side FFN adapters, and a Dual-Path Training Scheme. Circular RoPE replaces only the seam-sensitive high-frequency horizontal RoPE band with integer-periodic harmonics while preserving the pretrained lower-frequency spectrum. The Paired Panorama Path supervises geometry, while the Unpaired Style Path uses self-supervised yaw consistency for target-free stylized prompts. As a result, SHERPA generates $360^\circ$ panoramas across both photorealistic panorama domains and open-domain stylized prompts.
28. 【2606.12195】InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
Link: https://arxiv.org/abs/2606.12195
Authors: Ziang Yan,Sheng Xia,Jiashuo Yu,Yue Wu,Tianxiang Jiang,Songze Li,Kanghui Tian,Yicheng Xu,Yinan He,Kai Chen,Limin Wang,Yu Qiao,Yi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent progress, involving multi-step reasoning, agentic behavior involving, behavior involving multi-step, progress in foundation
Comments:
Abstract:Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
29. 【2606.12189】DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds
Link: https://arxiv.org/abs/2606.12189
Authors: Weirong Chen,Keisuke Tateno,Hidenobu Matsuki,Michael Niemeyer,Daniel Cremers,Federico Tombari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: partial point cloud, partial point, point cloud, point cloud sequences, lack explicit temporal
Comments: ICML 2026. Project page: [this https URL](https://wrchen530.github.io/dynatok/)
Abstract:We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: this https URL.
30. 【2606.12171】Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions
Link: https://arxiv.org/abs/2606.12171
Authors: José Medina,Paul Honeine,Abdelaziz Bensrhair,Amnir Hadachi
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: captures inherent class, inherent class relationships, class boundaries, inherent class, class relationships
Comments:
Abstract:Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries; KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite it, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
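The setting studied in this abstract, where mixup is applied to the student's inputs and the teacher is queried on the same mixed input, can be illustrated with the textbook forms of both techniques. The loss below combines a mixup cross-entropy with a temperature-scaled KL term; the weighting `alpha`, the temperature, and the exact combination are assumptions, not the paper's stated objective.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixup(x1, x2, lam):
    """Convex combination of two inputs; the labels are mixed with the same lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


def kd_mixup_loss(student_logits, teacher_logits, y1, y2, lam,
                  temperature=4.0, alpha=0.5):
    """Student loss on one mixed input (assumed form).

    Both logit vectors are computed on the SAME mixed input, so the teacher is
    evaluated on a vicinal sample it did not see during its own training.
    """
    p_s = softmax(student_logits)
    # Mixup cross-entropy against the two source labels.
    ce = -(lam * math.log(p_s[y1]) + (1.0 - lam) * math.log(p_s[y2]))
    # Temperature-softened KL(teacher || student), scaled by T^2 as is customary.
    q_t = softmax(teacher_logits, temperature)
    q_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_t, q_s) if t > 0)
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl
```

When the student's logits equal the teacher's, the KL term vanishes and only the mixup cross-entropy remains, which makes it easy to see how much of the loss comes from each source.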
31. 【2606.12169】OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models
Link: https://arxiv.org/abs/2606.12169
Authors: Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: correct final answers, High-stakes clinical, large vision-language models, final answers, large vision-language
Comments: 42 pages, 9 figures, 24 tables. Dataset and code: [this https URL](https://huggingface.co/datasets/neginb/OpenMedReason)
Abstract:High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.
32. 【2606.12153】TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation
Link: https://arxiv.org/abs/2606.12153
Authors: Cheng-Feng Pu, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: methods remain brittle, labor-intensive manual rigging, capture methods remain, requiring labor-intensive manual, current motion capture
Abstract:The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, i.e., from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupies a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that TopoCap outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. Dataset is publicly available at this https URL.
33. 【2606.12142】AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents
Link: https://arxiv.org/abs/2606.12142
Authors: Ke Li, Jianfei Yang, Luyao Zhang, Guo Yu, Chengwei Yan, Yuan Ding, Di Wang, Nan Luo, Gang Liu, Xiao Gao, Quan Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unmanned aerial vehicles, Unmanned aerial, search and rescue, environmental monitoring, emergency response
Abstract:Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.
34. 【2606.12140】Time-Conditioned and Multi-Time Survival Prediction from 2D PET/CT Projections in Lung Cancer
Link: https://arxiv.org/abs/2606.12140
Authors: Ashish Chauhan, Sambit Tarai, Elin Lundström, Johan Öfverstedt, Håkan Ahlström, Joel Kullberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: positron emission tomography, computed tomography, Accurate prediction, emission tomography, support personalized treatment
Comments: Under review at MIUA 2026
Abstract:Accurate prediction of overall survival (OS) from positron emission tomography/computed tomography (PET/CT) can support personalized treatment and follow-up strategies in oncology. However, the impact of temporal modeling on imaging-based survival prediction remains insufficiently explored. We investigate how different temporal formulations influence survival prediction by developing two complementary approaches: Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS). We retrospectively analyzed pre-treatment PET/CT images from 848 patients with non-small cell lung cancer (NSCLC), including 556 for model development and 292 for held-out testing. A previously proposed Time-Conditioned Survival (TCS) model was used as a baseline. Models were trained using 5-fold cross-validation and evaluated on the test set using time-dependent area under the curve (AUC) at 6-month intervals from 0.5 to 5 years. Both ATCS and MTS outperformed the baseline TCS model, achieving mean AUCs of 0.794 and 0.793, respectively, compared to 0.767. ATCS performed better at earlier time points (0.5-3 years), whereas MTS performed better at later intervals (3.5-5 years). Combining tumor-specific and tissue-wise PET/CT features improved performance over either input alone. Finer temporal discretization improved short-term prediction, while coarser intervals provided more stable long-term estimates. These findings demonstrate that temporal modeling and input design influence PET/CT-based survival prediction. The proposed approaches enable time-specific survival estimation from pre-treatment imaging and may support improved risk stratification and clinical decision-making.
35. 【2606.12126】AGE-MIL: Anchor-Guided Evidence Learning for Patient-Level Prediction
Link: https://arxiv.org/abs/2606.12126
Authors: Jiawei Niu, Jian Chen, Di Zhang, Junbo Lu, Zhangcheng Liao, Xuhao Liu, Honglin Zhong, Mireia Crispin-Ortuzar, Chen Li, Zeyu Gao, Yi Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Existing computational pathology, level multiple instance, multiple instance learning, Existing computational, modeling remains underexplored
Comments: 11 pages, 2 figures, MICCAI early accepted
Abstract:Existing computational pathology methods predominantly operate within whole-slide image (WSI)-level multiple instance learning (MIL) paradigms, while patient-level modeling remains underexplored. In routine pathological practice, however, pathologists derive diagnostic and prognostic conclusions by integrating evidence across multiple WSIs rather than relying on any single slide. This discrepancy creates a fundamental misalignment when patient-level supervision is directly imposed on conventional MIL frameworks, often leading to unstable optimization and degraded predictive reliability. To address this issue, we propose Anchor-Guided Evidence MIL (AGE-MIL), a weakly supervised framework for patient-level prediction. AGE-MIL constructs a patient-level anchor from slide representations to capture global pathological context and guide the retrieval and integration of diagnostically relevant local patches, enabling robust patient-level modeling. Patient-level risk is further modeled as an evidence accumulation process, promoting stable optimization under weak supervision. AGE-MIL is evaluated on six clinically relevant patient-level prediction tasks from two independent cohorts. Experimental results show that the proposed framework consistently outperforms eight state-of-the-art MIL methods. Code is available at this https URL.
36. 【2606.12125】Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding
Link: https://arxiv.org/abs/2606.12125
Authors: Biao Tang, Xu Chen, Shuxiang Gou, Jingyi Yuan, Yuhan Zhang, Chenqiang Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, multimodal large language, temporally extended videos, understanding remains challenging, language models
Comments: 10 pages, 5 figures, 8 tables. Code will be made publicly available
Abstract:Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos under a limited visual budget. However, most of them still follow a frame-centric paradigm and apply similar representations to retained content regardless of its importance. This makes it difficult to preserve both high-fidelity visual evidence and broad temporal coverage. To address this issue, we propose Q-Fold, a training-free input construction framework for long-video understanding. Instead of treating isolated frames as the basic modeling unit, Q-Fold operates on contiguous temporal segments and constructs a heterogeneous Focus--Context representation under query guidance. Query-relevant segments are preserved as high-fidelity Focus Frames, while less relevant segments are folded into chronology-preserving contextual layouts. In this way, Q-Fold preserves critical visual evidence and broad temporal coverage, while better maintaining local temporal continuity within short segments. Experiments on four long-video benchmarks with multiple Video-MLLMs show that Q-Fold consistently improves performance without increasing the input budget. Notably, it achieves gains of up to 9.1 percentage points on an ultra-long video benchmark. Code will be made publicly available.
37. 【2606.12106】MSUE: Multi-Modal Soccer Understanding Expert
Link: https://arxiv.org/abs/2606.12106
Authors: Litao Li, Yibo Yu, Yufeng Hu, Zhuo Yang, Jiali Wen, Yixin Chen, Yixi Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: SoccerNet VQA Challenge, Large Language Model, SoccerNet VQA, paper presents, presents our solution
Comments: 6 pages, 1 figure
Abstract:This paper presents our solution to the 2026 SoccerNet VQA Challenge. We first develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM), which systematically restructures raw domain data into diverse VQA samples, including concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture that employs a Large Language Model (LLM) to dynamically dispatch questions to text, image, and video experts. These experts are instantiated as a strong text baseline Gemini3-Flash, a fine-tuned Qwen3-VL, and an external knowledge base, respectively, working collaboratively to enhance VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.
38. 【2606.12105】DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model
Link: https://arxiv.org/abs/2606.12105
Authors: Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: shared synchronous clock, models inherit, vision-language pretraining, inherit a shared, clock from vision-language
Comments: 17 pages, 8 figures
Abstract:Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL
39. 【2606.12099】ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation
Link: https://arxiv.org/abs/2606.12099
Authors: Junlin Hao, Haoshuai Fu, Xibin Song, Wei Li, Ruigang Yang, Xinggong Zhang, Jinchuan Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: semantically meaningful components, synthesize structured objects, meaningful components, identity-layout entanglement, aims to synthesize
Abstract:Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.
40. 【2606.12074】Non-frontal face recognition using GANs and memristor-based classifiers
Link: https://arxiv.org/abs/2606.12074
Authors: Semih Vazgecen, Cristian Sestito, Spyros Stathopoulos, Themis Prodromakis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Keywords: delivering high performance, deep learning techniques, delivering high, complex scenarios, advanced significantly
Comments: 12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)
Abstract:Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.
41. 【2606.12072】World Model Self-Distillation: Training World Models to Solve General Tasks
Link: https://arxiv.org/abs/2606.12072
Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: promising visual world, textual descriptions limits, emergent task-solving abilities, exhibit emergent task-solving, visual world models
Abstract:Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
42. 【2606.12069】Tac-DINO: Learning Vision-Tactile Features with Patch Alignment
Link: https://arxiv.org/abs/2606.12069
Authors: Hong Li, Yankang Dong, Yue Xu, Yihan Tang, Mingzhu Li, Jiamin Qiu, Qihang Yao, Xing Zhu, Yujun Shen, Nan Xue, Yong-Lu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: primary medium, humans interact, Touch, holographic matching, alignment
Abstract:Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited, and suitable datasets and benchmarks are also lacking. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. Then we propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these methods outperform methods without alignment and align with whole-object images.
43. 【2606.12066】Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries
Link: https://arxiv.org/abs/2606.12066
Authors: Quoc Thuan Nguyen, Ha Anh Vu, Ngo Dang Thanh Ngan, Minh Phuc Hoang Ngoc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern vehicular systems, Indian Driving Dataset, Berkeley Deep Drive, Deep Drive Dataset, vehicular systems
Abstract:In modern vehicular systems, robust performance under harsh conditions has become a critical problem for autonomous driving. Our study delivers a comprehensive evaluation of the newest iteration of the YOLO series, the YOLOv11 Nano architecture, benchmarked against the widely adopted YOLOv8 Nano as a baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and the Berkeley Deep Drive Dataset (BDD100K) [2]. We analyze the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, the proposed model exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining a real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.
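The "22% fewer FLOPs" figure can be checked directly from the two values quoted in the abstract. The short sketch below only reproduces that arithmetic; the helper name is hypothetical and nothing here comes from the paper's code.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# FLOPs quoted in the abstract: 8.1G for YOLOv8n (baseline), 6.3G for YOLOv11n.
flops_saving = relative_reduction(8.1, 6.3)
print(f"{flops_saving:.1f}% fewer FLOPs")  # prints "22.2% fewer FLOPs"
```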
44. 【2606.12051】MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID
Link: https://arxiv.org/abs/2606.12051
Authors: Xulin Li, Yan Lu, Bin Liu, Qinhong Yang, Qi Chu, Tao Gong, Nenghai Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visible-infrared person re-identification, large modality discrepancy, Visible-infrared person, person re-identification, infrared images
Comments: CVPR Highlight
Abstract:Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.
45. 【2606.12047】Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding
Link: https://arxiv.org/abs/2606.12047
Authors: Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Keywords: impact event occurs, event occurs, natural language, address the problem, frame it occurs
Comments: Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15
Abstract:In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
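The final aggregation step, a score-weighted centroid over detections from several keyframes, can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' code: the box format `(x_min, y_min, x_max, y_max, score)` and the function name are assumptions.

```python
def score_weighted_centroid(detections):
    """Average box centres, weighting each by its detection score.

    `detections` is a list of (x_min, y_min, x_max, y_max, score) tuples
    collected across keyframes (an assumed format). Returns (cx, cy), or
    None when there are no detections with positive total score.
    """
    total = sum(d[4] for d in detections)
    if total <= 0:
        return None
    cx = sum((d[0] + d[2]) / 2.0 * d[4] for d in detections) / total
    cy = sum((d[1] + d[3]) / 2.0 * d[4] for d in detections) / total
    return cx, cy


# Two toy detections: the higher-scoring box pulls the centroid towards it.
centroid = score_weighted_centroid([
    (0.0, 0.0, 2.0, 2.0, 1.0),  # centre (1, 1), score 1
    (2.0, 2.0, 4.0, 4.0, 3.0),  # centre (3, 3), score 3
])
print(centroid)  # (2.5, 2.5)
```

Weighting by score means low-confidence detections in noisy keyframes contribute less to the final location than confident ones.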
46. 【2606.12036】Vision Transformers for Face Recognition Need More Registers
Link: https://arxiv.org/abs/2606.12036
Authors: Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Transformers, standard CLS-token paradigm, Recent advances, advances in Vision, patch embeddings
Comments: Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)
Abstract:Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance compared with CLS-based approaches, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens, learnable tokens that are concatenated to the initial patch embeddings and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, which corresponds to a CPE-based ViT-B architecture augmented with eight register tokens, achieves state-of-the-art performance among ViT-based FR models on the large-scale IJB-B and IJB-C benchmarks. ViT-8R also produces substantially clearer attention maps than the baseline model, offering deeper insight into the model's attention behavior (this https URL).
47. 【2606.12033】SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection
Link: https://arxiv.org/abs/2606.12033
Authors: Min Yang, Mi Zhou, Limin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video understanding, video understanding models, numerous application scenarios, Artificial Neural Networks, computer vision
Comments: Accepted by Pattern Recognition
Abstract:Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at this https URL.
48. 【2606.12023】ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation
Link: https://arxiv.org/abs/2606.12023
Authors: Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong potential, computer vision, gained significant attention, gained significant, shown strong
Comments: Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)
Abstract:Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, thus reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. We also propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.
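The early-exit idea can be sketched in a few lines: run only the first few encoder blocks, then apply the same output head that would normally follow the last block, which is possible because every block produces features of the same dimensionality. The code below is a toy illustration under that assumption, not the ViT-FREE implementation; the function name and the stand-in "blocks" (plain callables) are hypothetical.

```python
def forward_with_early_exit(blocks, head, x, exit_layer):
    """Apply the first `exit_layer` blocks, then the shared output head.

    `blocks` is a list of callables standing in for encoder blocks and
    `head` stands in for the projection to the final embedding.
    """
    if not 1 <= exit_layer <= len(blocks):
        raise ValueError("exit_layer must be between 1 and len(blocks)")
    for block in blocks[:exit_layer]:
        x = block(x)
    return head(x)


# Toy stand-ins: 12 "blocks" that each add 1, and an identity head.
blocks = [lambda v: v + 1 for _ in range(12)]
full_output = forward_with_early_exit(blocks, lambda v: v, 0, exit_layer=12)
early_output = forward_with_early_exit(blocks, lambda v: v, 0, exit_layer=10)
```

Here `early_output` reflects 10 of the 12 blocks being executed; in a real model the saving in compute depends on the cost of each block and of the head, so the block count alone does not determine the speedup.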
49. 【2606.12012】FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control
Link: https://arxiv.org/abs/2606.12012
Authors: Yiqun Ning, Ao Shen, Chenhang He, Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: impressive visual realism, achieved impressive visual, visual realism, physical plausibility, diffusion-based virtual try-on
Comments:
Abstract:While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a fit-aware virtual try-on model for diverse bodies in the wild. FitVTON encodes garment-body size through structured text prompts and learns from simulated try-on triplets from a parameterized garment model. To improve the fit of garment silhouettes, we introduce two auxiliary heads that predict masks for both the garment and the exposed body. We further introduce a texture rectification stage to improve the realism of appearance learned from simulated data. To evaluate fitting fidelity, we curate a real-world dataset, FittingEffect3K, and pair it with a VLM-based scoring protocol. Both subjective and quantitative experiments show that FitVTON achieves authentic fitting fidelity, with significantly better sizing accuracy and shape preservation than state-of-the-art methods while maintaining competitive image quality. Project Page: this https URL.
50. 【2606.11989】From Nominal Intensity to Equivalent Rainfall: A Path-Based Credibility Evaluation Framework for Simulated Rainfall in Autonomous-Driving Perception Tests
Link: https://arxiv.org/abs/2606.11989
Authors: Tian Xia, Xin Zhao, Shaolingfeng Ye, Junyi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: identifying perception-system boundaries, supporting SOTIF-oriented risk, SOTIF-oriented risk assessment, automated driving, essential for identifying
Comments: 17 pages, preprint
Abstract:Credible simulated-rainfall conditions are essential for identifying perception-system boundaries and supporting SOTIF-oriented risk assessment in automated driving. However, closed-field tests are often described only by nominal rainfall intensity or single-point measurements, making it difficult to align simulated rain fields with real rainfall and map test results to real-world scenarios. This paper proposes a path-based credibility evaluation method for simulated rainfall in autonomous-driving perception tests. Using the drop size and velocity joint distribution of real rainfall as the reference, each candidate path is represented by path-equivalent rainfall intensity, an uncertainty band, and a path-averaged Realism of Raindrop Distribution (RRD) score. Lidar target point-cloud count and mean reflectivity are further used for perception-consistency correction, quantifying the proxy capability of each simulated-rainfall path for real-rainfall perception effects. Experiments are conducted using about 10,000 real-rainfall raindrop-spectrum samples, 728 RainSense perception samples, and 45 spatial sampling points in a 2.4 m x 7.2 m simulated-rainfall area. Results show that spatial non-uniformity remains under the same nominal condition, confirming the need for path-based evaluation. The method identifies Path IV and Path VI as preferable candidates, with results of 11.54 +/- 0.31 mm/h, RRD = 0.43, and 8.28 +/- 0.34 mm/h, RRD = 0.46, respectively. These paths show more balanced performance in rainfall-intensity stability, raindrop-spectrum realism, and perception consistency. The proposed method supports path selection, condition description, and credible interpretation of autonomous-driving perception tests under rainfall.
51. 【2606.11977】ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction
Link: https://arxiv.org/abs/2606.11977
Authors: LeKai Yu, Hao Liu, Kun Wang, Zhiran Li, Ruping Cao, Fan Liu, Yupeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: DataMFM Challenge Track, DataMFM Challenge, Challenge Track, present our third-place, third-place solution
Comments:
Abstract:In this report, we present our third-place solution for the DataMFM Challenge Track 1: Document Parsing. This track requires models to recover structured Markdown documents from document page images while preserving textual content and document structure. To address the complementary requirements of accurate content recovery and faithful structure reconstruction, we propose ParseFixer, an agentic framework for backbone parsing and selective correction. ParseFixer consists of two key modules: Full-Page Backbone Parsing (FBP) and Agentic Selective Correction (ASC). FBP produces stable initial Markdown outputs with MinerU2.5 Pro, while ASC detects high-value parsing failures and repairs them through a verify-and-rollback correction process. By placing selective multimodal correction after open-source backbone parsing, ParseFixer improves the recovery of key document elements without rewriting reliable backbone predictions. On the test set, our final system achieves an overall score of 61.78 and ranks third in Track 1, demonstrating its effectiveness for accurate document parsing. Our code will be released at: this https URL.
52. 【2606.11969】SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation
Link: https://arxiv.org/abs/2606.11969
Authors: Xu Zhang, Yu Lu, Ruijie Quan, Zhaozheng Chen, Bohan Wang, Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Flow Matching, Matching has enabled, enabled robust, Flow, Matching
Comments:
Abstract:Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).
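The amplitude-rectification step described in this abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `rectify_amplitude`, the `strength` blending parameter, and the use of a second random array as the source of the amplitude prior are all assumptions made for illustration. The look-ahead prediction and the re-noising step are omitted.

```python
import numpy as np


def rectify_amplitude(latent, target_amplitude, strength=1.0):
    """Pull the amplitude spectrum of `latent` toward `target_amplitude`.

    The phase of every frequency bin is left untouched. `strength` in [0, 1]
    blends between the original amplitude (0) and the target amplitude (1).
    """
    spectrum = np.fft.fftn(latent)  # FFT over all axes (time, height, width)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    blended = (1.0 - strength) * amplitude + strength * target_amplitude
    rectified = blended * np.exp(1j * phase)
    # For a real latent and a target taken from a real signal, the inverse
    # transform is real up to floating-point error, so we keep the real part.
    return np.fft.ifftn(rectified).real


rng = np.random.default_rng(0)
drifted = rng.normal(size=(4, 8, 8))    # stand-in for a look-ahead clean latent
reference = rng.normal(size=(4, 8, 8))  # hypothetical source of the amplitude prior
prior_amplitude = np.abs(np.fft.fftn(reference))
corrected = rectify_amplitude(drifted, prior_amplitude)
```

Working on amplitude and phase separately is what lets the correction change the energy distribution across frequencies without moving spatial structure, which is carried mostly by the phase.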
53. 【2606.11966】Feature extraction for plant growth estimation
Link: https://arxiv.org/abs/2606.11966
Authors: Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Precision agriculture requires, Precision agriculture, plant growth stage, growth stage, growth stage estimation
Comments: 13 pages
Abstract:Precision agriculture requires the estimation of plant growth stages in real-time. When the plant growth stage is known, the wastage of resources in cultivation, such as nutrients and water, is reduced as only the required resources need to be supplied. Plants at different growth stages, however, have similar morphological features, which can make autonomous growth stage estimation difficult. This paper presents two feature extraction methods for growth stage estimation: one that uses a bank of Gabor filters and morphological operations, and the other that uses pre-trained convolutional neural networks (CNNs) and transfer learning. We test these methods on a publicly available plant growth stage dataset (bccr-segset) for two species, canola and radish, grown and captured under indoor conditions. The two proposed feature extraction methods are compared, using support vector machines and boosted trees as classifiers. We find that both methods are suitable for real-time applications, and that CNN features outperform the hand-crafted features, both with regard to speed and accuracy. The best system (VGG-19 features, classified with a radial basis function support vector machine) obtained an accuracy of 98.4% for both species, processing an image in 0.08 seconds.
54. 【2606.11930】Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews
Link: https://arxiv.org/abs/2606.11930
Authors: Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimedia AVI Challenge, Predicting psychological traits, ACM Multimedia AVI, asynchronous video interviews, verbal signals
Comments: 9 pages, 1 figure, 4 tables
Abstract:Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
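The relative MSE reduction quoted in this abstract can be checked with a few lines of Python. The MSE values come from the abstract; the helper name `relative_reduction` and the dictionary layout are illustrative only.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


official_baseline = 0.3334
ablation_mse = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

for name, mse in ablation_mse.items():
    print(f"{name}: {relative_reduction(official_baseline, mse):.1%}")
```

The last step, (0.3334 - 0.2696) / 0.3334, rounds to the 19.1% stated in the abstract.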
55. 【2606.11925】Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching
Link: https://arxiv.org/abs/2606.11925
Authors: Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: converts sign language, Sign language translation, spoken language text, holds significant promise, sign language video
Comments:
Abstract:Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at this https URL.
56. 【2606.11913】From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations
Link: https://arxiv.org/abs/2606.11913
Authors: Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Neural Knowledge Representation, Knowledge Representation, Neural Knowledge, Agentic Knowledge Distillation, long video
Comments:
Abstract:We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individual small portion of network weights attached to the VLM backbone. The NKR weights are optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.
57. 【2606.11894】Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection
Link: https://arxiv.org/abs/2606.11894
Authors: Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, time-consuming per-scene optimization, per-scene optimization required, required by traditional, optimization required
Comments:
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.
58. 【2606.11889】Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection
Link: https://arxiv.org/abs/2606.11889
Authors: Everett Richards
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: Vision-language models, autonomous driving, understanding in autonomous, analysis often relies, relies on task-agnostic
Comments: 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)
Abstract:Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
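The distinction between embedding drift and margin drift can be made concrete with a toy sketch. The abstract only says the hazard score is derived from CLIP image-text similarities, so the specific form used here (similarity to a hazard prompt minus similarity to a safe prompt), the cosine-distance definition of embedding drift, and the three-dimensional vectors are all assumptions for illustration. No real CLIP model is involved.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hazard_score(image_emb, hazard_text_emb, safe_text_emb):
    """Hypothetical margin: similarity to a hazard prompt minus a safe prompt."""
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)


def embedding_drift(clean_emb, corrupted_emb):
    """Task-agnostic drift: cosine distance between clean and corrupted embeddings."""
    return 1.0 - cosine(clean_emb, corrupted_emb)


def margin_drift(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    """Task-aligned drift: change in hazard score under the corruption."""
    return (hazard_score(corrupted_emb, hazard_text_emb, safe_text_emb)
            - hazard_score(clean_emb, hazard_text_emb, safe_text_emb))


# Toy embeddings standing in for encoder outputs.
hazard_text = [1.0, 0.0, 0.0]
safe_text = [0.0, 1.0, 0.0]
clean = [0.8, 0.6, 0.0]
corrupted = [0.6, 0.8, 0.0]

print(embedding_drift(clean, corrupted))
print(margin_drift(clean, corrupted, hazard_text, safe_text))
```

In this toy case the embedding moves only slightly, yet the hazard score changes sign, which is the kind of decoupling between representation drift and decision drift that the abstract describes.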
59. 【2606.11884】Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality
Link: https://arxiv.org/abs/2606.11884
Authors: Gregor Grote, Juan E. Tapia, Christian Rathgeb
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Keywords: Open Face Image, Open Face, remote verification systems, Face Image Quality, applying capture-related quality
Comments: Presented at IWBF 2026 (14th International Workshop on Biometrics and Forensics)
Abstract:This paper addresses the challenge of assessing image quality in ID cards in remote verification systems by applying capture-related quality measures from the Open Face Image Quality (OFIQ) standard to ID card images. Our preprocessing pipeline includes corner detection, perspective normalization, and comprehensive foreground masking to ensure accurate and unbiased quality measure computation. We evaluate the effectiveness of these measures by analyzing their correlation with the performance of three presentation attack detection (PAD) algorithms across four diverse ID card datasets, where two datasets contain bona fide, i.e. pristine, images and two contain printed mock ID cards. Our results suggest that quality assessment based on some OFIQ measures can significantly improve PAD performance.
60. 【2606.11880】SG2Loc: Sequential Visual Localization on 3D Scene Graphs
Link: https://arxiv.org/abs/2606.11880
Authors: Nicole Damblon, Olga Vysotska, Federico Tombari, Marc Pollefeys, Daniel Barath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex indoor environments, indoor environments remains, complex indoor, remains a critical, critical challenge
Comments: The code will be available at [this https URL](https://github.com/DmblnNicole/sg2loc)
Abstract:Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features, predicting object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity between the per-patch features of the input image and the object features from the scene graph determines the weight of a particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be available at this https URL.
61. 【2606.11853】Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
Link: https://arxiv.org/abs/2606.11853
Authors: Zhirui Chen, Ziwei Chen, Ling Shao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: long multi-modal sequences, Multi-modal large language, large language models, rapid task adaptation, finite context windows
Comments: Accepted to ICML 2026
Abstract:Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.
62. 【2606.11846】SheafStain: Sheaf-Theoretic Schrödinger Bridge for Spatially and Biologically Coherent Virtual Staining
Link: https://arxiv.org/abs/2606.11846
Authors: Hyeongyeol Lim, Hongjun Yoon, Eunjin Jang, Daeky Jeong, Won June Cho, Hwamin Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cost-efficient biomarker quantification, Current virtual staining, Vision Foundation Models, Current virtual, diagnostics and prognostics
Comments: 32 pages
Abstract:Current virtual staining approaches offer the potential for time- and cost-efficient biomarker quantification in cancer diagnostics and prognostics. However, patch-wise inference for gigapixel whole slide images (WSIs) fails to maintain spatial continuity, yielding artifacts that cause catastrophic mismatches with ground-truth images. Although pathology Vision Foundation Models (VFMs) offer rich representations, their self-attention causes varying global contexts to produce inconsistent embeddings for the same physical region. We formalize and validate this "context contamination" as a sheaf-theoretic problem where these embeddings form a presheaf that violates the gluing axiom. To address this, we propose SheafStain, a new approach that reinterprets VFM features as sheaf-like sections for spatially and biologically coherent virtual staining. Specifically, SheafStain integrates class and patch tokens into a Schrödinger Bridge framework as sheaf-like sections. While the class token anchors biological consistency, patch tokens form a per-position spatial map. A backbone co-pretrained on Hematoxylin & Eosin (H&E) and Immunohistochemistry (IHC) yields non-degenerate cross-stain stalks, so a single VFM feature space supervises both input conditioning and output stain alignment. Departing from prior work that evaluates on isolated $256 \times 256$ patches and either random-crops or resizes the $1024 \times 1024$ ground truth, we translate at $256 \times 256$ and evaluate on the stitched $1024 \times 1024$ outputs across HER2, ER, PR, and Ki-67. SheafStain demonstrates promising results against six prior methods while mitigating patch-boundary stitching artifacts. Code will soon be released.
63. 【2606.11841】Scene-Adaptive Nonlinear Tone Curves for Pseudo Ground-Truth Generation in Low-Light 3D Gaussian Splatting
Link: https://arxiv.org/abs/2606.11841
Authors: Mingzhe Lyu, Jinqiang Cui, Hong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: weak structural detail, compressed dynamic range, weak structural, structural detail, dynamic range
Comments:
Abstract:Low-light novel view synthesis is challenging because dark multi-view images contain noise, weak structural detail, and compressed dynamic range. Recent 3D Gaussian Splatting (3DGS) methods address these challenges by generating pseudo ground-truth (pseudo-GT) images as supervision targets when paired normal-light references are unavailable. Existing pseudo-GT methods apply a uniform linear gain to all pixels, which clips bright regions while providing insufficient enhancement in dark regions, limiting reconstruction quality. We observe that nonlinear tone mappings, long established in 2D low-light enhancement, have not been explored for pseudo-GT generation in 3D reconstruction. Accordingly, we propose a scene-adaptive nonlinear tone-curve framework that replaces linear pseudo-GT with nonlinear alternatives. The framework introduces percentile-based normalisation for scene-agnostic curve application, a scene-adaptive offset for automatic black-level adjustment, and two complementary curves: Adaptive SoftExp (ASE), a bounded exponential curve, and Adaptive Poly3 (AP3), a data-driven cubic polynomial. The module changes only the pseudo-GT computation and leaves the 3DGS backbone unchanged. Experiments on three benchmarks covering 21 scenes show that both curves consistently outperform the linear baseline with PSNR improvements up to +4.34 dB on LOM and +3.25 dB on RealX3D. Both curves achieve similar performance despite their different mathematical forms, suggesting the improvement is curve-agnostic. Code is available at this https URL
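To show why a bounded nonlinear curve behaves differently from a uniform linear gain, here is a minimal sketch. The abstract does not give the formulas for ASE or AP3, so `soft_exp_curve` below is a generic bounded exponential chosen for illustration, not the paper's curve. The nearest-rank `percentile` helper, the parameter `k`, and the pixel values are likewise assumptions, and the scene-adaptive black-level offset is omitted.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def soft_exp_curve(x, k=4.0):
    """Bounded exponential tone curve mapping [0, 1] onto [0, 1]."""
    return (1.0 - math.exp(-k * x)) / (1.0 - math.exp(-k))


def tone_map(pixels, q=99.0, k=4.0):
    """Normalise by the q-th percentile, then apply the nonlinear curve."""
    white = percentile(pixels, q)
    if white <= 0.0:
        return [0.0 for _ in pixels]
    return [soft_exp_curve(min(max(p / white, 0.0), 1.0), k) for p in pixels]


def linear_gain(pixels, gain):
    """Uniform linear gain with hard clipping at 1.0, for comparison."""
    return [min(p * gain, 1.0) for p in pixels]


dark_pixels = [0.0, 0.01, 0.02, 0.05, 0.10]
print(tone_map(dark_pixels, q=100.0))
print(linear_gain(dark_pixels, gain=20.0))
```

With the linear gain, every pixel above 0.05 saturates at 1.0 and becomes indistinguishable, whereas the bounded curve lifts dark values while keeping the ordering of the bright ones.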
64. 【2606.11838】Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding
Link: https://arxiv.org/abs/2606.11838
Authors: Hyomin Kim, Junghye Kim, Joanie Hayoun Chung, Yoonjin Oh, Kyungjae Lee, Sungbin Lim, Sungwoong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generation guide post-training, Reward models, video reward model, reasoning-based reward models, guide post-training
Comments:
Abstract:Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
65. 【2606.11837】LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation
Link: https://arxiv.org/abs/2606.11837
Authors: Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Open-vocabulary scene sketch, sparse line drawings, line drawings based, flexible category vocabularies, Open-vocabulary scene
Comments:
Abstract:Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by $+3.43$, $+8.01$, and $+15.74$ over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
66. 【2606.11805】TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization
Link: https://arxiv.org/abs/2606.11805
Authors: Zixiong Hao, Zhencun Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: physically plausible contact, mesh remains challenging, articulated hand shape, preserve language semantics, cross-view consistency
Comments: 11 pages, 8 figures, 3 tables
Abstract:Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
67. 【2606.11792】MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models
Link: https://arxiv.org/abs/2606.11792
Authors: Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Video Large Multimodal, Large Multimodal Models, Large Multimodal, achieved remarkable progress, Video Large
Comments: Preprint
Abstract:Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
68. 【2606.11783】A Comprehensive Ecosystem for Open-Domain Customized Video Generation
Link: https://arxiv.org/abs/2606.11783
Authors: Jingxu Zhang, Yuqian Hong, Daneul Kim, Kai Qiu, Qi Dai, Jianmin Bao, Yifan Yang, Xiaoyan Sun, Chong Luo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual synthesis capabilities, shown impressive visual, impressive visual synthesis, Recent progress, synthesis capabilities
Comments: 5 pages, 3 figures, 4 tables. Accepted by ICASSP 2026
Abstract:Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated identity, text, video triplets across 8,000+ categories. Leveraging this, we propose CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art. However, benchmarks such as DreamBooth cover only 100 classes, which is insufficient for real-world applications. To overcome this, we construct OpenCustom, a new benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. Extensive experiments confirm the advantages of both our dataset and model. We will open-source the entire ecosystem--including dataset, pipeline, benchmark, and implementations--to support further research.
69. 【2606.11782】Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting
Link: https://arxiv.org/abs/2606.11782
Authors: He-Bi Yang, Jing-Zhong Chen, Yen-Kuan Ho, Sang NguyenQuang, Fan-Yi Hsu, Yun-Yu Lee, Jui-Chiu Chiang, Wen-Hsiao Peng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieves impressive real-time, impressive real-time rendering, limitation heavily exacerbated, Gaussian Splatting, synthesize high-frequency textures
Comments: 18 pages, 9 figures
Abstract:While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.
70. 【2606.11779】Battery detection of XRay images using transfer learning
Link: https://arxiv.org/abs/2606.11779
Authors: Nermeen Abou Baker, David Rohrschneider, Uwe Handmann
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: detecting and sorting, drastically increasing, sorting batteries, cylindrical Lithium-Ion Batteries, transfer learning
Comments: Published at the European Symposium on Artificial Neural Networks (ESANN 2022)
Abstract:The need for detecting and sorting batteries is drastically increasing for many applications. This study proves the potential of transfer learning for predicting whether an image contains a battery, locating it, and identifying three types of batteries, namely prismatic, pouch, and cylindrical Lithium-Ion Batteries (LIB). In particular, it applies transfer learning in two steps: training a pre-trained YOLOv5m on a large-scale dataset to detect electronic devices, then using the resulting weights to detect and classify the batteries. The precision of battery detection reaches 94%, which outperforms the pretrained YOLOv5m weights by 5%, at 22 ms inference time.
71. 【2606.11751】AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory
Link: https://arxiv.org/abs/2606.11751
Authors: Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: iterative design, successive steps, essential for iterative, current models, models often struggle
Comments: Code: [this https URL](https://github.com/xuhang07/AnchorEdit)
Abstract:Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.
72. 【2606.11745】From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning
Link: https://arxiv.org/abs/2606.11745
Authors: Haoping Yu, Yuanxi Li, Jing Ma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: physical world, requiring identification, Visual causal reasoning, essential for understanding, understanding and intervening
Comments:
Abstract:Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench (F1: 33.4% → 75.1%).
73. 【2606.11740】UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA
Link: https://arxiv.org/abs/2606.11740
Authors: Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: medical VQA, interleaved textual reasoning, reasoning, input types, types are aligned
Comments:
Abstract:We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.
74. 【2606.11739】Multi-View In-Cabin Monitoring System for Public Transport Vehicles
Link: https://arxiv.org/abs/2606.11739
Authors: Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: German city bus, partly automated German, automated German city, rotating LiDAR covering, in-cabin monitoring dataset
Comments: Submitted to ICDM2026
Abstract:We introduce a multi-view in-cabin monitoring dataset for public transportation with synchronized RGB and depth images from four inward-facing cameras and a rotating LiDAR covering the vehicle interior of a digitalized and partly automated German city bus. The dataset contains 9,136 synchronized samples with annotations and is accompanied by a calibration and pseudo-labeling pipeline that generates 3D human pose estimates and oriented 3D bounding boxes for occupants. We further provide a nuScenes-format conversion and benchmark representative multi-view 3D detection models (e.g., Lift-Splat-Shoot and BEVFusion), supporting comparative evaluation and small-scale training of multi-view in-cabin perception models. The dataset and tools are available at this https URL.
75. 【2606.11719】Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning
Link: https://arxiv.org/abs/2606.11719
Authors: Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: multimodal large language, large language models, remains a persistent, persistent challenge, challenge for multimodal
Comments:
Abstract:Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
76. 【2606.11710】ERN-Net: Evolving Reason Node-Net for Document Binarization
Link: https://arxiv.org/abs/2606.11710
Authors: Hsin-Jui Pan, Sheng-Wei Chan, Jen-Shiung Chiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Evolving Reason Node-Net, efficient document image, paper presents ERN-Net, document image binarization, evolving reason nodes
Comments:
Abstract:This paper presents ERN-Net, an Evolving Reason Node-Net for efficient document image binarization. ERN-Net enhances degradation-sensitive regions, such as faint strokes, broken characters, and noisy backgrounds, through evolving reason nodes and multi-scale reasoning. We further compare ResNet-101, ConvNeXt-Tiny, and ConvNeXt-Base, and find that ConvNeXt-Tiny provides the best practical trade-off between accuracy and memory usage. In addition, DIBCO-based pretraining improves binarization performance without increasing model memory consumption, requiring only about 1.5 additional training hours. Experiments on DIBCO-style benchmarks show that ERN-Net is effective under low-data and low-memory settings.
77. 【2606.11702】MedCTA: A Benchmark for Clinical Tool Agents
Link: https://arxiv.org/abs/2606.11702
Authors: Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: clinically grounded decisions, make clinically grounded, evidence acquisition, make clinically, simple recognition
Comments: Project Page: [this https URL](https://ivul-kaust.github.io/MedCTA/) Code: [this https URL](https://github.com/IVUL-KAUST/MedCTA) Data: [this https URL](https://huggingface.co/datasets/IVUL-KAUST/MedCTA)
Abstract:To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at this https URL
78. 【2606.11689】RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval
Link: https://arxiv.org/abs/2606.11689
Authors: Jiale Huang, Zixu Li, Zhiheng Fu, Zhiwei Chen, Qinlei Huang, Yupeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Composed Image Retrieval, Composed Image, Image Retrieval, pivotal paradigm requiring, perform joint reasoning
Comments: Accepted by ICMR 2026
Abstract:Composed Image Retrieval (CIR) constitutes a pivotal paradigm requiring models to perform joint reasoning on reference images and modification texts. However, the prevalence of Noisy Triplet Correspondence (NTC) in large-scale datasets severely constrains model performance. Existing denoising methods either target binary mismatches or rely on scalar-based point-wise estimation, neglecting rich global structural correlations among sample populations and dynamic value variations during training, thereby yielding suboptimal results. This paper identifies two critical unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address these, we propose RankVR, a framework designed to construct a robust CIR model via global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which utilizes the Effective Rank of the Correlation Matrix to decouple clean samples from structural noise. By measuring rank difference, GSCP identifies samples disrupting macroscopic semantic symmetry. Furthermore, we develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples. By integrating training potential and reliability, it dynamically quantifies the semantic value of each triplet, ensuring effective utilization of hard samples while suppressing noise characterized by logical conflicts. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate that RankVR significantly outperforms existing state-of-the-art methods, validating its superior robustness in noisy environments.
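For readers unfamiliar with the effective rank mentioned above, a minimal sketch follows. It assumes the widely used entropy-based definition (the exponential of the Shannon entropy of the normalized spectrum) and takes the eigenvalues of a correlation matrix as input; whether RankVR uses exactly this definition is not stated in the abstract, and the name `effective_rank` is illustrative.

```python
import math


def effective_rank(spectrum):
    """Entropy-based effective rank of a non-negative spectrum.

    `spectrum` holds the eigenvalues (or singular values) of the matrix.
    Assumed definition: exp of the Shannon entropy of the normalized values.
    """
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must contain at least one positive value")
    # Zero entries contribute nothing to the entropy, so they are skipped.
    probs = [s / total for s in spectrum if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A flat spectrum gives an effective rank equal to the number of entries, while a spectrum dominated by a single value gives a result close to 1, which is what makes the quantity usable as a soft measure of how much structure a batch of correlations carries.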
79. 【2606.11687】DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace
Link: https://arxiv.org/abs/2606.11687
Authors: Marius Bayizere
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, defining security challenge, Swarm Intelligence Module
Comments: 23 pages, 6 figures, 11 tables. Code available at [this https URL](https://github.com/bayizeremarius/DroneShield-AI)
Abstract:Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, 3.2% false alarm rate, AUC-ROC: 0.981, and 142ms end-to-end latency on commodity CPU-class hardware at approximately $500-$780 USD total system cost. All code, model weights, and simulation datasets are publicly released at submission.
80. 【2606.11683】Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
Link: https://arxiv.org/abs/2606.11683
Authors: Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: camera trajectory, inherently challenging, Spatial reasoning, observable evidence, Reason Phase
Comments: ICML 2026
Abstract:Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: this https URL
81. 【2606.11682】Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning
Link: https://arxiv.org/abs/2606.11682
Authors: Jiaqi Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: multimodal learning aims, structured tabular attributes, improve predictive modeling, visual data, learning aims
Comments:
Abstract:Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.
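To make the idea of an adapter placed after a frozen encoder's embedding more concrete, here is a generic residual bottleneck adapter in plain Python. This is a textbook-style sketch, not TI-Adapter itself: the ReLU nonlinearity, the absence of biases, and the names `bottleneck_adapter`, `w_down`, and `w_up` are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def bottleneck_adapter(embedding, w_down, w_up):
    """Residual bottleneck adapter applied to a frozen encoder's embedding.

    w_down projects to a small hidden size, w_up projects back; only these
    two matrices would be trained while the encoder stays frozen.
    """
    hidden = [max(0.0, h) for h in matvec(w_down, embedding)]  # ReLU
    update = matvec(w_up, hidden)
    # The residual connection keeps the pretrained embedding as the default.
    return [e + u for e, u in zip(embedding, update)]
```

With `w_up` initialized to zeros the adapter is an identity map, a common choice because training then starts from the pretrained representation.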
82. 【2606.11670】ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation
Link: https://arxiv.org/abs/2606.11670
Authors: Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, Pengfei Wan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: expression shifts, scale variation, recognizable across motion, large viewpoint, Identity Mosaic Injection
Comments: 13 pages, 3 figures
Abstract:Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3*3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.
83. 【2606.11661】Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification
Link: https://arxiv.org/abs/2606.11661
Authors: Dong-Woo Kim, Tae-Kyun Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Clothes-changing person re-identification, Clothes-changing person, person re-identification, aims to recognize, recognize individuals
Comments: Accepted to the ICML 2026 Workshop on CoLoRAI
Abstract:Clothes-changing person re-identification (CC-ReID) aims to recognize individuals despite drastic appearance changes caused by clothing variation. While existing methods rely on adversarial learning to disentangle clothing features, we propose Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and extracts clothing-invariant representations via direct geometric constraints. A critical component is our transformer-based Basis Maker, which refines a shared, low-dimensional clothing prior into an instance-adaptive low-rank subspace through cross-attention with image patches, enabling robust clothing feature extraction even under varying visibility conditions. This instance-adaptive subspace is supervised via alignment with clothing text embeddings, while identity features are extracted via a learnable projection head and geometrically constrained to be strictly orthogonal to it. Extensive experiments demonstrate state-of-the-art performance on PRCC (+5.9% top-1), Celeb-reID-light (+3.5%), and LaST (+5.3%), with competitive results on LTCC.
84. 【2606.11645】Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition
Link: https://arxiv.org/abs/2606.11645
Authors: Jialin Liu, Xinwen He, Pengyu Liu, Jiale Shi, Huaijuan Zang, Yanbin Hao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: subtle body movements, analysis attracts increasing, attracts increasing attention, Micro-gesture analysis attracts, inferring spontaneous emotion
Comments: 13 pages, 2 figures
Abstract:Micro-gesture analysis attracts increasing attention for inferring spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, MGR emphasizes the localization and classification of actions, requiring the model to output the start time, end time, and category of each micro-gesture. Moreover, since micro-gestures are highly spontaneous, relying solely on a single modality makes it difficult to capture the complete and accurate multi-modal cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. In our model, both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation rather than using naive concatenation. Finally, these fused features are decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.
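The contrast drawn above between gated residual fusion and naive concatenation can be illustrated with a toy example. The sketch below uses a per-channel scalar gate over already-aligned RGB and skeleton features; the actual module in DyFADet+ is not specified in the abstract, so the gate parameterization and the names `gated_residual_fusion`, `w_rgb`, `w_skel`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_fusion(rgb, skeleton, w_rgb, w_skel, bias):
    """Inject skeleton features into RGB features through a learned gate.

    Each channel computes gate = sigmoid(w_rgb * r + w_skel * s + bias) and
    returns r + gate * s, so the RGB feature is always preserved and the
    skeleton contribution is scaled between 0 and 1.
    """
    fused = []
    for r, s in zip(rgb, skeleton):
        gate = sigmoid(w_rgb * r + w_skel * s + bias)
        fused.append(r + gate * s)
    return fused
```

Unlike concatenation, the output keeps the dimensionality of the RGB stream, and a strongly negative bias reduces the module to a pass-through of the RGB features.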
85. 【2606.11626】Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels
Link: https://arxiv.org/abs/2606.11626
Authors: Cheng Chen, Jingyu Zhou, Yifan Zhao, Jia Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision, remains a challenging, challenging task, task in computer, multi-label images remains
Comments:
Abstract:Understanding multi-label images remains a challenging task in computer vision. With the rapid progress of vision-language multimodal learning, vision-language models (VLMs) enable zero-shot recognition without labeled data. However, due to their intrinsic design, these models often prioritize the most iconic object and omit other contextual positives. This intrinsic bias conflicts with the nature of multi-label learning, thereby limiting their applicability. In this work, we propose an unsupervised framework that adapts VLMs from iconic recognition toward inclusive understanding, enabling label-free multi-label image recognition. Our approach consists of two key stages, "cutting" and "sewing": In the cutting stage, we present the multi-sampling response estimator to prevent the model from concentrating only on one single object. In the second sewing stage, the multi-object blend adaptation is introduced to adjust the labels to better conform to the multi-label distribution while preserving the intrinsic characteristics of the original model within only one epoch. Extensive experiments show that our framework significantly outperforms existing unsupervised approaches on four public datasets, even surpassing several representative weakly supervised baselines. These results demonstrate the potential of adapting pre-trained VLMs for more comprehensive visual understanding without manual annotations. Our code is publicly available at this https URL.
86. 【2606.11619】Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation
Link: https://arxiv.org/abs/2606.11619
Authors: Zongwu Xie,Yifan Yang,Yonglong Zhang,Guanghu Xie,Yang Liu,Shuo Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: estimation remains difficult, Vision sensors provide, spacecraft proximity operations, pose estimation remains, specular reflection
Comments: 11 pages, 7 figures
Abstract:Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation. The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.
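The abstract does not spell out the geometric recovery module, so the following is only a minimal sketch of the standard way a continuous 6D rotation representation is usually turned into a rotation: Gram-Schmidt orthonormalization of two 3D vectors plus a cross product. The function names, the choice to return the basis as three vectors, and the `depth_from_log_depth` helper are assumptions made for illustration and are not taken from the paper.

```python
import math


def _normalize(v):
    # Scale a 3D vector to unit length; a zero vector has no direction.
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(rep):
    """Map a 6D vector (two stacked 3D vectors) to an orthonormal basis.

    Returns [b1, b2, b3]. Treating these as the columns of the rotation
    matrix is an assumption of this sketch, not something the abstract states.
    """
    a1, a2 = list(rep[:3]), list(rep[3:])
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    # The third axis follows from the right-hand rule.
    b3 = _cross(b1, b2)
    return [b1, b2, b3]


def depth_from_log_depth(log_depth):
    # Hypothetical helper: invert a natural-log depth parameterization.
    return math.exp(log_depth)
```

No trigonometric functions or learned weights are needed, which is consistent with the module being described as parameter-free. The sketch raises `ValueError` when the two input vectors are parallel, since no unique rotation exists in that case.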
87. 【2606.11615】Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks
Link: https://arxiv.org/abs/2606.11615
Authors: Omid Ahmadieh,Nima Karimian
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: technologies raises, privacy concerns, exploited without consent, face recognition systems, widespread adoption
Comments:
Abstract:The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, diffusion-based makeup method DiffAIM by +3 points, and noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).
88. 【2606.11614】Information-Theoretic Decomposition for Multimodal Interaction Learning
Link: https://arxiv.org/abs/2606.11614
Authors: Zequn Yang,Yake Wei,Haotian Ni,Zhihao Xu,Di Hu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: collectively constitute multimodal, constitute multimodal interactions, Multimodal learning hinges, learning, Multimodal learning
Comments: Accepted to CVPR 2026
Abstract:Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at this https URL.
89. 【2606.11606】Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation
Link: https://arxiv.org/abs/2606.11606
Authors: Raajitha Muthyala,Zhenan Yin,Alekhya Jilla,Frank Li,Theo Dapamede,Bardia Khosravi,Mohammadreza Chavoshi,Judy Gichoya,Saptarshi Purkayastha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: foundation-model embeddings increasingly, embeddings increasingly serve, pretraining domains, forward pass, frozen forward pass
Comments:
Abstract:Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC = 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
90. 【2606.11602】On Aligning Hierarchical Standardized Embedding for Audio-visual Generalized Zero-shot Learning
Link: https://arxiv.org/abs/2606.11602
Authors: Zihan Zhang,Jie Hong,Siyuan Fan,Yanghao Zhou,Pengfei Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Audio-visual Generalized Zero-shot, Generalized Zero-shot Learning, Audio-visual Generalized, Generalized Zero-shot, audio and visual
Comments:
Abstract:Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. Also, aligning the audio-visual and textual features of most existing methods relies solely on the optimization objectives. However, those methods neglect the inherent distributional and structural differences between audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which enables hierarchical alignment of standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual and textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, demonstrate that AHSE achieves competitive performance in zero-shot learning.
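As a rough illustration of the Z-score standardization step mentioned in the abstract, the sketch below standardizes a single embedding vector to zero mean and unit variance. Whether the paper standardizes per vector, per feature dimension, or per batch is not stated, so the per-vector choice, the function name `zscore`, and the `eps` term are assumptions.

```python
import statistics


def zscore(vec, eps=1e-8):
    """Standardize one embedding to zero mean and (near) unit variance.

    Assumption: statistics are computed over the entries of a single vector.
    eps guards against division by zero for constant vectors.
    """
    mean = statistics.fmean(vec)
    std = statistics.pstdev(vec)
    return [(x - mean) / (std + eps) for x in vec]


# Two embeddings on very different scales, standing in for the fused
# audio-visual and textual embeddings (toy values, not real data).
audio_visual = [10.0, 20.0, 30.0, 40.0]
textual = [0.01, 0.02, 0.03, 0.04]

av_std = zscore(audio_visual)
txt_std = zscore(textual)
```

After standardization the two toy vectors share the same mean and spread, which is the kind of distributional mismatch the abstract says the step is meant to reduce before alignment.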
91. 【2606.11601】Spatially Coupled Phase-to-Depth Calibration for Fringe Projection Profilometry
Link: https://arxiv.org/abs/2606.11601
Authors: Sehoon Tak,Jae-Sang Hyun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fringe projection profilometry, projection profilometry, relation independently, fringe projection, commonly recovered
Comments:
Abstract:In fringe projection profilometry (FPP), depth is commonly recovered by fitting a phase-to-depth relation independently at each camera pixel. Although such pixel-wise calibration achieves high local accuracy, neighboring pixels can acquire markedly different calibration functions even when they observe the same smooth surface, producing spatially inconsistent geometry and structured surface artifacts. We propose a spatially coupled phase-depth transformation in which all pixels share a single low-dimensional mapping (global phase scalars combined with affine spatial terms on the undistorted reference-camera grid) rather than independent per-pixel fits, optionally augmented by a bounded, spatially smooth correction field. We further introduce a native-grid pairing scheme that constructs phase-depth calibration pairs directly on the reference-camera grid: when depth supervision comes from a rectified active-stereo pipeline, planes are fitted in stereo 3D and sampled back onto the camera grid along native rays, so the phase maps are never rectified. On a dental target with high-resolution scanner ground truth, the proposed model attains point-to-surface RMSE comparable to an active-stereo reference (about 12 µm aggregate) while substantially improving spatial coherence over pixel-wise polynomial and rational calibration, and reduces the runtime mapping to a few element-wise operations per pixel with negligible parameter storage.
92. 【2606.11578】Contactless 3D Human Body Measurement Using Depth Cameras for Smart Health Monitoring
Link: https://arxiv.org/abs/2606.11578
Authors: Martha Asare,Xuan Wang,Juan Lopez Alvarenga,Lois Akosua Serwaa,Jinghao Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remote patient assessment, Contactless body measurement, Contactless body, patient assessment, increasingly significant
Comments: 6 pages, 4 figures. Depth camera-based framework for contactless anthropometric measurement and geometric analysis using 3D point clouds
Abstract:Contactless body measurement technologies are becoming increasingly significant for smart health monitoring, digital health applications, and remote patient assessment. Traditional anthropometric measurements typically necessitate physical contact and trained personnel, which may constrain scalability in remote healthcare settings. In this study, we introduce a depth camera-based framework for estimating human body measurements utilizing 3D point cloud data. An Orbbec Astra 2 depth camera was employed to capture RGB images, depth maps, and 3D point clouds of participants. The captured point cloud was processed using Python-based tools, including Open3D, NumPy, and OpenCV, to segment the human body from the background. Key anthropometric measurements, such as height and arm span, were computed. The measurements were obtained through a combination of spatial filtering and landmark selection on the 3D point cloud, followed by the projection of the computed measurements onto the corresponding RGB image using camera intrinsic parameters. In addition to linear measurements, the approximate body volume and visible surface area were estimated using voxel-based occupancy analysis and mesh-based surface reconstruction methods. The experimental results from a single depth capture demonstrated that accurate body measurements and geometric estimates could be obtained from depth camera data without physical contact. This study provides a foundation for future real-time systems that integrate depth sensing with intelligent health monitoring and generative AI models for smart healthcare applications.
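The voxel-based occupancy estimate mentioned in the abstract can be sketched in a few lines. The paper reports using Open3D, NumPy, and OpenCV; the version below deliberately swaps in plain Python so it runs without dependencies, and the function name `voxel_volume` and the toy points are hypothetical.

```python
import math


def voxel_volume(points, voxel_size):
    """Approximate volume as (number of occupied voxels) * (voxel volume).

    points: iterable of (x, y, z) coordinates in metres.
    voxel_size: edge length of a cubic voxel in metres.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    # Each point falls into exactly one voxel; the set removes duplicates.
    occupied = {
        tuple(math.floor(c / voxel_size) for c in point) for point in points
    }
    return len(occupied) * voxel_size ** 3


# Toy cloud: the first two points share a voxel, the third is in another one.
cloud = [(0.01, 0.01, 0.01), (0.02, 0.02, 0.02), (0.15, 0.01, 0.01)]
volume = voxel_volume(cloud, voxel_size=0.1)
```

A single depth capture only sees the visible surface, so counting occupied voxels this way measures a surface shell unless the interior is filled first. The estimate also depends strongly on the chosen voxel size.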
93. 【2606.11576】AVIS: Adaptive Test-Time Scaling for Vision-Language Models
Link: https://arxiv.org/abs/2606.11576
Authors: Ahmadreza Jeddi,Minh Ngoc Le,Amirhossein Kazerouni,Hakki Can Karaimer,Hue Nguyen,Iqbal Mohomed,Michael Brudno,Alex Levinshtein,Konstantinos G. Derpanis,Babak Taati,Radek Grzeszczuk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Modern Vision-Language Models, long decoding chains, Visual Context Scaling, large visual contexts, Modern Vision-Language
Comments: Project page: [this https URL](https://avis-vlm.github.io/)
Abstract:Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy-compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
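To make the adaptive self-consistency idea concrete, here is a minimal sketch: a difficulty score selects how many rollouts to sample, and the final answer is the majority vote. The learned difficulty predictor is replaced by a plain number in [0, 1], and the linear mapping to a rollout count, the bounds `k_min` and `k_max`, and all function names are assumptions rather than details from the paper.

```python
from collections import Counter


def choose_num_rollouts(difficulty, k_min=1, k_max=8):
    """Map a difficulty score in [0, 1] to a rollout count (assumed linear)."""
    clamped = min(max(difficulty, 0.0), 1.0)
    return k_min + round(clamped * (k_max - k_min))


def majority_vote(answers):
    """Return the most frequent answer among the sampled rollouts."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def adaptive_self_consistency(sample_answer, difficulty, k_min=1, k_max=8):
    """Sample k rollouts, with k chosen from difficulty, then vote.

    sample_answer: callable taking a rollout index and returning an answer;
    it stands in for one decoding pass of the model.
    """
    k = choose_num_rollouts(difficulty, k_min, k_max)
    answers = [sample_answer(i) for i in range(k)]
    return majority_vote(answers), k
```

Easy queries then cost a single rollout while hard ones receive the full budget. In a shared-prefill setup, as the abstract describes, the extra rollouts would reuse one prefilling pass rather than re-encoding the input.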
94. 【2606.11573】Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception
Link: https://arxiv.org/abs/2606.11573
Authors: Xin Qiu,Wenjie Liu,Fuyuan Ai,YuChen Tan,Zhiwei Xu,Chunyi Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Radar-camera BEV perception, internal fused representations, sensor configurations, evaluated across datasets, perception often suffers
Comments:
Abstract:Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies this issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples. We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes. The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements over multiple BEV fusion backbones, and the gains remain effective when a small amount of target-domain data is available.
95. 【2606.11572】FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection
Link: https://arxiv.org/abs/2606.11572
Authors: Keval Thaker,Venkatraman Narayanan,Abdalmalek Aburaddaha,Samir A. Rawashdeh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large-scale RGB foundation, RGB foundation models, remains challenging due, image formation physics, large-scale RGB
Comments:
Abstract:Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging due to fundamental differences in image formation physics. We investigate the spectral structure of the RGB-IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band's cross-modal consistency. The method employs strict mean squared error (MSE) on the low-frequency band to preserve shared structural information and a relaxed log-MSE loss (weighted at 0.1) on the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4x on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, improving 2.4 points over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: this https URL
96. 【2606.11568】4DP-QA: Scalable QA for 4D Perception in Vision Language Models
Link: https://arxiv.org/abs/2606.11568
Authors: Seokju Cho,Abhishek Badki,Hang Su,Jindong Jiang,Ziyao Zeng,Seungryong Kim,Sifei Liu,Orazio Gallo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Language Models, Vision Language, recent advances, Language Models, struggle to grasp
Comments: Project page: [this https URL](https://research.nvidia.com/labs/lpr/4dpqa)
Abstract:Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
97. 【2606.11563】Cross-Modal Benchmarking for Robotic Perception in Natural Environments
Link: https://arxiv.org/abs/2606.11563
Authors: David Hall,Joshua Knights,Mark Cox,Peyman Moghadam
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: robotics perception systems, Natural environments present, present a complex, complex challenge, Natural environments
Comments: Accepted to the IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026
Abstract:Natural environments present a complex challenge to robotics perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments leading to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the benchmark results from the recent WildCross benchmark, with particular emphasis on expanded metric depth estimation experiments. Access to the code repository and dataset for this work can be found at https://csiro-robotics.github.io/WildCross.
98. 【2606.11546】VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection
Link: https://arxiv.org/abs/2606.11546
Authors: Hao Zhang,Qinran Lin,Linqi Song,Yong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide rich semantic, open-vocabulary object detection, provide rich, CLIP vision-language knowledge, CLIP visual knowledge
Comments:
Abstract:Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
99. 【2606.11529】XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer
Link: https://arxiv.org/abs/2606.11529
Authors: Steve Rhyner,Sankeerth Durvasula,Aleksandr Kovalev,Hansel Jia,Adrian Zhao,Mrutunjayya Mrutunjayya,Nilesh Ahuja,Selvakumar Panneer,Christina Giannoula,Nandita Vijaykumar
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Keywords: written backward passes, requires extensive low-level, manually written backward, rendering underpins modern, Point-based differentiable rendering
Comments:
Abstract:Point-based differentiable rendering underpins modern 3D reconstruction, novel-view synthesis, and learning-based graphics pipelines, but developing new rendering methods often requires extensive low-level implementation, hardware-specific kernels, and manually written backward passes. This limits rapid prototyping, reproducibility, exploration, and deployment, especially across diverse hardware platforms. This paper presents XPR, an extensible cross-platform framework for point-based differentiable rendering. XPR introduces a high-level programming interface that separates method-specific logic from the shared rendering pipeline, allowing users to implement new methods in a few lines of code. Its pipeline decomposes rendering into modular, statically shaped parallel operations that can be lowered by a cross-platform compiler to GPUs, TPUs, CPUs, and other ML accelerators. We demonstrate implementations of 3DGS, 3DGUT, and LinPrim, each with only a few hundred lines of Python code, and each of which can be compiled to a range of hardware platforms with the XLA compiler. These results show that XPR enables fast experimentation and portable execution for emerging point-based differentiable rendering systems.
100. 【2606.11507】SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining
Link: https://arxiv.org/abs/2606.11507
Authors: Abdalmalek Aburaddaha,Venkatraman Narayanan,Keval Thaker,Samir A. Rawashdeh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: semantic rarity suffices, trajectory ambiguity, difficulty labels, driving logs, logs is bottlenecked
Comments:
Abstract:Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: this https URL
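The reason zero-initialization preserves existing heads can be shown with a toy residual branch: if the new sub-module's output projection starts at zero, the shared stream passes through unchanged at the start of training. This is only a sketch under stated assumptions; the residual form, the ReLU, the placement of the zero-initialized matrix, and all names are hypothetical and not taken from the paper.

```python
def linear(weights, x):
    # Plain matrix-vector product; weights is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def zero_init(rows, cols):
    # Weight matrix of zeros, used for the new sub-module's output projection.
    return [[0.0] * cols for _ in range(rows)]


def residual_branch(x, w_in, w_out):
    """Add a new sub-module's update to the shared stream x.

    Assumed form: x + w_out @ relu(w_in @ x). With w_out at zero the
    update vanishes, so downstream heads see exactly the same input.
    """
    hidden = [max(0.0, h) for h in linear(w_in, x)]
    update = linear(w_out, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


shared_stream = [0.5, -1.0, 2.0]
w_in = [[0.3, -0.2, 0.1], [1.0, 0.5, -0.7]]  # arbitrary toy weights
w_out = zero_init(rows=3, cols=2)

after_new_module = residual_branch(shared_stream, w_in, w_out)
```

Here `after_new_module` equals `shared_stream` exactly, so any frozen head reading the stream produces the same output as before the branch was added. Once `w_out` is trained away from zero, the stream does shift, so zero-initialization alone only guarantees identity at the starting point.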
101. 【2606.11505】On the Study of Biometric Spoofing Detection using Deep Learning
Link: https://arxiv.org/abs/2606.11505
Authors: Kumar Kartikey,Nikos Komninos
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: gain unauthorized access, attackers exploit counterfeit, Spoof Trace Disentanglement, exploit counterfeit biometric, counterfeit biometric data
Comments:
Abstract:Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of four state-of-the-art machine learning models, namely MobileNetV2, DenseNet-121, Inception-v3, and Spoof Trace Disentanglement (STD), in detecting spoofing attacks within facial recognition systems. Using the CelebA-Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 Score. Cross-dataset validation is carried out on the MSU-MFSD dataset to assess generalizability. The results show MobileNetV2 as the most efficient model, achieving 92% accuracy while balancing computational effectiveness, making it appropriate for real-life applications. Inception-v3 shows moderate robustness, while DenseNet-121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.
102. 【2606.11477】Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models
Link: https://arxiv.org/abs/2606.11477
Authors: Hartwig Grabowski
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Correcting handwritten exams, closed question formats, Correcting handwritten, digital exams tend, time-consuming and error-prone
Comments: 11 pages, 2 figures, 3 tables
Abstract:Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%-91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
103. 【2606.11466】PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation
Link: https://arxiv.org/abs/2606.11466
Authors: Nhut Le,Maryam Rahnemoonfar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: semantic segmentation requires, segmentation requires architectures, fine-grained local geometry, cloud semantic segmentation, global scene structure
Comments:
Abstract:Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).
104. 【2606.11450】Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition
Link: https://arxiv.org/abs/2606.11450
Authors: Shengkai Sun,Zhiyong Cheng,Zefan Zhang,Jianfeng Dong,Zhihui Li,Meng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: driving significant progress, action representation learners, strong action representation, self-supervised skeleton-based action, representation learners
Comments: Accepted by CVPR2026. The code is available at [this https URL](https://github.com/AshenOne1005/AMR)
Abstract:Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.
105. 【2606.11446】3D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling
Link: https://arxiv.org/abs/2606.11446
Authors: Ahmad Al-Kabbany
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: incorporating Concept Bottleneck, Concept Bottleneck Models, deep geometric learning, Bottleneck Models, semantic gap
Comments:
Abstract:This research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent 'semantic gap' in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored. We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework's efficacy, achieving a concept prediction accuracy of 88.8% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically-steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.
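For readers unfamiliar with the Chamfer Distance reported above, the following is a small sketch of one common variant: symmetric, using squared Euclidean distances averaged over each point set. Which variant the paper uses is not stated in the abstract, so this convention and the name `chamfer_distance` are assumptions.

```python
from typing import List, Sequence

Point = Sequence[float]


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbour terms."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: List[Point], dst: List[Point]) -> float:
        # For every source point, find the squared distance to its nearest
        # neighbour in the destination set, then average over the source set.
        total = 0.0
        for s in src:
            total += min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
        return total / len(src)

    return one_way(a, b) + one_way(b, a)


# Two hypothetical point clouds that differ in one point.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(cloud_a, cloud_b)
```

This brute-force version is quadratic in the number of points; practical implementations usually rely on a spatial index or a GPU kernel.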
106. 【2606.11390】A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting
Link: https://arxiv.org/abs/2606.11390
Authors: Matthew Cong,Francis Williams,Jonathan Swartz,Mark Harris,Sanja Fidler,Ken Museth
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: Gaussian splatting methods, real world, increasingly popular, popular for neural, Gaussian splatting
Comments: 14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material
Abstract:Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.
107. 【2606.11385】DeceptionX: Explainable Deception Detection with Multimodal Large Language Models
Link: https://arxiv.org/abs/2606.11385
Authors: Jiayu Zhang,Shuo Ye,Jiajian Huang,Yawen Cui,Taorui Wang,Wei Xia,Zeheng Wang,Haowen Tang,Hui Ma,Zitong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: highly challenging task, behavioral analysis, highly challenging, affective computing, computing and behavioral
Comments:
Abstract:Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination (DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.
108. 【2606.11381】From Simulation to Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting
Link: https://arxiv.org/abs/2606.11381
Authors: Woojung Son(1),Won Suk Lee(1),Zijing Huang(1),Daeun Choi(1),Catia Silva(2),Yu She(3),Yan Gu(4) ((1) Department of Agricultural and Biological Engineering, University of Florida, (2) Department of Electrical and Computer Engineering, University of Florida, (3) Edwardson School of Industrial Engineering, Purdue University, (4) School of Mechanical Engineering, Purdue University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Robotic strawberry harvesting, harvesting requires precise, strawberry harvesting requires, Robotic strawberry, pose ground truth
Comments: 7 pages, 6 figures, 1 table
Abstract:Robotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing 6D pose estimation methods have therefore relied solely on synthetic data that lacks scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Nevertheless, our experiments reveal that a significant sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work. The real-world dataset will be made available upon acceptance.
109. 【2606.11363】NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization
Link: https://arxiv.org/abs/2606.11363
Authors: Hao Lu,Yongxin Guo,Onur Koyun,Zhengjie Zhu,Abbas Alili,Metin N. Gurcan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generative modeling pipelines, modern generative modeling, modeling pipelines, central to modern, modern generative
Comments:
Abstract:Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
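The failure mode described above can be illustrated with a toy example. The sketch below is not the NSVQ method itself: it only shows nearest-code assignment and how a shift in the latents (standing in for encoder drift) lowers codebook utilization and raises quantization error when the codebook stays fixed. The data and the names `nearest_code` and `quantize_batch` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(z: Vector, codebook: List[Vector]) -> Tuple[int, float]:
    """Return the index of the closest code and the squared distance to it."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(z, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


def quantize_batch(latents: List[Vector], codebook: List[Vector]) -> Tuple[float, float]:
    """Return (codebook utilization, mean squared quantization error) for a batch."""
    used = set()
    total_error = 0.0
    for z in latents:
        idx, dist = nearest_code(z, codebook)
        used.add(idx)
        total_error += dist
    return len(used) / len(codebook), total_error / len(latents)


# A fixed toy codebook and two batches of latents: before and after a drift.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
latents_before = [(0.1, 0.0), (0.9, 1.0), (0.0, 0.2)]
latents_after = [(4.1, 4.0), (4.9, 5.0), (4.0, 4.2)]

util_before, err_before = quantize_batch(latents_before, codebook)
util_after, err_after = quantize_batch(latents_after, codebook)
```

In this toy setup every drifted latent falls onto the same code, so the other codes receive no assignments, which mirrors the "lose assignments" step of the feedback loop.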
110. 【2606.11326】DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax
Link: https://arxiv.org/abs/2606.11326
Authors: Minseong Kweon,Wenyuan Zhao,Nuo Chen,Lulin Liu,Huiwen Han,Zihao Zhu,Srinivas Shakkottai,Chao Tian,Zhiwen Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent feed-forward, reconstruction methods, flexibility in efficient, demonstrated strong performance, methods have demonstrated
Comments: Project Page: [this https URL](https://darkvggt.github.io)
Abstract:Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.
111. 【2606.11320】Semantic Segmentation of Node and Edge Diagrams for Assistive Technology
Link: https://arxiv.org/abs/2606.11320
Authors: Michael Cormier,Yichun Zhao,Laura Paul,Cameron Swift,Duc Tri Dang,Miguel Nacenta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: set of related, semantic segmentation, node-link diagrams, diagrams, related models
Comments: 8 pages, 6 figures, 1 table. In Proceedings of the 23rd Conference on Robots and Vision (2026)
Abstract:In this paper, we present a novel set of related models for semantic segmentation of node-link diagrams. These diagrams are frequently used to represent mathematical graphs, relationships between concepts, and flowcharts. Such diagrams are difficult to access non-visually; while some assistive interfaces have been designed for node-link diagrams, they rely upon a machine-readable representation of the diagram, whereas such diagrams will generally be made available as bitmap images. Our compact deep learning models show excellent quantitative and qualitative performance on a large synthetic dataset of node-link diagrams, reaching per-pixel accuracy over 93%.
112. 【2606.11314】TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions
Link: https://arxiv.org/abs/2606.11314
Authors: Or Perel,Hassan Abu Alhaija,Zian Wang,Jacob Munkberg,Matan Atzmon,Sanja Fidler,Masha Shugrina
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: dynamic object motion, Gaussian ray tracing, object motion, object insertion, framework that combines
Comments: Project page: [this https URL](https://research.nvidia.com/labs/sil/projects/tron/)
Abstract:We introduce TRON, a rendering framework that combines 3D Gaussian ray tracing with neural rendering to enable realistic and controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing. Prior approaches that rely solely on physically based rendering (PBR) of Gaussian representations struggle to achieve realistic relighting due to imperfections in reconstructed geometry, material estimates, and light transport estimation. At the same time, neural rendering methods often lack an explicit scene representation, limiting their ability to support interactive editing with fine-grained manipulation. TRON bridges these two paradigms. We use intrinsic decomposition priors from a learned inverse rendering model to regularize the material properties of a Gaussian field, and repurpose a ray tracer to provide radiometric guidance rather than final pixels. By treating this output as a structured 3D scaffold, we empower a lightweight neural renderer to bridge the domain gap between shading-model constrained estimates and photorealistic output. Our key insight is that the combination of explicit 3D knowledge with robust material priors provides speed and controllability, while neural rendering enables the synthesis of photorealistic images. To support real-world scenarios, we train our neural renderer with a multi-stage strategy consisting of large-scale pretraining and targeted fine-tuning on a newly constructed dataset of 2.1M rendered synthetic and real-world frames from 3D reconstructions. TRON outperforms Gaussian-based relighting methods in realism, and prior neural renderers in editability and speed. To the best of our knowledge, TRON is the first method to enable practical interactive applications in captured 3D environments, offering realistic appearance under dynamic geometric, lighting and material conditions.
113. 【2606.11289】i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models
Link: https://arxiv.org/abs/2606.11289
Authors: Boya Zeng,Tianze Luo,Shu Pu,Jucheng Shen,Taiming Lu,Gabriel Sarch,Zhuang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: consistently driven progress, consistently driven, models, driven progress, Diffusion
Comments: Project page at [this https URL](https://zlab-princeton.github.io/i1)
Abstract:Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at this https URL.
114. 【2606.11285】EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing
Link: https://arxiv.org/abs/2606.11285
Authors: Zhiting Zhou,Xingchen Liu,Xinglin Yu,Jiashen Chen,Haoyang Wang,Jingao Xu,Yunhao Liu,Xinlei Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unauthorized unmanned aerial, unmanned aerial vehicle, monitoring increasingly important, made protected-airspace monitoring, protected-airspace monitoring increasingly
Comments:
Abstract:Unauthorized unmanned aerial vehicle (UAV) activity around airports, public venues, and other sensitive sites has made protected-airspace monitoring increasingly important. A practical sensing system must search a wide angular region, find small long-range targets, and return both bearing support and UAV-specific evidence before a restricted perimeter is breached. Existing UAV detection paths often rely on spatially organized evidence, such as body extent, silhouette, or track continuity. At long range, however, these cues become difficult to preserve and verify as the target footprint weakens and its image-plane support shrinks. EventRadar follows a complementary cue: propeller-induced temporal periodicity, which recent event-camera sensing studies have shown can reveal UAV-specific motion after appearance becomes weak. We extend this cue to kilometer-scale active sensing with an event-camera prototype. Scene-Anchored Geometry Evidence (SAGE) fuses scanning events with IMU pose to maintain a bearing-indexed scene memory, separating transient candidate support from persistent background clutter. Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) then treats each candidate as a weak high-rate timing signal and recovers phase-insensitive harmonic evidence with fixed compute. Compared with related event-camera baselines on 700-1500 m UAV event recordings, EventRadar achieves 0.990 mAP$_{.3}$ and 0.949 F1$_{.3}$, reduces FN$_{.3}$ to 0.009, and shows real-time feasibility in prototype profiling.
115. 【2606.11269】Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment
Link: https://arxiv.org/abs/2606.11269
Authors: Jia Li,Qian Chen,Wei Wang,Xinyu Li,Zhenzhen Hu,Dongsheng Shao,Richang Hong,Meng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: infer stable personality, stable personality traits, Personality assessment aims, behaviors across language, facial cues
Comments:
Abstract:Personality assessment aims to infer stable personality traits from dynamic behaviors across language, voice, and facial cues. Since different personality dimensions are revealed through distinct behavioral perspectives, modeling trait-specific evidence is challenging. However, most existing approaches adopt a uniform multimodal fusion strategy across all dimensions, assuming identical modality contributions. This overlooks trait-specific modality preferences and introduces cross-modal interference. To address this issue, we propose a novel personality assessment framework called Traits Run Deeper, which consists of three components. Specifically, the Multimodal Foundation Representation (MFR) module constructs personality-oriented multimodal inputs and leverages psychology-informed semantic templates as anchors, enabling foundation models to capture trait-relevant information. Building upon MFR, the Trait-Specific Modality Fusion (TSMF) module acts as an asymmetric fusion mechanism, allowing each dimension to selectively exploit different modality pathways from modality-specific modeling to complementary fusion. Thus, TSMF captures heterogeneous modality preferences while reducing cross-modal contamination. Furthermore, the Distribution-Calibrated Personality Regression (DCPR) module mitigates label imbalance and central tendency bias through target distribution calibration, improving robustness and stability. Experimental results on the AVI Challenge 2026 validation set demonstrate the effectiveness of the proposed framework, reducing mean squared error (MSE) by approximately 25% compared with the baseline. Consistent improvements are observed on the official test set, where our method achieves the best performance and ranks first in the Personality Assessment Track. The source code will be made available at this https URL.
116. 【2606.11236】A2SG: Adaptive and Asymmetric Surrogate Gradients for Training Deep Spiking Neural Networks
Link: https://arxiv.org/abs/2606.11236
Authors: Yechan Kang,Yongjin Kweon,Mingyeong Seo,Sohee Park,Yeonguk Jeon,Jongkil Park,Hyun Jae Jang,Jaewook Kim,YeonJoo Jeong,Suyoun Lee,Seongsik Park
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: spiking neural networks, remains challenging due, temporal inconsistency caused, deep spiking neural, neural networks
Comments: Accepted at ICML 2026
Abstract:Training deep spiking neural networks (SNNs) remains challenging due to sharp loss landscapes and temporal inconsistency caused by surrogate gradients. To address these challenges, we propose a unified framework: adaptive and asymmetric surrogate gradients (A2SG). The adaptive gradients adjust an effective window for spatio-temporal adaptation, reducing spatial gradient variation and maintaining directional consistency of gradients over time. The asymmetric gradients reflect neuronal dynamics by assigning larger gradients to neurons with higher membrane potentials, and we prove that they yield lower variation than symmetric surrogates. Our analysis further establishes a direct connection between local gradient variation and the curvature of the loss landscape, providing a principled explanation for how A2SG promotes convergence to flatter minima and improves generalization. We conduct extensive experiments on diverse models, including CNN-based and Transformer-based SNNs, across various tasks such as image classification using both static and neuromorphic datasets, as well as segmentation. The results demonstrate that A2SG consistently improves accuracy and energy efficiency, establishing it as a general and reliable solution for training deep SNNs. Our code is available at this https URL.
117. 【2606.11233】OSCS-SupCon: Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning for Robust Feature Disentanglement
Link: https://arxiv.org/abs/2606.11233
Authors: Bin Wang,Fadi Dornaika
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved strong performance, Supervised Contrastive Learning, explicitly modeling pairwise, modeling pairwise relationships, Supervised Contrastive
Comments:
Abstract:Supervised Contrastive Learning (SupCon) has achieved strong performance by explicitly modeling pairwise relationships among samples. However, existing SupCon-based methods suffer from two key limitations: negative-sample dilution induced by the standard InfoNCE loss, and feature-space entanglement caused by the lack of explicit constraints separating category-relevant (common) and category-irrelevant (style) features. These limitations reduce feature discriminability and generalization ability. To address these issues, we propose OSCS-SupCon (Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning), a unified framework that combines a sigmoid-based pairwise contrastive objective with explicit orthogonality constraints. Specifically, we introduce a sigmoid-based contrastive loss with two learnable parameters, temperature and bias, which adaptively modulate pairwise decision boundaries and alleviate negative-sample dilution. Furthermore, we enforce orthogonality between common and style feature subspaces via a linear projection with ReLU nonlinearity, thereby reducing feature overlap and improving disentanglement of style-irrelevant representations. Extensive experiments on six benchmark datasets demonstrate that OSCS-SupCon consistently outperforms state-of-the-art supervised contrastive learning methods across multiple backbone architectures. In particular, on the fine-grained CUB200-2011 dataset with a ResNet-18 backbone, the proposed method achieves a 3.4% improvement in classification accuracy over CS-SupCon, highlighting its robustness and generalization capability. Ablation studies further confirm the effectiveness of each component.
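The two ingredients named in the abstract, a sigmoid-based pairwise loss with a temperature and a bias, and an orthogonality constraint between common and style features, can be sketched roughly as follows. This is an illustration under assumptions, not the authors' formulation: the loss follows the familiar pairwise sigmoid form `-log sigmoid(sign * (t * sim + b))`, the penalty is a squared cosine between each sample's common and style vectors, and the parameters are plain floats rather than learned values.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(features: List[Vector], labels: List[int],
                          temperature: float, bias: float) -> float:
    """Mean of -log sigmoid(sign * (temperature * cosine + bias)) over ordered pairs."""
    feats = [normalize(f) for f in features]
    total, count = 0.0, 0
    for i in range(len(feats)):
        for j in range(len(feats)):
            if i == j:
                continue
            # +1 for same-class pairs, -1 for different-class pairs.
            sign = 1.0 if labels[i] == labels[j] else -1.0
            logit = temperature * dot(feats[i], feats[j]) + bias
            total += softplus(-sign * logit)
            count += 1
    return total / count


def orthogonality_penalty(common: List[Vector], style: List[Vector]) -> float:
    """Mean squared cosine between each sample's common and style features."""
    terms = [dot(normalize(c), normalize(s)) ** 2 for c, s in zip(common, style)]
    return sum(terms) / len(terms)


# Hypothetical 2D features for four samples from two classes.
feats = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
loss_good = sigmoid_pairwise_loss(feats, [0, 0, 1, 1], temperature=10.0, bias=-5.0)
loss_bad = sigmoid_pairwise_loss(feats, [0, 1, 0, 1], temperature=10.0, bias=-5.0)
```

Because each pair contributes an independent sigmoid term, a pair's loss does not depend on how many other negatives are in the batch, which is the intuition behind avoiding the dilution that the abstract attributes to the softmax-normalized InfoNCE loss.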
118. 【2606.11231】CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection
Link: https://arxiv.org/abs/2606.11231
Authors: Suhang Li,Osamu Yoshie,Yuya Ieiri
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-language reinforcement learning, Vision-language reinforcement, COD, recently shown strong, reinforcement learning
Comments: 10 pages, 7 figures, 5 tables. Code and data: [this https URL](https://github.com/suhang2000/CFCamo)
Abstract:Vision-language reinforcement learning has recently shown strong target-present localization for camouflaged object detection (COD). Yet localization is only one side of the decision: when the agent faces an ordinary image with no camouflaged target, will it still claim that a camouflaged object exists? Standard COD training and evaluation data are positive-only, so agents optimized under this setting can acquire an over-detect bias, a task-specific form of object hallucination that standard COD evaluation leaves unmeasured. To quantify this target-absent behavior, we construct Counterfactual COD (CF-COD), a paired benchmark that removes the camouflaged target from each held-out COD evaluation image while preserving a plausible background. CF-COD evaluates whether a model detects the target on the original image and abstains on the target-absent counterfactual, summarized by Pair Accuracy (PA). We further introduce CFCamo, a paired counterfactual framework for COD with abstention. For training, CFCamo optimizes a Qwen3-VL-4B-Instruct agent with Counterfactual Sequence Policy Optimization (CSPO), which samples paired original-counterfactual rollouts and uses a Counterfactual Paired Reward (CPR) to couple original-image detection with counterfactual abstention. On CAMO-test, CFCamo improves S_alpha by +3.7 pp over the prior RL-based COD baseline; across CF-COD, it reaches 80.0-90.8% PA. Ablations show that removing counterfactual coupling reduces PA to 1.4-5.2% despite strong target-present COD scores, showing that target-present evaluation alone does not characterize detect-or-abstain behavior. Overall, these results indicate that CFCamo improves COD agents by coupling target-present detection with target-absent abstention, rather than merely strengthening target-present localization. Code and data are available at this https URL.
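The Pair Accuracy idea can be sketched as a simple count over image pairs. The definition below is an assumption drawn from the abstract's wording: a pair counts as correct only if the model detects the target on the original image and abstains on its counterfactual. The paper may additionally apply a localization-quality threshold to decide what counts as a detection.

```python
from typing import Iterable, Tuple


def pair_accuracy(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of pairs that are both detected and correctly abstained.

    Each pair is (detected_on_original, abstained_on_counterfactual).
    """
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("need at least one pair")
    correct = sum(1 for detected, abstained in pair_list if detected and abstained)
    return correct / len(pair_list)


# Hypothetical outcomes for four original/counterfactual image pairs.
outcomes = [(True, True), (True, False), (False, True), (True, True)]
pa = pair_accuracy(outcomes)

# A model that always claims a target exists detects everything but never abstains.
always_detect = [(True, False)] * 4
pa_always_detect = pair_accuracy(always_detect)
```

The second case shows why target-present scores alone can hide an over-detect bias: the always-detect model is perfect on the original images yet scores zero on the paired metric.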
119. 【2606.11221】LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment
Link: https://arxiv.org/abs/2606.11221
Authors: Huaihai Lyu,Chaofan Chen,Yuheng Ji,Xiansheng Chen,Pengwei Wang,Shanghang Zhang,Changsheng Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: action representations compatible, Lie-algebraic Action Space, Action Space Tokenizer, relational geometry, semantic geometry
Comments:
Abstract:We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this alignment is non-trivial due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, whereas the physical manifold of robotic action is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation: (1) Global Topological Linearization: linearizing the action manifold via Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation. (2) Local Metric Discretization: hierarchically discretizing the representation into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric. By resolving the structural mismatch at both global and local levels, LAST enables VLA models with superior convergence and generalizability.
120. 【2606.11200】Detecting AI-Generated Content on Social Media with Multi-modal Language Models
Link: https://arxiv.org/abs/2606.11200
Authors: Chenyang Yang,Shen Yan,Yibo Yang,Litao Hu,Yuchen Liu,Yuan Zeng,Hanchao Yu,Yinan Zhu,Sumedha Singla,Brian Vanover,Huijun Qian,Zihao Wang,Fujun Liu,Aashu Singh,Jianyu Wang,Xuewen Zhang
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: social media, enabled the creation, creation of photorealistic, photorealistic images, images and videos
Comments:
Abstract:Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.
121. 【2606.11287】Intelligent Skin Cancer Detection Using a Multispectral Metasurface and a Hybrid
Link: https://arxiv.org/abs/2606.11287
Authors: Afsane Saee Arezoomand
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Convolutional Neural Networks, hybrid deep learning, skin lesions Simulation-based, hybrid CNN ViT, Simulation-based evaluations demonstrate
Comments: 8 pages
Abstract:Skin cancer is among the most prevalent malignancies worldwide, and its early detection is essential for improving patient survival and reducing treatment costs. Conventional dermoscopic and visual imaging techniques are primarily limited to the visible spectrum and often fail to capture subtle spectral signatures associated with early-stage malignancies. This study proposes an innovative framework that integrates a multispectral metasurface for imaging with a hybrid deep learning architecture based on Convolutional Neural Networks and Vision Transformers. The designed metasurface enables noninvasive acquisition of rich spectral information highly sensitive to tissue alterations, while the hybrid CNN-ViT model simultaneously extracts local and global features to robustly classify skin lesions. Simulation-based evaluations demonstrate that the proposed method achieves approximately 98% accuracy, 95% sensitivity, and 99% specificity, surpassing conventional RGB-based and single-architecture approaches. Qualitative analyses using attention maps reveal that the model focuses on clinically relevant lesion regions, improving interpretability. Overall, the results indicate that combining metasurface-based multispectral imaging with hybrid deep learning can introduce a new generation of diagnostic tools in dermatology and pave the way for portable, fast, and highly accurate clinical systems.

