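This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
761 papers were updated today, including:
- Natural language processing: 156 papers
- Information retrieval: 20 papers
- Computer vision: 147 papers
Natural Language Processing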
1. 【2605.31586】Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions
Link: https://arxiv.org/abs/2605.31586
Authors: Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: form-meaning pairings, largest LLMs, challenging problem, rare Paired-Focus constructions, Paired-Focus constructions
Notes: Conference on Natural Language Learning (CoNLL) 2026
Abstract:Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.
2. 【2605.31584】LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
Link: https://arxiv.org/abs/2605.31584
Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: extensive distracting content, integrate key information, large language models, distracting content, remains a central
Notes:
Abstract:Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B-30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets and models are available at this https URL.
3. 【2605.31564】What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation
Link: https://arxiv.org/abs/2605.31564
Authors: Qing Wang, Jacob Devasier, Chengkai Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: diffusion language models, masked diffusion language, language models, systematic study, study of masked
Notes:
Abstract:We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.
4. 【2605.31563】Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection
Link: https://arxiv.org/abs/2605.31563
Authors: Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti
Categories: Computation and Language (cs.CL)
Keywords: well-known in labeling, disagreement is ubiquitous, ubiquitous and well-known, Human disagreement, token-level human rationales
Notes: 16 pages
Abstract:Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet, rationales may provide additional insights into the richness of human reasoning, that may differ in style, values and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics through three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.
5. 【2605.31561】What Am I Missing? Question-Answering as Hidden State Probing
Link: https://arxiv.org/abs/2605.31561
Authors: Chu Fei Luo, Samuel Dahan, Xiaodan Zhu
Categories: Computation and Language (cs.CL)
Keywords: Test-time reasoning, significant field, field of study, large language models, Test-time
Notes:
Abstract:Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). However, the mechanisms of this reasoning process are still under-explored -- from the same input prompt, and even the same partial solution, LLMs can produce varied answers if sampled multiple times. We propose to leverage question-asking as an inference-time intervention that articulates information about the model's hidden state. To achieve that, we present a student-teacher setting where a student asks questions to a teacher. We train a probe on the student's hidden state before and after asking a question and find it is predictive of the trajectory's final correctness, even before generating the teacher's answer. This suggests there is a meaningful signal from the self-diagnosis that occurs during question generation rather than information transfer from the teacher. We then frame question-asking as a sequential decision problem, using this probe as a quality score, and define a gating policy to ask questions that maximize likelihood of correctness. We find that the success of question-asking as an intervention is largely dependent on the model's self-consistency. Our empirical results show a gap between detection and recovery; while our gating policy captures model correctness and uncertainty, interventions are equally likely to harm correct trajectories as they are to recover incorrect ones. This gap between diagnosis and correction has broader implications on language models' capacity for self-refinement under uncertainty.
6. 【2605.31556】Vision-Language Models Suppress Female Representations Under Ambiguous Input
Link: https://arxiv.org/abs/2605.31556
Authors: Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Keywords: Alignment teaches vision-language, expressing demographic biases, avoid expressing demographic, Alignment teaches, teaches vision-language models
Notes: 16 pages, 12 figures, 1 table
Abstract:Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases that are common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when models are prompted with ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.
7. 【2605.31550】Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models
Link: https://arxiv.org/abs/2605.31550
Authors: Yibin Zhao, Fangxin Shang, Dingrui Yang, Yuqi Wang
Categories: Computation and Language (cs.CL)
Keywords: relations encoded implicitly, recover semantic relations, semantic relations encoded, answering requires models, Table question answering
Notes:
Abstract:Table question answering requires models to recover semantic relations encoded implicitly by two-dimensional layout, merged cells, and hierarchical headers. Current pipelines typically use HTML or Markdown as intermediate table representations, but these layout-oriented serializations introduce markup overhead and require large language models to infer header-cell alignments from row and column spans. We propose Semantic Triplet Restoration (STR), a protocol that rewrites each cell as an atomic fact (item path, feature path, value), where the item path specifies the row-wise entity, the feature path specifies the hierarchical attribute, and the value contains the cell content. We also present TripletQL, a lightweight query-aware router that uses STR to select an appropriate rendering or filtered subset of triplets for each question. Across four Chinese and English table-QA benchmarks, STR matches or improves upon HTML-based baselines while reducing input tokens. The relative benefit grows for smaller language models and longer table contexts, suggesting that explicit semantic representations are especially useful under constrained inference budgets. Code and data are available at this https URL.
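The cell-to-triplet rewriting described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `to_triplets` helper, the " > " path separator, and the toy table are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Triplet = Tuple[str, str, str]

def to_triplets(
    row_paths: Sequence[Sequence[str]],
    col_paths: Sequence[Sequence[str]],
    cells: Sequence[Sequence[Optional[object]]],
    sep: str = " > ",  # hypothetical separator for hierarchical paths
) -> List[Triplet]:
    """Rewrite every non-empty cell as (item path, feature path, value)."""
    triplets: List[Triplet] = []
    for r, item_path in enumerate(row_paths):
        for c, feature_path in enumerate(col_paths):
            value = cells[r][c]
            if value is None:  # skip empty cells
                continue
            triplets.append((sep.join(item_path), sep.join(feature_path), str(value)))
    return triplets

# Toy hierarchical table: rows are region/country, columns are year/metric.
row_paths = [("Asia", "Japan"), ("Europe", "France")]
col_paths = [("2023", "Revenue"), ("2023", "Profit")]
cells = [[120, 15], [95, None]]

triplets = to_triplets(row_paths, col_paths, cells)
for t in triplets:
    print(t)
```

Each triplet carries its full row and column context, so a model reading it does not need to infer header-cell alignment from spans.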
8. 【2605.31545】Preference-Aware Rubric Learning for Personalized Evaluation
Link: https://arxiv.org/abs/2605.31545
Authors: Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, aligning model behavior, Language Models, personalized evaluation
Notes:
Abstract:As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods, ranging from automatic metrics to LLM-as-a-judge approaches, fail to capture subjective, user-specific preferences embedded in long-term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User-Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference-Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference-aware evaluation rubrics directly from raw user histories and performs a self-validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user-specific decision boundaries. Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine-grained evaluative patterns. To ensure reproducibility, our code is available at this https URL.
9. 【2605.31521】UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception
Link: https://arxiv.org/abs/2605.31521
Authors: Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Keywords: strong linguistic alignment, compact single-codebook design, design and strong, linguistic alignment, strong linguistic
Notes: 19 pages, 10 figures
Abstract:Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at this https URL.
10. 【2605.31514】If LLMs Have Human-Like Attributes, Then So Does Age of Empires II
Link: https://arxiv.org/abs/2605.31514
Authors: Adrian de Wynter
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: LLM-powered agentic workflows, large language models, agentic workflows, LLM-powered agentic, Greater Boston Area
Notes:
Abstract:Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain constant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions, regardless of the experimenter's viewpoint on the subject. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.
11. 【2605.31512】Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral
Link: https://arxiv.org/abs/2605.31512
Authors: Danish Ali, Li Xiaojian, Sundas Iqbal, Farrukh Zaidi
Categories: Computation and Language (cs.CL)
Keywords: low-resource healthcare settings, language-dependent documentation patterns, mixed scripts, incomplete evidence, healthcare settings
Notes:
Abstract:Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminology, mixed scripts, incomplete evidence, label imbalance and language-dependent documentation patterns. This article presents a reliability-oriented framework for classifying free-text orthopedic notes in English, Hindi and Punjabi. We compare task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, zero-shot instruction-tuned large language models (LLMs) and a domain-adaptive encoder, IndicBERT-HPA. IndicBERT-HPA augments IndicBERT with language-aware orthopedic adapter heads to support clinically relevant multilingual representation learning. Evaluation extends beyond aggregate accuracy to per-class performance, ROC-AUC, AUPRC, expected calibration error, cross-language stability and robustness under controlled balanced and natural-prevalence distributions. The evaluated zero-shot LLMs remain substantially less effective than task-adapted encoders for closed-set classification, with language-dependent instability. Under natural clinical prevalence, IndicBERT-HPA achieves the strongest overall performance, reaching an averaged Macro-F1 of 0.8792, Macro-AUROC of 0.894 and AUPRC of 0.902. We further implement a deterministic selective-verification layer combining confidence gating, evidence-consistency checking and language-risk screening. On a randomly selected held-out 5,000-record subset, it achieves 84.4% selective accuracy and 0.76 selective Macro-F1 at 72.3% coverage, compared with 71.5% accuracy and 0.65 Macro-F1 for accept-all prediction. These results support reliability-oriented multilingual clinical decision support with explicit deferral.
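The coverage and selective-accuracy figures quoted in this abstract come from a deferral setup that can be illustrated with a small sketch. Only the confidence-gating component is shown; the evidence-consistency and language-risk checks are omitted. The threshold, the `selective_metrics` helper, and the toy labels are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

def selective_metrics(
    confidences: List[float],
    predictions: List[str],
    labels: List[str],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Accept a prediction only if its confidence clears the threshold.

    Returns (coverage, selective accuracy); accuracy is None if nothing is accepted.
    """
    accepted = [i for i, c in enumerate(confidences) if c >= threshold]
    coverage = len(accepted) / len(confidences)
    if not accepted:
        return coverage, None
    correct = sum(predictions[i] == labels[i] for i in accepted)
    return coverage, correct / len(accepted)

# Toy data (hypothetical): five notes with model confidence, prediction, gold label.
confidences = [0.95, 0.60, 0.85, 0.40, 0.90]
predictions = ["fracture", "sprain", "fracture", "sprain", "arthritis"]
labels = ["fracture", "fracture", "fracture", "sprain", "sprain"]

coverage, selective_acc = selective_metrics(confidences, predictions, labels, threshold=0.8)
accept_all_acc = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(coverage, selective_acc, accept_all_acc)
```

Raising the threshold trades coverage for accuracy on the accepted subset; deferred cases would be routed to a human reviewer.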
12. 【2605.31506】Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy
Link: https://arxiv.org/abs/2605.31506
Authors: Michael R. DeMarco
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Retrieval-Augmented Generation, current industry standard, current industry, Expert Blindness Effect, real-world facts
Notes: 15 pages, 7 tables. Preliminary findings; Experiment 3 identified as future work
Abstract:Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content based on how closely it sounds like the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence in favor of lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the proportion of verified atomic claims relative to total token count. Using the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis to filter content before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Implementing Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. While full statistical validation across n=50 queries remains future work due to constraints on corpus-benchmark alignment, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.
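The abstract defines factual density as verified atomic claims relative to token count and removes the length confound by Z-scoring within length bins. A minimal sketch of that idea follows. The bin width, the use of the population standard deviation, and the toy documents are assumptions; the paper's exact formulation is not given in the abstract.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import List, Tuple

def raw_density(verified_claims: int, token_count: int) -> float:
    """Proportion of verified atomic claims relative to total token count."""
    return verified_claims / token_count

def binned_z_scores(docs: List[Tuple[int, int]], bin_width: int = 100) -> List[float]:
    """Z-score each document's density against documents of similar length.

    docs holds (verified_claims, token_count) pairs; bin_width is a hypothetical choice.
    """
    bins = defaultdict(list)
    for i, (claims, tokens) in enumerate(docs):
        bins[tokens // bin_width].append((i, raw_density(claims, tokens)))
    z = [0.0] * len(docs)
    for members in bins.values():
        values = [d for _, d in members]
        mu, sigma = mean(values), pstdev(values)
        for i, d in members:
            # A bin with no spread carries no ranking signal, so score it 0.
            z[i] = (d - mu) / sigma if sigma > 0 else 0.0
    return z

# Toy corpus: two short and two long documents.
docs = [(5, 50), (10, 50), (10, 500), (20, 500)]
scores = binned_z_scores(docs)
print(scores)
```

In this toy corpus the short documents have much higher raw density than the long ones, yet after normalization the denser document in each length bin gets the same score, which is the length-independence the abstract is after.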
13. 【2605.31494】Consolidating Rewarded Perturbations for LLM Post-Training
Link: https://arxiv.org/abs/2605.31494
Authors: Zheyu Zhang, Shuo Yang, Gjergji Kasneci
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: commonly framed, sampling Gaussian perturbations, loop implemented, Post-training, Post-training of language
Notes:
Abstract:Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by RandOpt, relocates this loop to weight space, sampling Gaussian perturbations around a pretrained model and ensembling the top-K rewarded specialists at inference. While competitive with PPO and GRPO under matched training compute, this prediction-level ensemble incurs K forward passes per test example and does not extend cleanly to free-form generation. We ask whether the rewarded population can instead be folded into a single deployable model, replacing the inference-time ensemble with one consolidated update. A split-half analysis over 25 model-task pairs reveals reproducible low-rank structure in every case. We turn this geometry into CoRP (Consolidating Rewarded Perturbations), a gradient-free operator that combines reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, with no gradient flowing through the language model. Across five language models from 0.5B to 8B and five tasks covering math, code, and creative writing, CoRP improves the base model by 8.1 points on average. Using one tenth of RandOpt's perturbation budget, CoRP exceeds single-inference RandOpt by 6.5 points and recovers more than half of the gain of the 50-pass majority-vote ensemble, at one forward pass per test example.
14. 【2605.31490】Are Full Rollouts Necessary for On-Policy Distillation?
Link: https://arxiv.org/abs/2605.31490
Authors: Yaocheng Zhang, Jiajun Chai, Songjun Tu, Yuqian Fu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
Categories: Computation and Language (cs.CL)
Keywords: promising post-training paradigm, dense teacher feedback, OPD, On-policy distillation, teacher feedback
Notes: 14 pages, 16 figures
Abstract:On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradigm for long-horizon reasoning. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as a key bottleneck in OPD that substantially impacts training efficiency. Unlike Reinforcement Learning with Verifiable Rewards (RLVR), OPD does not require a complete trajectory or a final answer reward to provide learning signals. This observation suggests that full rollouts may not always be necessary for effective OPD. Motivated by this insight, we propose two simple horizon-control strategies: Progressive OPD (POPD), which gradually expands the rollout horizon during training, and Truncated OPD (TOPD), which permanently performs distillation on reliable truncated rollouts. Experiments on mathematical reasoning show that POPD improves the training efficiency of OPD by up to 3x, while TOPD matches OPD performance using only 10% of the rollout horizon, leading to substantial wall-clock and memory reductions. These results demonstrate that controlling the rollout horizon offers a simple and practical path to more efficient OPD.
15. 【2605.31483】BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali
Link: https://arxiv.org/abs/2605.31483
Authors: Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Ishmam Tashdeed, Md Taukir Azam Chowdhury
Categories: Computation and Language (cs.CL)
Keywords: Generative Question Answering, systematically evaluated hallucination, sixth most spoken, prior work, work has systematically
Notes: Preprint. Under review
Abstract:Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at this https URL.
16. 【2605.31480】Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task
Link: https://arxiv.org/abs/2605.31480
Authors: Bart Evelo, Meaghan Fowlie, Denis Paperno
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, genuinely acquire compositional, Large Language, Personal Relation Task, genuinely acquire
Notes: A pre-MIT Press publication version. Paper accepted to Transactions of the Association for Computational Linguistics
Abstract:Do neural models, such as Large Language Models, genuinely acquire compositional abilities for interpretation of natural language? When we talk about semantic interpretation, we can distinguish two complementary aspects: establishing what an expression refers to in the world (which we call the Extensional task) and representing its sense in a structured way (which we call the Intensional task). We evaluate LLMs and humans on both tasks in the setting of the Personal Relation Task (Paperno 2022) in which, given a universe of people and their relationships with each other, one is asked to interpret a noun phrase such as "Amber's parent's friend". Here, for the Intensional task, the answer is the formula "friend(parent(amber))", and for the Extensional task, the person. We find that humans and LLMs show opposite strengths: humans perform better on Extensional than Intensional tasks, and LLMs vice versa. Our methodology brings greater nuance to the understanding of compositional abilities in modern machine learning models. Our results support the notion that the lack of referential grounding in LLM training is a crucial missing component in mimicking human-like language understanding.
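The difference between the Intensional and Extensional tasks can be made concrete with a toy sketch. The universe below, the helper names, and the simplification that every relation is a function (each person has exactly one parent and one friend) are hypothetical and are not taken from the paper's data.

```python
from typing import Dict, List

# Hypothetical universe: each relation maps a person to exactly one other person.
universe: Dict[str, Dict[str, str]] = {
    "parent": {"amber": "bob"},
    "friend": {"bob": "carol", "amber": "dave"},
}

def parse(phrase: str) -> List[str]:
    """Split "Amber's parent's friend" into ["amber", "parent", "friend"]."""
    return phrase.replace("'s", "").lower().split()

def intensional(phrase: str) -> str:
    """Return the structured sense of the phrase as a nested formula."""
    entity, *relations = parse(phrase)
    formula = entity
    for relation in relations:
        formula = f"{relation}({formula})"
    return formula

def extensional(phrase: str, universe: Dict[str, Dict[str, str]]) -> str:
    """Return the person the phrase refers to in the given universe."""
    person, *relations = parse(phrase)
    for relation in relations:
        person = universe[relation][person]
    return person

phrase = "Amber's parent's friend"
print(intensional(phrase))            # friend(parent(amber))
print(extensional(phrase, universe))  # carol
```

The Intensional answer depends only on the phrase, while the Extensional answer also requires looking up each step in the universe.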
17. 【2605.31478】Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation
Link: https://arxiv.org/abs/2605.31478
Authors: Hui Wu, Xiaoyang Wang, Zhong Fan
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Systems and Control (eess.SY)
Keywords: energy-research labs require, automate power-system analysis, Large language models, labs require on-premise, require on-premise serving
Notes: 43 pages, 12 figures, includes supplementary material
Abstract:Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.
18. 【2605.31469】Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus
Link: https://arxiv.org/abs/2605.31469
Authors: Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: Conversational automatic speech, automatic speech recognition, Conversational automatic, dialogue-style training data, automatic speech
Comments:
Click to view abstract
Abstract:Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.
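The WER figures mentioned above are edit-distance metrics. As a minimal sketch (not the authors' evaluation code; the function names `edit_distance` and `wer` are illustrative assumptions), word error rate can be computed as the word-level Levenshtein distance divided by the reference length. CER applies the same idea to characters, and cpWER additionally searches over speaker permutations, which this sketch omits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.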
19. 【2605.31463】PithTrain: A Compact and Agent-Native MoE Training System
Link: https://arxiv.org/abs/2605.31463
Authors: Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Keywords: frontier language models, language models, frontier language, dominant architecture, MoE
Comments:
Click to view abstract
Abstract:Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. With the rise of AI coding agents, they could automate parts of training-framework development and accelerate this evolution. But applying them to these existing frameworks carries hidden costs, invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows PithTrain matches the throughput of production frameworks, and on ATE-Bench, PithTrain enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.
20. 【2605.31455】DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
Link: https://arxiv.org/abs/2605.31455
Authors: Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large language models, provide lightweight feedback, iteratively provide lightweight, Large language, multi-turn interactive settings
Comments:
Click to view abstract
Abstract:Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning can effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at this https URL.
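The link between KL-regularized RL and weighted supervised learning can be illustrated with a small sketch. It assumes the textbook form of the result, in which the optimal policy is proportional to the reference policy times exp(R / beta), so trajectories sampled from the reference policy are weighted by exp(R / beta). The paper's exact weighting may differ, and the function names and the `beta` parameter are illustrative assumptions rather than the authors' API.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(R_i / beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in returns]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood over offline trajectories.

    log_probs[i] is the policy's log-probability of trajectory i,
    which in practice would come from the model being fine-tuned.
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

A smaller `beta` concentrates the weight on the highest-return trajectories, while a larger `beta` approaches uniform weighting, which is plain SFT on the offline data.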
21. 【2605.31452】Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows
Link: https://arxiv.org/abs/2605.31452
Authors: Yuri Balashov, Rex VanHorn, Mingxi Xu, Austin Downes
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: accessible analytic methods, paper develops practical, evaluate translation technologies, smaller language service, language service providers
Comments: 20 pages. Accepted at EAMT-2026 (Tilburg, Netherlands, June 2026)
Click to view abstract
Abstract:Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods. Here we address a high-stakes, specialized need: offline translation for confidentiality-sensitive domains in which privacy constraints preclude the use of cloud-based engines and commercial LLMs. We expand the Reeve Foundation Trilingual Corpus (RFTC) used in our previous work into a multilingual corpus (RFMC) by adding sentence-aligned German and Simplified Chinese reference translations. We then benchmark several locally runnable language models (via Ollama) across four language directions on 1000+ sentences selected from this corpus. We use consistent single-prompt calls without fine-tuning or domain adaptation, comparing local LLM outputs against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional-grade local NMT systems (OPUS-CAT, NeuralDesktop, Promt). Automatic evaluation is conducted with MATEO. Results reveal substantial variation in local LLM performance across language directions and model sizes. The best local LLMs match or surpass local NMT systems and a frontier LLM, though they remain behind top commercial NMTs. These findings underscore the viability of carefully selected local LLM translation for privacy-constrained professionals and inform future research on model scaling and multilingual capability.
22. 【2605.31446】Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction
Link: https://arxiv.org/abs/2605.31446
Authors: Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: identify aspect terms, Aspect Sentiment Triplet, providing essential inputs, Aspect Sentiment, Sentiment Triplet Extraction
Comments: 25 pages, 13 figures, and 6 tables
Click to view abstract
Abstract:Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion mining, explainable recommendations, and review summarization. Prior work mainly focuses on end-to-end extraction, while post hoc verification of extracted triplets remains comparatively underexplored. This gap limits the reliability of ASTE systems, since predicted triplets may be locally plausible while being globally invalid. Moreover, candidate invalidity is multi-faceted and candidate usability is inherently graded, motivating a fine-grained verification mechanism that can filter or re-rank outputs from diverse extractors. In this paper, we propose FiVeD, a framework for Fine-grained Verification with Diagnostic reasoning supervision. Specifically, the verifier is trained with multiple complementary objectives, including validity classification and quality score estimation as primary tasks, with error type classification and rationale generation as auxiliary tasks. We define hierarchical error categories and construct plausible incorrect triplets under semantic and syntactic constraints, and leverage an off-the-shelf LLM with task-specific rubrics to produce quality scores and diagnostic rationales. During inference, the resulting quality scores are used to filter candidate outputs, supporting adjustable precision-recall tradeoffs. Experiments across multiple ASTE baselines demonstrate that FiVeD consistently improves extraction performance by up to 3.53 F1 points as a plug-and-play verification module.
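The adjustable precision-recall tradeoff described above can be sketched as score thresholding. This is a hypothetical illustration, not FiVeD itself: the candidate format (a dict with `triplet` and `score` keys), the function names, and the example scores are all assumptions.

```python
def filter_by_quality(candidates, threshold):
    """Keep candidate triplets whose verifier score meets the threshold."""
    return [c for c in candidates if c["score"] >= threshold]


def precision_recall(kept, gold):
    """Precision and recall of the kept triplets against a gold set."""
    predicted = {c["triplet"] for c in kept}
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall


# Toy candidates as (aspect, opinion, polarity) with made-up scores.
candidates = [
    {"triplet": ("food", "delicious", "POS"), "score": 0.9},
    {"triplet": ("service", "slow", "NEG"), "score": 0.6},
    {"triplet": ("food", "slow", "NEG"), "score": 0.3},  # plausible but invalid
]
gold = {("food", "delicious", "POS"), ("service", "slow", "NEG")}
```

Raising the threshold drops low-scoring candidates, which tends to raise precision at the cost of recall. Lowering it does the opposite.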
23. 【2605.31445】Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information
Link: https://arxiv.org/abs/2605.31445
Authors: Antonio Valerio Miceli-Barone, Vaishak Belle, Shay B. Cohen
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: mutually beneficial trades, negotiate mutually beneficial, beneficial trades, mutual uncertainty, seller communicate
Comments: 18 pages, 14 figures
Click to view abstract
Abstract:In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt to negotiate mutually beneficial trades, under different information regimes (complete information, information asymmetry or mutual uncertainty). We evaluate their performance w.r.t. game-theoretical solutions and further investigate their honesty (their tendency to disclose or withhold information or to mislead and deceive) as well as their credulity (their tendency to trust or distrust information provided by the other agent). We study zero-shot LLM agents with simple prompting scaffolding as well as fine-tuned agents, in order to investigate whether optimising the agents to maximise financial profits makes them stronger negotiators but also more dishonest and less trusting. We find that off-the-shelf LLMs all substantially deviate from game-theoretical equilibria; they attempt to lie about their private information but cannot efficiently exploit information asymmetries. Fine-tuning on financial utility makes the agents stronger at achieving better deals but also more dishonest, highlighting the risks that optimising agents for a task can have on their safety. We release our code and a dataset of bargaining scenarios.
24. 【2605.31433】SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks
Link: https://arxiv.org/abs/2605.31433
Authors: Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini
Categories: Computation and Language (cs.CL)
Keywords: train language models, external supervision, train language, Solver, tasks
Comments:
Click to view abstract
Abstract:Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
25. 【2605.31432】DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs
Link: https://arxiv.org/abs/2605.31432
Authors: Sara Papi, Luisa Bentivogli
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Keywords: Speech Large Language, Large Language Models, Speech Large, Large Language, generates translations
Comments:
Click to view abstract
Abstract:Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.
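For context on the heuristic baseline mentioned above, a wait-k policy reads k source units before writing the first target token and then alternates between writing and reading. The sketch below shows that baseline only, not the DOA policy. The function name and the idea of counting speech in discrete source units are illustrative assumptions.

```python
def wait_k_actions(num_source, num_target, k):
    """Return the READ/WRITE action sequence of a wait-k policy."""
    if k < 1:
        raise ValueError("k must be at least 1")
    actions = []
    read = written = 0
    while written < num_target:
        # Before writing target token t, the policy needs min(t + k, |source|)
        # source units to have been read.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

With 4 source units, 4 target tokens, and k = 2, this yields READ, READ, WRITE, READ, WRITE, READ, WRITE, WRITE. The fixed lag is what makes wait-k a heuristic: the schedule ignores the content of the input.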
26. 【2605.31421】Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm
Link: https://arxiv.org/abs/2605.31421
Authors: Fabio Massimo Zanzotto, Federico Ranaldi, Giorgio Satta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Keywords: neural network architecture, Chomsky Normal Form, recurrent neural network, neural network, network architecture
Comments: 9 content pages
Click to view abstract
Abstract:In this paper, we show the possibility of a direct injection of algorithms into neural network architecture. We focus on a complex algorithm, that is, Cocke-Younger-Kasami (CYK) for parsing context-free grammars in Chomsky Normal Form, and we propose CYKNN, a simple recurrent neural network architecture for encoding the CYK algorithm in trainable matrix-vector operations. We experimented with a very simple grammar with 4 variations, showing that our approach outperforms existing LLMs with more than 20B parameters in an in-context learning setting and smaller LLMs of the Qwen family fine-tuned with LoRA. Our attempt paves the way to a different approach to neuro-symbolic methodologies.
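For readers unfamiliar with the algorithm being encoded, the classical symbolic CYK recognizer can be sketched as follows. This is the standard dynamic program, not the CYKNN architecture. The rule representation and the toy grammar for the language a^n b^n are illustrative assumptions.

```python
def cyk_recognize(words, unary_rules, binary_rules, start="S"):
    """Return True if the CNF grammar derives the word sequence."""
    n = len(words)
    if n == 0:
        return False
    # table[i][l] holds the nonterminals that derive words[i : i + l + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][0] = {lhs for lhs, terminal in unary_rules if terminal == word}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]


# Toy CNF grammar for a^n b^n (n >= 1): S -> A B | A T, T -> S B.
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B"))]
```

The table is filled bottom-up by span length, so each cell depends only on shorter spans.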
27. 【2605.31408】Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study
Link: https://arxiv.org/abs/2605.31408
Authors: Xiaonan Xu, Wenjing Wu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: documents provide procedural, provide procedural knowledge, Skill documents provide, percentage points, Skill
Comments:
Click to view abstract
Abstract:Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.
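The analysis design described above (trials aggregated per task, then paired contrasts over tasks with bootstrap confidence intervals) can be sketched as follows. The percentile bootstrap, the function names, and the default resample count are assumptions for illustration. The abstract does not say which bootstrap variant the authors used.

```python
import random
from statistics import mean


def task_means(trials_by_task):
    """Average the repeated trials (0/1 pass outcomes) within each task."""
    return [mean(trials) for trials in trials_by_task]


def paired_bootstrap_ci(cond_a, cond_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (estimate, lower, upper) for the mean paired difference a - b."""
    if len(cond_a) != len(cond_b) or not cond_a:
        raise ValueError("both conditions must cover the same non-empty task list")
    # One paired difference per task: the task, not the trial, is the unit.
    diffs = [a - b for a, b in zip(task_means(cond_a), task_means(cond_b))]
    rng = random.Random(seed)
    # Resample tasks with replacement and recompute the mean difference.
    boot = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = boot[int(alpha / 2 * n_resamples)]
    upper = boot[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper
```

An interval that crosses zero, as reported for the presentation-granularity contrasts, means the data are compatible with no difference between the two conditions.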
28. 【2605.31404】The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning
Link: https://arxiv.org/abs/2605.31404
Authors: Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen, Xian Wei, Ke Li, Xiong You
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Model, Large Language, systems commonly construct, commonly construct explicit, based navigation systems
Comments:
Click to view abstract
Abstract:Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text-based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual-interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double-edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel -- incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text-based spatial representations in LLM-based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at this https URL.
29. 【2605.31401】"Intelegi Româneşte?" A Recipe for Romanian Vision-Language Models
Link: https://arxiv.org/abs/2605.31401
Authors: Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea
Categories: Computation and Language (cs.CL)
Keywords: text-only LLM trajectory, largely follow, follow the text-only, sharply degrading, degrading on low-resource
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.
30. 【2605.31393】Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models
Link: https://arxiv.org/abs/2605.31393
Authors: Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: limited paired sign-video, heavy-tailed target vocabularies, Sign language translation, Sign language, German Sign Language
Comments: Accepted at GenSign (https://genai4sl.github.io/) at CVPR 2026. Non-proceedings track
Click to view abstract
Abstract:Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We study target-side augmentation in which GPT-4o generates controlled paraphrase variants of reference sentences while the sign input remains unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate on three datasets spanning complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side paraphrases and LLM-as-a-Judge evaluation to SLT. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.
31. 【2605.31387】Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely
Link: https://arxiv.org/abs/2605.31387
Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: diverse environments rely, Robots operating, operating in diverse, diverse environments, environments rely
Comments: Preprint
Click to view abstract
Abstract:Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.
32. 【2605.31381】LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories
Link: https://arxiv.org/abs/2605.31381
Authors: Krishnapriya Vishnubhotla, Soumya Vajjala, Akriti Vij, Isar Nejadgholi
Categories: Computation and Language (cs.CL)
Keywords: multi-dimensional safety evaluation, reference-free setup, Large Language Models, evaluate the consistency, conducting a multi-dimensional
Comments: 8 pages plus appendices, under review
Click to view abstract
Abstract:We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues related to machine-generated advice in regulated domains such as finance, although they are more reliable at identifying more overt forms of unsafe/harmful content such as violence. The degree of inconsistency in a model's judgments can vary significantly by the chosen safety criteria and can be impacted by the language of the content and its linguistic style as well. Finally, there is high disagreement among different judges for the same output, across domains, safety criteria, and languages. These findings provide new insights on the practice of using LLMs as evaluators and offer several recommendations for practitioners on how to use automated judges in practical scenarios.
33. 【2605.31378】Unlocking Fine-Grained Translation Quality Estimation in LRMs through Synergistically Evolving Implicit and Explicit Reasoning
Link: https://arxiv.org/abs/2605.31378
Authors: Renfei Dang, Xinye Wang, Zhejian Lai, Weilu Xu, Shimin Tao, Daimeng Wei, Min Zhang, Shujian Huang
Categories: Computation and Language (cs.CL)
Keywords: translation quality estimation, Large Reasoning Models, fine-grained translation quality, Large Reasoning, Reasoning
Comments:
Click to view abstract
Abstract:Large Reasoning Models (LRMs) still struggle with fine-grained translation quality estimation (QE), even with long reasoning chains. We argue that LRMs already possess strong multilingual capabilities, while the core challenge stems from the intrinsic difficulty of learning the fine-grained QE task. In this paper, we propose RIEQE (Reasoning both Implicitly and Explicitly for QE), a simple two-stage training framework that enables the co-evolution of implicit (layer-wise) and explicit (token-wise) reasoning capabilities. To make implicit reasoning feasible, we first decompose the complex QE task into straightforward subtasks. Based on this, our two-stage approach applies: (1) NonThinking-SFT, Supervised Fine-Tuning (SFT) without reasoning chains to directly boost the model's implicit reasoning tendency and capability; and (2) Thinking-RLVR, standard Reinforcement Learning with Verifiable Reward (RLVR) to subsequently strengthen explicit reasoning. Results demonstrate that implicit and explicit reasoning synergistically co-evolve under our framework. On the WMT test sets, RIEQE based on Qwen3-4B-Thinking-2507 surpasses all baselines in explicit reasoning performance, while its implicit reasoning capability is also comparable to the best current encoder-based models. We further provide evidence for the synergistic collaboration between implicit and explicit reasoning, showing how they mutually benefit each other.
34. 【2605.31367】Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing
Link: https://arxiv.org/abs/2605.31367
Authors: Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: generate long-range dependencies, mixing layers play, Token mixing layers, long-range dependencies, mixing layers
Comments: 20 pages, 3 figures, ICML 2026 main
Click to view abstract
Abstract:Token mixing layers play a key role in how language models can learn and generate long-range dependencies. Their efficiency relies on the necessary trade-off between decoding speed and the memory requirements, along with the cache size. Considering causal generation, this paper explores new trade-offs thanks to a unified framework which separates two crucial features: (i) the direct influence of inputs on outputs in one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, but also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, while providing theoretical insights on their expressivity -- trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks, along with language modeling. Together, these results provide a unified toolkit for the understanding and design of efficient and expressive token mixers across model families.
35. 【2605.31363】The Latin Substrate: How Language Models Represent and Mediate Script Choice
Link: https://arxiv.org/abs/2605.31363
Authors: Daniil Gurgurov, Alan Saji, Katharina Trinley, Josef van Genabith, Simon Ostermann
Categories: Computation and Language (cs.CL)
Keywords: distinct orthographic forms, generate equivalent linguistic, requiring large language, equivalent linguistic content, requiring large
Comments: preprint
Click to view abstract
Abstract:Many languages are written in multiple scripts, requiring large language models (LLMs) to generate equivalent linguistic content in distinct orthographic forms. While prior work suggests that LLMs route information through shared latent representations, how they internally mediate script variation remains poorly understood. We study this question by first examining per-layer output distributions with the logit lens, which reveals consistent latent romanization during transliteration, and then through representational and mechanistic analyses of script generation. At the representational level, we show that scripts of the same language become increasingly separable across layers and that a simple linear steering direction can flip a model's output script while largely maintaining semantic content. The vector generalizes asymmetrically to writing systems unseen during construction, flipping non-Latin output to Latin reliably, but mapping Latin output into varied non-Latin scripts. At the mechanistic level, we localize a small set of late-layer attention heads that causally mediate script choice. These heads transfer across unrelated languages and writing systems, suggesting that script routing is implemented by language-agnostic components. Across both analyses, we observe a consistent directional asymmetry: non-Latin output is produced by a compact, identifiable gate, while Latin-script output emerges from diffuse contributions across the network. Collectively, our findings hint that LLMs organize script variation around shared latent representations while exhibiting a privileged substrate toward Latin script.
36. 【2605.31351】A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation
Link: https://arxiv.org/abs/2605.31351
Authors: Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visually Impaired Assistance, AI-based Visually Impaired, Impaired Assistance Benchmark, Visually Impaired, Impaired Assistance
Abstract:AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness-Impartiality-Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%, while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It improves diagnostic accuracy and yields downstream VIA responses that BLV users prefer. Data and code are available at: this https URL
37. 【2605.31349】FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection
Link: https://arxiv.org/abs/2605.31349
Authors: Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Functionality Based Hateful, confounding rhetorical hate, Hateful meme detection, Based Hateful Memes, rhetorical hate mechanisms
Abstract:Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.
38. 【2605.31338】Bundesrecht: An Open Library and Corpus for German Statutory Reference Processing
Link: https://arxiv.org/abs/2605.31338
Authors: Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo
Categories: Computation and Language (cs.CL)
Keywords: combine multiple targets, legal language understanding, German statutory reference, German statutory, German
Comments: 10 pages, 1 figure. Preprint
Abstract:Statutory references are central to legal language understanding, but are difficult to process automatically, as they appear in compact and variable surface forms, may combine multiple targets, use special abbreviations, and often point to lower-level units. Existing tools for German focus either on parsing references from legal documents or accessing statutory text once citations are explicit. This paper introduces bundesrecht, an open resource for German statutory reference processing, consisting of a software library and a structured corpus of German federal law. The library parses, normalizes, and resolves German statutory references, mapping raw citation strings to structured objects, expanding compact references into canonical forms, and linking them to statutory provisions. The accompanying dataset preserves the internal hierarchy of statutes from laws to fine-granular subclauses. We evaluate the parser and normalizer on 2,944 annotated German legal references using strict exact-match and micro information extraction metrics. We further evaluate canonical reference deduplication and show that normalized references group real citation surface variants far more reliably than string matching. bundesrecht is the first open resource that covers German statutory reference processing as an end-to-end pipeline, from raw citation string to resolved statutory provision, and is available on PyPI.
39. 【2605.31328】Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards
Link: https://arxiv.org/abs/2605.31328
Authors: Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek, Florian Mai
Categories: Computation and Language (cs.CL)
Keywords: Emergent misalignment, surprising tendency, tendency of language, Emergent, language models
Abstract:Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals. Third, we evaluate in-training mitigations developed for SFT-induced EM and find that they broadly transfer, with interleaving on-policy safety data performing best.
40. 【2605.31312】Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization
Link: https://arxiv.org/abs/2605.31312
Authors: Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal hallucination remains, Multimodal hallucination, Vision-Language Models, Direct Preference Optimization, visual preference DPO
Comments: ICML 2026
Abstract:Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability-gated regularizer that encourages consistency between multi-image contrastive training and single-image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at this https URL.
41. 【2605.31293】Divergence Decoding: Inference-Time Unlearning via Auxiliary Models
Link: https://arxiv.org/abs/2605.31293
Authors: Humzah Merchant, Bradford Levy
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, frequently memorize sensitive, creating significant privacy, memorize sensitive training
Abstract:Large Language Models (LLMs) frequently memorize sensitive training data, thereby creating significant privacy and copyright risks. Addressing these risks, i.e., removing such knowledge from an existing model checkpoint, has proven challenging as many unlearning methods lead to catastrophic utility loss or are ineffective for complex queries. We introduce Divergence Decoding (DD), a mechanism that uses small auxiliary models to steer the logits of the LLM away from specific data during inference. Training these models is straightforward, i.e., we use standard pre-training and fine-tuning setups. We find the method decisively outperforms state-of-the-art (SOTA) baselines on unlearning benchmarks across a variety of model and training dataset scales, consistent with DD being an effective and inexpensive solution to unlearning. We then demonstrate that this steered distribution can be trivially distilled back into the base model. Since the method is generally applicable to any probabilistic model, we explore its efficacy outside of text generation and find evidence of generalization to the domain of images.
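Editor's sketch: the abstract says only that small auxiliary models steer the base model's logits away from specific data at inference time; it does not give the combination rule. The snippet below is a minimal illustration under an assumed rule, where the base logits are shifted by the gap between a hypothetical "forget" auxiliary model and a hypothetical "retain" auxiliary model. The function names, the two-model setup, and the `alpha` strength parameter are all assumptions for illustration, not the paper's method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(base_logits, aux_forget_logits, aux_retain_logits, alpha=1.0):
    """Assumed rule: push the base logits away from tokens that the
    'forget' auxiliary model prefers more than the 'retain' one does."""
    return [
        b - alpha * (f - r)
        for b, f, r in zip(base_logits, aux_forget_logits, aux_retain_logits)
    ]


# Toy 3-token vocabulary; token 0 stands in for memorized content.
base = [2.0, 1.0, 0.5]
aux_forget = [3.0, 0.0, 0.0]   # hypothetical model trained on the data to forget
aux_retain = [0.0, 0.0, 0.0]   # hypothetical model trained without it

steered = steer_logits(base, aux_forget, aux_retain, alpha=1.0)
probs_before = softmax(base)
probs_after = softmax(steered)
```

With these toy numbers, the probability of token 0 falls after steering, while setting `alpha=0` leaves the base distribution unchanged.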
42. 【2605.31281】Wind Turbine Maintenance Log Labelling Framework: LLM-Driven Data Correction and Enrichment via Semantic Extraction of Reliability Intelligence
Link: https://arxiv.org/abs/2605.31281
Authors: Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya
Categories: Computation and Language (cs.CL)
Keywords: service life extension, data-driven reliability engineering, turbine fleets age, wind turbine fleets, fleets age
Comments: An adjustable template containing the Python script architecture, applied dynamic prompts, and data schemas is hosted in an open-source GitHub repository: [this https URL](https://github.com/mvmalyi/llm-driven-wind-turbine-maintenance-log-labelling)
Abstract:As wind turbine fleets age, data-driven reliability engineering is essential to optimise their operation and maintenance for service life extension and levelised cost of energy reduction. Failure event descriptions within historical maintenance logs are a source of valuable reliability intelligence. However, they typically appear as unstructured natural language entries, rendering them inaccessible for quantitative analysis. This paper presents a novel methodology leveraging a large language model (LLM) to systematically standardise and structure maintenance logs based on their free-text descriptors. Operating on a dataset of 16,316 maintenance logs from 280 turbines monitored over nine years, the developed model-agnostic framework autonomously corrected hierarchical system codes and extracted evidence-based taxonomies of maintenance actions and failure modes. The automated pipeline successfully structured over 70% of the dataset. It resolved pervasive misclassification issues, such as isolating previously unclassified pitch system faults and restoring missing system codes, and enriched the records by applying empirical taxonomies to label specific actions taken and failure modes addressed. By using system-based log batches to construct empirical dictionaries of failure modes, observable symptoms, dominant mechanisms, and candidate causes, this approach reduces the inherent subjectivity of manual failure modes and effects analysis (FMEA). Ultimately, the methodology provides a highly scalable, cost-effective blueprint for translating large sets of qualitative field observations into quantitative reliability metrics, laying the foundation for integrated root-cause analysis across the renewable energy sector, improved FMEA, and advanced predictive maintenance.
43. 【2605.31268】Mellum2 Technical Report
Link: https://arxiv.org/abs/2605.31268
Authors: Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok, Petr Borovlev, Kseniia Lysaniuk, Madeeswaran Kannan, Ivan Dolgov, Nikita Pavlichenko
Categories: Computation and Language (cs.CL)
Keywords: dense Mellum model, Sliding Window Attention, present Mellum, general-purpose language model, language model specialized
Abstract:We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
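Editor's sketch: the "64 experts, 8 active" figure in the abstract describes sparse routing, where each token is sent to only a few experts. The snippet below shows generic top-k routing with a softmax renormalized over the selected experts. Only the two counts come from the abstract; the routing rule, the function name `top_k_route`, and the toy scores are assumptions, and Mellum 2's actual router may differ.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return {expert_index: weight},
    with weights from a softmax restricted to the selected experts."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = ranked[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


NUM_EXPERTS = 64  # from the abstract
ACTIVE = 8        # from the abstract

# Deterministic toy router scores for one token (a permutation of 0.0..6.3).
logits = [((i * 37) % 64) / 10.0 for i in range(NUM_EXPERTS)]
weights = top_k_route(logits, ACTIVE)
```

Only the 8 selected experts run for this token, which is why per-token compute tracks the active parameter count rather than the total.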
44. 【2605.31264】COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation
Link: https://arxiv.org/abs/2605.31264
Authors: Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, Xia Hu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: complete isolated tasks, carry bounded representations, LLM agents, isolated tasks, human expertise
Comments: 12 pages, 4 figures
Abstract:LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.
45. 【2605.31238】Scaling Multi-Hop Training Data via Graph-Constrained Path Selection
Link: https://arxiv.org/abs/2605.31238
Authors: Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Endowing large language, data rarely exists, curated benchmarks built, large language models, documents requires multi-hop
Comments: 21 pages, 5 figures
Abstract:Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, where such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ${\sim}91^{\circ}$, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4$\times$ expansion of the usable corpus rather than from higher per-chain quality -- reframing the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. We have released our codes at this https URL.
46. 【2605.31220】Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models
Link: https://arxiv.org/abs/2605.31220
Authors: Athina Kyriakou, Dennis Ulmer, Ivan Titov
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: attracted great interest, large language models, model prediction, quantifying the reliability, attracted great
Abstract:Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.
47. 【2605.31212】Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education
Link: https://arxiv.org/abs/2605.31212
Authors: Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: educational content creation, support educational content, content creation, intended to teach, systems are increasingly
Abstract:AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.
48. 【2605.31201】Learning Whom to Trust: Market-Feedback Adaptive Retrieval for Frozen LLMs in Event-Driven Financial RAG
Link: https://arxiv.org/abs/2605.31201
Authors: Zijie Zhao, Roy E. Welsch
Categories: Computation and Language (cs.CL)
Keywords: typically rank evidence, Financial retrieval-augmented generation, evidence source depends, systems typically rank, forecast horizon
Abstract:Financial retrieval-augmented generation (RAG) systems typically rank evidence by textual relevance, but in financial markets the useful evidence source depends on event type, forecast horizon, and market context. We study news-triggered event-impact prediction as a point-in-time financial RAG problem. For each company-news anchor, the system retrieves related financial news and SEC filing passages, appends a pre-decision market-context card, and predicts multi-horizon residual-return signals. Our method keeps the large language model (LLM) reader frozen and adapts the retrieval layer through an external Bayesian source memory updated from matured residual-return feedback. On a fixed 89-stock Nasdaq-oriented universe derived from the FinRL-DeepSeek/FNSPID task, using original FNSPID news and point-in-time EDGAR filing passages, Frozen Reader with Source Memory improves held-out macro-F1 from 0.438 to 0.471 and downstream portfolio Sharpe from 0.52 to 0.84 relative to Frozen Reader with No Memory. A supervised LoRA reader improves static RAG modestly, but does not improve over the frozen source-memory reader. These results suggest that, for financial RAG, learning where to retrieve from can be as important as learning how to read, offering a simple, modular route to market-feedback adaptation.
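Editor's sketch: the abstract describes an external Bayesian source memory that is updated from matured feedback and used to adapt retrieval while the reader stays frozen, but it does not specify the model. One simple way to picture the idea is a Beta-Bernoulli reliability estimate per (source, event type) pair that reweights text relevance. Everything below, including the class name `SourceMemory`, the uniform prior, the binary "was useful" feedback, and the multiplicative score, is an assumption for illustration rather than the paper's formulation.

```python
from dataclasses import dataclass, field


@dataclass
class SourceMemory:
    """Hypothetical Beta-Bernoulli memory keyed by (source, event_type)."""
    prior_hits: float = 1.0
    prior_misses: float = 1.0
    counts: dict = field(default_factory=dict)  # key -> [hits, misses]

    def update(self, source, event_type, was_useful):
        """Record one matured feedback signal for a source."""
        entry = self.counts.setdefault((source, event_type), [0, 0])
        entry[0 if was_useful else 1] += 1

    def reliability(self, source, event_type):
        """Posterior mean of the Beta distribution for this source."""
        hits, misses = self.counts.get((source, event_type), (0, 0))
        a = self.prior_hits + hits
        b = self.prior_misses + misses
        return a / (a + b)

    def rerank(self, candidates, event_type):
        """candidates: list of (passage_id, source, text_relevance)."""
        return sorted(
            candidates,
            key=lambda c: c[2] * self.reliability(c[1], event_type),
            reverse=True,
        )


memory = SourceMemory()
for _ in range(3):
    memory.update("sec_filing", "earnings", was_useful=True)
    memory.update("news", "earnings", was_useful=False)

candidates = [("n1", "news", 0.9), ("f1", "sec_filing", 0.6)]
ranked = memory.rerank(candidates, "earnings")
```

In this toy run the filing passage outranks the news passage despite lower text relevance, because past feedback for earnings events favored filings.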
49. 【2605.31196】Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
Link: https://arxiv.org/abs/2605.31196
Authors: Jun Wang, Xiaohao Xu, Xiaonan Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: Safe human, robot collaboration requires, safely separated, collaboration requires, Safe
Comments: 31 pages, 9 figures
Abstract:Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.
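Editor's sketch: several abstracts in this digest report Macro-F1, which averages per-class F1 so that rare classes count as much as common ones. The snippet below computes it on a toy three-class safety problem to show why a model can score well on accuracy and poorly on Macro-F1. The label names and the toy predictions are hypothetical and are not taken from TouchSafeBench; classes with no true or predicted instances are scored 0.0 here, which is one common convention.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0.0 when undefined.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = ["safe", "colliding", "imminent"]  # hypothetical label names

# A degenerate monitor that always answers "safe" on imbalanced data.
y_true = ["safe"] * 8 + ["colliding", "imminent"]
y_pred = ["safe"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, LABELS)
```

Here accuracy is 0.8, but Macro-F1 is roughly 0.30 because the two unsafe classes each contribute an F1 of zero.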
50. 【2605.31183】Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
Link: https://arxiv.org/abs/2605.31183
Authors: Mikkel Godsk Jørgensen, Lars Kai Hansen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, model output generation, internals of Large, Language Models
Abstract:Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).
51. 【2605.31175】Towards Efficient LLMs Annealing with Principled Sample Selection
Link: https://arxiv.org/abs/2605.31175
Authors: Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang
Categories: Computation and Language (cs.CL)
Keywords: ultimately determines final, LLM pre-training, final model quality, determines final model, pre-training that ultimately
Abstract:The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. For future research, code is available at this https URL.
52. 【2605.31170】Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Link: https://arxiv.org/abs/2605.31170
Authors: Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen, Annemette Brok Pirchert, Filippo Tonini, Peter Schneider-Kamp, Lukas Galke Poech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Monitoring autonomous language, autonomous language model, languages, Moltbook Files dataset, Monitoring autonomous
Abstract:Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols like embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in ideation of these languages, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.
53. 【2605.31164】D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training
Link: https://arxiv.org/abs/2605.31164
Authors: Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, motivating extensive research, Training data plays, language models, motivating extensive
Abstract:Training data plays a central role in large language model (LLM) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improve learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at this https URL.
54. 【2605.31148】SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
Link: https://arxiv.org/abs/2605.31148
Authors: Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: form cognitive representations, perceive spatial layouts, effortlessly perceive spatial, form cognitive, cognitive representations
Abstract:Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.
55. 【2605.31142】On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets
Link: https://arxiv.org/abs/2605.31142
Authors: Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: play crucial role, multilingual text embedding, multi-task settings remains, remains insufficiently understood, text embedding models
Abstract:Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). They enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.
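As a rough illustration of what sensitivity to the aggregation method can look like, the sketch below ranks the same models under two simple schemes (mean score and Borda count) and measures their agreement with Kendall's tau. These two schemes are stand-ins chosen for brevity, not the multi-criteria decision-making schemes used in the paper, and the model names and scores are invented.

```python
from itertools import combinations


def rank_by_mean(scores):
    """Order models by average score across datasets, best first."""
    return sorted(scores, key=lambda m: -sum(scores[m]) / len(scores[m]))


def rank_by_borda(scores):
    """Order models by Borda points: on each dataset the worst model
    earns 0 points, the next one 1 point, and so on."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for d in range(n_datasets):
        ordered = sorted(models, key=lambda m: scores[m][d])  # worst first
        for pts, m in enumerate(ordered):
            points[m] += pts
    return sorted(models, key=lambda m: -points[m])


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical per-dataset scores for three hypothetical models.
scores = {
    "model_a": [0.90, 0.40, 0.45],
    "model_b": [0.60, 0.55, 0.50],
    "model_c": [0.50, 0.50, 0.40],
}
mean_rank = rank_by_mean(scores)    # ['model_a', 'model_b', 'model_c']
borda_rank = rank_by_borda(scores)  # ['model_b', 'model_a', 'model_c']
tau = kendall_tau(mean_rank, borda_rank)  # 1/3
```

Here one large win on a single dataset puts `model_a` on top under the mean, while `model_b` wins under Borda because it is ahead on more datasets. A tau well below 1 signals that the conclusion about the best model depends on the aggregation choice.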
56. 【2605.31140】EvoDefense: Co-Evolving Black-Box Defense with Large Language Models
Link: https://arxiv.org/abs/2605.31140
Authors: Yu Li, Yuenan Hou, Yingmei Wei, Yanming Guo, Chaochao Lu
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, remain highly vulnerable, Language Models, remain highly
Abstract:Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At the core of EvoDefense is a continuous attack-defense evolution loop, where an attack generator and the guard model iteratively refine their attack strategies and defense policies through experience-guided optimization. This design enables EvoDefense to generalize across unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks, while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.
57. 【2605.31136】Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages
Link: https://arxiv.org/abs/2605.31136
Authors: Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl
Categories: Computation and Language (cs.CL)
Keywords: check-worthiness detection identifies, requiring verification based, Citation Needed Detection, identifies claims requiring, claims requiring verification
Abstract:In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at this https URL
58. 【2605.31126】Not All Synthetic Data Is Yours to Learn From
Link: https://arxiv.org/abs/2605.31126
Authors: Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: plain text sampled, language model improve, improve from plain, model improve, reward model
Abstract:Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.
59. 【2605.31113】TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices
Link: https://arxiv.org/abs/2605.31113
Authors: Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl
Categories: Computation and Language (cs.CL)
Keywords: Automatically detecting machine-generated, Automatically detecting, detecting machine-generated text, user-generated content
Abstract:Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., "Write an article about machine learning."). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-Bench, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10-40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data, even across domains, but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-Bench therefore provides a critical foundation for developing and evaluating future models.
60. 【2605.31105】GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs
Link: https://arxiv.org/abs/2605.31105
Authors: Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai
Categories: Computation and Language (cs.CL)
Keywords: Large language models, extended context lengths, context lengths rely, Large language, language models
Comments: 21 pages, 7 figures
Abstract:Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.
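The abstract describes merging as a regression that matches compressed-cache attention outputs to full-cache outputs. One possible reading, for a single attention head, is sketched below under strong assumptions: a set of probe queries `Q` is available, the retained token indices `keep` are already chosen by some eviction rule, and only the retained values are adjusted by a ridge step that is regularized toward their original values. The function `ridge_merge`, the probe queries, and the value of `lam` are illustrative and not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def ridge_merge(Q, K, V, keep, lam=0.1):
    """Adjust retained values so compressed attention approximates full attention.

    Solves  min_V'  ||A V' - O||^2 + lam * ||V' - V[keep]||^2,
    where O is the full-cache output and A holds the attention weights
    over the retained keys only.
    """
    d = Q.shape[1]
    full_out = softmax(Q @ K.T / np.sqrt(d)) @ V   # full-cache attention output
    A = softmax(Q @ K[keep].T / np.sqrt(d))        # weights over retained keys
    residual = full_out - A @ V[keep]              # information lost by eviction
    gram = A.T @ A + lam * np.eye(len(keep))
    delta = np.linalg.solve(gram, A.T @ residual)  # closed-form ridge update
    return V[keep] + delta, A, full_out


rng = np.random.default_rng(0)
Q = rng.normal(size=(32, 8))   # hypothetical probe queries
K = rng.normal(size=(16, 8))   # cached keys
V = rng.normal(size=(16, 4))   # cached values
keep = np.arange(0, 16, 2)     # retain every other token (arbitrary choice)

V_new, A, full_out = ridge_merge(Q, K, V, keep)
err_evict = np.linalg.norm(A @ V[keep] - full_out)  # plain eviction
err_merge = np.linalg.norm(A @ V_new - full_out)    # eviction plus ridge merge
```

Because leaving the retained values unchanged is itself a feasible solution of the ridge objective, `err_merge` can never exceed `err_evict` on the probe queries. The penalty `lam` limits how far the retained values drift, which loosely mirrors the regularization against over-smoothing mentioned in the abstract.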
61. 【2605.31099】KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning
Link: https://arxiv.org/abs/2605.31099
Authors: Dominik Soós, Meng Jiang, Jian Wu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: important medium, medium to communicate, communicate discoveries, research communities, Science
Abstract:Science news is an important medium to communicate discoveries between the research communities and the public. Yet, most metrics for generated or summarized text evaluate semantic similarity and factual consistency, but do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gained after reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator. We use it to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.
62. 【2605.31086】Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory
Link: https://arxiv.org/abs/2605.31086
Authors: Han Zhang, Zihao Tang, Xin Yu, Xiao Liu, Yeyun Gong, Haizhen Huang, Yan Lu, Weiwei Deng, Feng Sun, Qi Zhang, Hanfang Yang
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, underlying personas tend, long-term semantic consistency, lack long-term semantic
Abstract:In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios, interactions between users and assistants involve more diverse, heterogeneous data streams, such as documents and emails. These shortcomings significantly limit the realism and effectiveness of current evaluations. To address these limitations, we introduce RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory). Driven by meticulously crafted user profiles and a novel LOOP (pLan-rOllout-evOlve-Prune) module, we construct realistic dialogues across diverse interaction scenarios that exhibit dynamic temporal evolution and long-term coherence. Crucially, these dialogues are deeply integrated with heterogeneous external sources synchronized with the user's temporal event trajectory. The resulting benchmark encompasses challenging question-answer pairs spanning seven inquiry types, with each question mapping to at least one of 27 critical memory characteristics that we identify as essential yet underexplored in current research. Comprehensive experiments across full-context models, retrieval-augmented generation (RAG) methods, and representative memory frameworks reveal that contemporary approaches still expose critical weaknesses in complex, real-world settings, particularly in resolving multi-source aggregation and real-world contextual reasoning.
63. 【2605.31080】A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
Link: https://arxiv.org/abs/2605.31080
Authors: Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: Blind and low-vision, visual art descriptions, on-premise vision-language models, audiences remain underserved, vision-language models
Comments: 7 pages, 2 figures, 3 tables. Preprint
Abstract:Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.
64. 【2605.31073】ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails
Link: https://arxiv.org/abs/2605.31073
Authors: Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue
Categories: Computation and Language (cs.CL)
Keywords: generating explicit rationales, moderation by generating, generating explicit, Reasoning-based LLM guardrails, Reasoning-based LLM
Comments: 18 pages, 9 figures
Abstract:Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.
65. 【2605.31069】Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
Link: https://arxiv.org/abs/2605.31069
Authors: Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Accurately predicting future, Accurately predicting, Large Language Models, long-video event prediction, predicting future events
Abstract:Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.
66. 【2605.31062】AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering
Link: https://arxiv.org/abs/2605.31062
Authors: Yuxin Wang, Jiahao Lu, Qifeng Wu, Shicheng Fang, Chuanyuan Tan, Yining Zheng, Xuanjing Huang, Xipeng Qiu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, achieved remarkable performance, achieved remarkable
Abstract:Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to "over-thinking," where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.
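The abstract does not give the form of the quality-gated efficiency reward, so the following is only one plausible shape for such a reward. It assumes a scalar answer-quality score in [0, 1] and grants a bonus for saved thinking tokens only once quality clears a threshold. The function name, the threshold `gate`, the weight `alpha`, and the token `budget` are all hypothetical.

```python
def gated_reward(quality, think_tokens, budget=512, gate=0.8, alpha=0.2):
    """Illustrative quality-gated efficiency reward.

    Below the quality gate, the reward is the quality score alone, so a
    policy cannot gain reward by shortening its reasoning at the expense
    of correctness. At or above the gate, unused thinking budget earns a
    bonus of at most `alpha`.
    """
    if quality < gate:
        return quality
    saved = max(0.0, 1.0 - think_tokens / budget)  # fraction of budget unused
    return quality + alpha * saved


short_correct = gated_reward(quality=1.0, think_tokens=128)  # bonus applies
long_correct = gated_reward(quality=1.0, think_tokens=512)   # no tokens saved
short_wrong = gated_reward(quality=0.5, think_tokens=0)      # gate blocks the bonus
```

Under this shape, a correct answer with a short trace outranks a correct answer with a long one, while a wrong answer gains nothing from being brief.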
67. 【2605.31058】Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination
Link: https://arxiv.org/abs/2605.31058
Authors: Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: Large Language Models, Reinforcement Learning, Large Language, remarkable coding abilities, Language Models
Comments: Work in progress
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that target the edge of the model's competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the size of its synthesis. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.
68. 【2605.31056】How Much Do LLMs Know About Chinese Zero Pronouns?
Link: https://arxiv.org/abs/2605.31056
Authors: Yifei Li, Guanyi Chen, Tingting He
Categories: Computation and Language (cs.CL)
Keywords: pervasive linguistic phenomenon, language processing systems, Large Language Models, natural language processing, Chinese ZPs
Abstract:Zero Pronouns (ZPs) are a pervasive linguistic phenomenon in pro-drop languages such as Chinese and have long posed a challenge for natural language processing systems. Although Large Language Models (LLMs) perform well on many Chinese language tasks, their ability to process ZPs remains poorly understood. We conduct a systematic investigation of LLMs' handling of Chinese ZPs through a sequence of linguistically motivated tasks, including identification, referentiality classification, referential type classification, resolution, and translation. A diverse set of LLMs is evaluated across all tasks. Our results show that Chinese ZPs remain highly challenging for current LLMs, particularly for upstream tasks such as identification and referentiality classification. Performance on downstream tasks, such as ZP translation, is also consistently low: even state-of-the-art reasoning-oriented LLMs correctly translate fewer than half of Chinese ZPs into English.
69. 【2605.31042】From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Link: https://arxiv.org/abs/2605.31042
Authors: Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: evolving from conversational, conversational chatbots, chatbots to operational, LLM, LLM agents
Comments: Code and data are available at [this https URL](https://github.com/RUC-NLPIR/ClawTrojan)
Abstract:LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.
70. 【2605.31025】TRACE: Discovering Task-Specific Parameter via Adaptation-Aware Probing for Continual Fine-Tuning
Link: https://arxiv.org/abs/2605.31025
Authors: Xiaosong Han, Ke Chen, Xindi Dai, Di Liang, Minlong Peng, Wei Pang, Fausto Giunchiglia, Xiaoyue Feng, Yonghao Liu, Renchu Guan
Categories: Computation and Language (cs.CL)
Keywords: previously learned skills, preserve previously learned, real-world deployment, learned skills, adapted continually
Comments: KDD2026
Abstract:In real-world deployment, LLMs are often adapted continually across tasks to stay up to date in production, where new fine-tuning should preserve previously learned skills. However, indiscriminately mixing tasks can dilute task specialization, while sequential fine-tuning (full-parameter or low-rank adaptation) often causes catastrophic forgetting due to destructive overwriting. Replay-based continual tuning and maintaining separate task-specific adapters can mitigate forgetting, but introduce additional compute, storage, and management overhead. Recognizing the redundancy of LLM parameters for any single task, we reframe continual task adaptation as task-specific parameter discovery via adaptation-aware probing: a short warm-start probe exposes a task's adaptation trace, enabling us to identify and isolate the small subset of parameters essential for each task to mitigate catastrophic forgetting. Building on this view, we introduce TRACE, a novel approach for discovering Task-specific paRameters via Adaptation-aware probing for Continual finE-tuning. We perform a short warm-start fine-tune to derive task-specific core parameters by comparing the warm-started and pre-trained models. Core parameters are identified via two strategies: importance scoring ($L_2$ norm and Fisher Information) and specificity analysis (cosine similarity of parameter updates). In continual fine-tuning settings, only the active task's core parameters are updated while others remain frozen, preserving prior knowledge. We conduct extensive experiments across multiple standard benchmarks to demonstrate the superior performance of our proposed method. Additionally, we validate the generalization of our method through a cross-model and scale transferability study, demonstrating a "small-to-large" paradigm that guides the fine-tuning of large-scale models under resource constraints.
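A toy version of the probe-then-freeze idea in the abstract can be written as follows. It covers only the norm-based importance score on the difference between warm-started and pre-trained weights, and it omits Fisher Information and the cosine-similarity specificity analysis. Scoring whole tensors rather than individual weights, the tensor names, and the helpers `core_parameters` and `masked_step` are simplifying assumptions for illustration.

```python
import math


def core_parameters(pretrained, warm_started, top_k):
    """Pick the top_k parameter tensors with the largest L2 norm of
    (warm_started - pretrained), i.e. the ones the short probe moved most."""
    scores = {}
    for name, base in pretrained.items():
        delta = [w - b for w, b in zip(warm_started[name], base)]
        scores[name] = math.sqrt(sum(x * x for x in delta))
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])


def masked_step(params, grads, core, lr=0.1):
    """One SGD step that updates only the core tensors; all others stay frozen."""
    return {
        name: [p - lr * g for p, g in zip(values, grads[name])]
        if name in core
        else list(values)
        for name, values in params.items()
    }


# Hypothetical two-element "tensors" before and after a short warm-start probe.
pretrained = {"attn.q": [0.0, 0.0], "attn.v": [1.0, 1.0], "mlp.up": [0.5, -0.5]}
warm_started = {"attn.q": [0.3, -0.4], "attn.v": [1.0, 1.01], "mlp.up": [0.5, -0.4]}

core = core_parameters(pretrained, warm_started, top_k=1)  # {'attn.q'}
grads = {name: [1.0, 1.0] for name in pretrained}
updated = masked_step(warm_started, grads, core)
```

Only `attn.q` changes in `updated`; the other tensors are copied through untouched, which is the mechanism the abstract relies on to preserve earlier tasks.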
71. 【2605.31021】A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI
Link: https://arxiv.org/abs/2605.31021
Authors: Atahan Karagoz
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: aggregated statistical baselines, Current alignment paradigms, artificial intelligence rely, intelligence rely predominantly, generative artificial intelligence
Abstract:Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the plurality of human judgment to aggregated statistical baselines, thereby obscuring cultural, demographic, and contextual variability in evaluation. We introduce a state-space constrained emulation framework for AI evaluation that replaces singular assessment functions with a structured manifold of synthetic cognitive profiles representing diverse human perspectives. We show that modern generative architectures can instantiate and maintain these evaluative personas with high consistency, enabling a form of pluralistic, perspective-dependent benchmarking that more closely reflects real-world consensus variability. However, we further analyze the stability of these simulated evaluators under sequential inference and stochastic prompt perturbations, revealing systematic degradation in persona coherence that manifests as state-space drift and semantic inconsistency. These findings suggest that static alignment constraints are insufficient for sustaining robust evaluative behavior over time. Instead, we argue for the necessity of embedding dynamic, viability-driven regulatory mechanisms within generative systems to preserve coherent cognitive emulation. By framing persona-based evaluation as a structured dynamical system over latent representation manifolds, this study provides a foundation for more adaptive, human-aligned, and context-sensitive approaches to AI evaluation.
72. 【2605.31010】MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2605.31010
Authors: Zheng Yuan,Chuang Zhou,Linhao Luo,Siyu An,Di Yin,Xing Sun,Xiao Huang
Categories: Computation and Language (cs.CL)
Keywords: ground large language, large language models, Retrieval-augmented generation, intensively studied
Comments:
Abstract:Retrieval-augmented generation is intensively studied to ground large language models on external evidence. However, retrieving from a unified knowledge base could inevitably introduce irrelevant information that may mislead generation for complex reasoning. Inspired by the conditional computation of mixture of experts (MoE), where a router sparsely selects specialized experts alongside shared ones for each input, we propose Mixture of experts for Graph-based Retrieval-Augmented Generation, i.e., MoG. It organizes knowledge into two core components: (i) diverse, always-accessible hub graphs that encode semantically and structurally central knowledge and provide contextual clues for expert activation, and (ii) sparsely activated expert graphs that contain domain-specific evidence. MoG first accesses hub graphs to identify general evidence and derive contextual clues. Then, a topology-aware router dynamically activates a limited set of expert graphs conditioned on the query, thereby confining retrieval to a focused evidence subspace. Extensive experiments on challenging benchmarks show that MoG consistently outperforms strong baselines, with over 20% relative improvement on MuSiQue. Our code is available in this https URL.
73. 【2605.30995】Traceable by Design: An LLM Pipeline and Dashboard for EU Regulatory Consultation Analysis
Link: https://arxiv.org/abs/2605.30995
Authors: Thales Bertaglia,Haoyang Gui,Catalina Goanta,Gerasimos Spanakis
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords: generate large volumes, Public consultations generate, consultations generate large, Digital Fairness Act, European Commission Digital
Comments:
Abstract:Public consultations generate large volumes of data in the form of stakeholder submissions that are practically unfeasible to analyse manually. We present an end-to-end LLM-based pipeline and interactive dashboard for structured topic extraction from regulatory consultation submissions, demonstrated on the European Commission's Digital Fairness Act (DFA) public call for evidence as a case study. The system processes raw PDF attachments and web-form responses, extracts topic annotations, and grounds every extraction in a verbatim quote from the source text. Applied to 4,322 DFA submissions, the pipeline produced 15,368 topic annotations supported by 20,951 verbatim evidence quotes. Three principles govern the proposed design: verbatim grounding, full traceability, and transparency by design. The dashboard exposes the full extraction dataset through five analytical views, from dataset-level topic overviews to individual paragraph drill-downs, with every result traceable to its source. Beyond the predefined DFA topic categories, the pipeline generated certain stakeholder concerns, such as Age Verification, Payment Processor Censorship, and Digital Ownership, that a fixed-taxonomy approach would have missed. The pipeline is domain-generic; adapting it to a new consultation requires only a prompt update and a new dataset. A live demo is available at this https URL. The code and processed data are publicly available at this https URL.
74. 【2605.30984】Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation
Link: https://arxiv.org/abs/2605.30984
Authors: Tom Maye-Lasserre,Yitong Li,Bailiang Jian,Morteza Ghahremani,Benedikt Wiestler,Christian Wachinger
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: exhibit critically low, critically low pathology, generate fluent radiology-style, fluent radiology-style text, Template Collapse
Comments:
Abstract:Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.
75. 【2605.30981】Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement
Link: https://arxiv.org/abs/2605.30981
Authors: Riju Marwah,Ritvik Garimella,Vishal Pallagani,Atishay Jain,Michael Stewart,Amit Sheth
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: producing repetitive text, losing instruction adherence, Autoregressive language models, Autoregressive language, exhibiting unstable entropy
Comments: 9 pages, 7 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Autoregressive language models frequently degrade during long-horizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real-time as they occur. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability) enabling reliable runtime monitoring. Across nine models (1B-13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (Spearman rho = 0.94), and reveal non-monotonic scaling behavior: instruction-tuned models below 3B exhibit faster collapse than base models, with this trend reversing at 7B. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.
76. 【2605.30966】Reading Between the Citations: A Typed Claim Network for Scientific Literature
Link: https://arxiv.org/abs/2605.30966
Authors: Ning Ding,Sergio J. Rodríguez Méndez,Pouya G. Omran
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Knowledge graphs, legal opinions, encode the topology, graphs over corpora, scholarly inter-referencing documents
Comments:
Abstract:Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community-level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross-document reference is reified as a typed claim, carrying source, target, claim text, and a four-class stance label grounded in the citation-intent literature. We give a construction pipeline applicable to any corpus of scholarly inter-referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated-stance summarisation, and topological analytics. Head-to-head evaluation against standard Retrieval-Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.
77. 【2605.30961】EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation
Link: https://arxiv.org/abs/2605.30961
Authors: Xu Li,Hanzhe Tu,Xinyi Li,Kuncheng Zhao,Xun Han,Zhonghui Liu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, scientific progress, Generating
Comments: 21 pages, 6 figures
Abstract:Generating novel research ideas is fundamental to scientific progress. While Large Language Models (LLMs) show promise in assisting this process, existing approaches often exhibit semantic convergence, resulting in limited diversity and novelty. To address this, we introduce EvoGens, an evolution-inspired framework that recasts scientific idea generation as an evolutionary search over a population of ideas. EvoGens iteratively applies rank-based mutation with differentiated retrieval planning to incorporate external knowledge, and semantic-aware crossover to fuse complementary concepts for conceptual reorganization. A lightweight evaluation signal guides the selection process, encouraging sustained exploration while mitigating premature convergence. Extensive experiments demonstrate that EvoGens substantially enhances exploration capabilities compared to state-of-the-art baselines. Specifically, it improves the Novelty from 0.1 to 0.4 and the Diversity from 0.24 to 0.55, while maintaining comparable idea quality under the current automatic evaluation protocol. These findings suggest that evolutionary mechanisms can serve as a useful framework for exploration-oriented research ideation, especially for broadening the novelty and diversity of candidate ideas under a shared automatic evaluation setting.
78. 【2605.30947】Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship
Link: https://arxiv.org/abs/2605.30947
Authors: Yating Pan(1 and 2),Jiajun Zhang(2),Jun Wang(1, 2 and 4),Qi Su(3 and 4) ((1) Department of Information Management, Peking University, (2) Research Center for Digital Humanities, Peking University, (3) School of Foreign Languages, Peking University, (4) Institute for Artificial Intelligence, Peking University)
Categories: Computation and Language (cs.CL)
Keywords: LLM-based research agents, science and engineering, executable experiments, quantitative signals, advanced rapidly
Comments: 28 pages, 3 figures. Code, data catalogues, and reproduction scripts: https://github.com/YatingPan/SPIRE. Lead corresponding author: Jun Wang; corresponding author: Qi Su
Abstract:LLM-based research agents have advanced rapidly in science and engineering, where research is organized around executable experiments, code, and quantitative signals. Humanities scholarship, however, requires a different mode of reasoning: interpretive, evidence-grounded argument over primary sources, where scholarly value depends on faithful quotation, verifiable provenance, and close reading. Existing research agents remain largely optimized for execution and retrieval, not evidence-grounded interpretive reasoning. To address this gap, we introduce SPIRE (Scholarly-Primitives-Inspired Research Engine), a multi-agent framework for evidence-grounded humanities scholarship. Drawing on Scholarly Primitives theory, SPIRE casts recurring humanities operations as cooperating agent roles (source discovery, evidence annotation, comparison, provenance checking, sampling, citation binding, and argumentative synthesis) over a multi-scale close-reading substrate of passages, intra-context graph communities, and cross-context semantic clusters. On a peer-reviewed-paper benchmark over classical Chinese and Greco-Roman Latin scholarship, SPIRE recovers cited primary-source evidence more reliably than Naive LLM, Text RAG, and GraphRAG, and receives higher blind-judge scores on answer accuracy, depth, coverage, and evidence quality. Ablations show that both the scholarly-operation agents and close-reading retrieval contribute to evidence-grounded essays. Code, data catalogues, and reproduction scripts are released at this https URL.
79. 【2605.30934】Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity
Link: https://arxiv.org/abs/2605.30934
Authors: Nattavudh Powdthavee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, variation remains unclear, exhibit systematic differences, Large language, exhibit systematic
Comments: 44 pages
Abstract:Large language models (LLMs) exhibit systematic differences in moral reasoning across languages, yet the source of this variation remains unclear. We test the hypothesis that languages encode aspects of the institutional environments in which they are spoken, allowing LLMs to inherit institution-specific moral priors through training. Across nine languages spanning a broad gradient of institutional quality, six frontier LLMs, and two preregistered studies, we examine moral dilemmas whose acceptability depends on institutional functioning. In Study 1, explicit institutional framing produced uniformly null results: cross-linguistic moral divergence did not increase in institutionally contingent scenarios, nor did it track institutional differences between language communities. In Study 2, we introduced institutionally ambiguous scenarios in which institutional stakes were present but not explicitly stated. Under these conditions, cross-linguistic moral divergence increased relative to institutionally inert controls and, with one theoretically informative exception, was associated with real-world institutional differences between language communities. Explicit framing again attenuated these effects. These findings suggest that institutional experience may leave detectable traces in language that shape LLM moral reasoning, while also indicating that explicit institutional cues can suppress the expression of those differences.
80. 【2605.30931】MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft
Link: https://arxiv.org/abs/2605.30931
Authors: Tianjie Ju,Yueqing Sun,Zheng Wu,Wei Zhang,Yaqi Huo,Xi Su,Qi Gu,Xunliang Cai,Gongshen Liu,Zhuosheng Zhang
Categories: Computation and Language (cs.CL)
Keywords: Multimodal large language, Multimodal large, large language models, action generation, large language
Comments: Work in progress
Abstract:Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge to better reflect general open-world reasoning. Then we organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging, as strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at this https URL.
81. 【2605.30930】TUX: Measuring Human-AI Tacit Understanding
Link: https://arxiv.org/abs/2605.30930
Authors: Yueshen Li,Hanyi Min,Vedant Das Swain,Koustuv Saha
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: explicit task success, large language models, increasingly act, reward optimization, large language
Comments:
Abstract:As large language models (LLMs) increasingly act as collaborative partners, human-AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. We find that nearest human-agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.
82. 【2605.30924】EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents
Link: https://arxiv.org/abs/2605.30924
Authors: Dongwook Choi,Taeyoon Kwon,Bogyung Jeong,Minju Kim,Yeonjun Hwang,Hyojun Kim,Byungchul Kim,Young Kyun Jang,Jinyoung Yeo
Categories: Computation and Language (cs.CL)
Keywords: MLLM-powered embodied agents, embodied agents deployed, MLLM-powered embodied, environments encounter physical, real-world environments encounter
Comments: Accepted at ICML 2026
Abstract:MLLM-powered embodied agents deployed in real-world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action-conditioned risks, leading agents to either miss risky interactions or over-identify risks. To address this, we propose EMBGuard, the first MLLM-based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGuard identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGuard, we contribute EMBHazard, a training dataset of 15.1K action-conditioned pairs, and EMBGuardTest, a benchmark of 329 manually curated real-world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGuard achieves performance competitive with proprietary MLLMs (e.g., GPT-5.1, Gemini-2.5-Pro) while significantly reducing the false-positive rates that hinder real-time deployment. We make the code, data, and models publicly available at this https URL.
83. 【2605.30913】Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
Link: https://arxiv.org/abs/2605.30913
Authors: Soorya Ram Shimgekar,Agam Goyal,Amruta Parulekar,Joshua Chen,Yian Wang,Navin Kumar,Hari Sundaram,Eshwar Chandrasekharan,Koustuv Saha
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Keywords: Large language models, Large language, semantically equivalent prompts, user tone ranges, degrade factual reliability
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
84. 【2605.30912】Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR
Link: https://arxiv.org/abs/2605.30912
Authors: Ruina Hu,Chen Wang,Lai Wei,Jionghao Bai,Bin Yu,Weiran Huang,Kai Wang,Yue Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Reinforcement learning, improves vision-language models, outcome rewards derived, optimizing outcome rewards, improves vision-language
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
85. 【2605.30907】BlueFin: Benchmarking LLM Agents on Financial Spreadsheets
Link: https://arxiv.org/abs/2605.30907
Authors: Srivatsa Kundurthy,Clara Na,Colton Moraine,Anoushka Mohta,Case Winter,George Fang,John Ling,Emma Strubell,Zach Kirshner
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: tasks large language, professional finance domain, large language model, professional finance roles, professional finance
Comments: 26 pages
Abstract:We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions -- an order of magnitude more than the estimated global population of professional developers -- comparatively fewer resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. In response, we curate a set of 131 challenging, complex tasks with real-world relevance in the domain, containing 3,225 granular rubric criteria; notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high-quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus ($\alpha=0.826$) with a macro-F1 score of 0.839. Frontier LLMs demonstrate poor performance on the challenging benchmark, with the strongest LLMs achieving less than 50% average scores across tasks -- models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.
86. 【2605.30898】UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling
Link: https://arxiv.org/abs/2605.30898
Authors: Kaiyu Huang,Xingyu Wang,Mingze Kong,Zhubo Shi,Yuqian Hou,Hong Xu,Zhongxiang Dai,Minchen Yu,Qingjiang Shi
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: balancing inference quality, large language models, central challenge, real-world deployments, deployments of large
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
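Since this abstract frames joint routing and test-time scaling as a contextual bandit solved with LinUCB, a small Python sketch of disjoint LinUCB may help. It is a generic textbook-style version, not the UniScale code: the `LinUCBArm` class, the `choose_arm` helper, the example arm list of (model, sample-count) pairs, and the idea of feeding in a reward such as quality minus a cost penalty are all assumptions made for illustration.

```python
import math
from typing import List


class LinUCBArm:
    """One arm of disjoint LinUCB, storing A^{-1} and b for a ridge model."""

    def __init__(self, dim: int, alpha: float = 1.0) -> None:
        self.alpha = alpha
        # A starts as the identity, so A^{-1} is the identity as well
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v: List[float]) -> List[float]:
        return [sum(r * vi for r, vi in zip(row, v)) for row in self.a_inv]

    def ucb(self, x: List[float]) -> float:
        theta = self._a_inv_times(self.b)           # ridge estimate A^{-1} b
        ax = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = self.alpha * math.sqrt(sum(a * xi for a, xi in zip(ax, x)))
        return mean + bonus

    def update(self, x: List[float], reward: float) -> None:
        ax = self._a_inv_times(x)
        denom = 1.0 + sum(a * xi for a, xi in zip(ax, x))
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]


def choose_arm(bandits: List[LinUCBArm], x: List[float]) -> int:
    """Index of the arm with the highest upper confidence bound for context x."""
    return max(range(len(bandits)), key=lambda i: bandits[i].ucb(x))


# Hypothetical unified action space: (model name, number of sampled candidates)
ARMS = [("small", 1), ("small", 4), ("large", 1), ("large", 4)]
```

Each request would be encoded as a context vector `x`, the chosen arm executed, and the observed reward passed back through `update`, so the policy adapts online.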
87. 【2605.30888】The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
Link: https://arxiv.org/abs/2605.30888
Authors: Xiaobo Wang,Tong Wu,Min Tang,Jiaqi Li,Qi Liu,Zilong Zheng
Categories: Computation and Language (cs.CL)
Keywords: Building strong reward, reliable preference data, Building strong, language model alignment, strong reward models
Comments:
Abstract:Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. This bottleneck becomes dramatically worse as the policy evolves beyond the static RM's training. Therefore, we propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that grades on-policy responses as feedback by using the value function for on-policy RM training. SAVE naturally converts the reward-graded on-policy responses into supervision with a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. The effectiveness of SAVE for enhancing RM training is validated through empirical evaluation across six diverse benchmarks. It achieves superior results across all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.
88. 【2605.30880】PatchWorld: Gradient-Free Optimization of Executable World Models
Link: https://arxiv.org/abs/2605.30880
Authors: Jiaxin Bai,Yue Guo,Yifei Dong,Jiaxuan Xiong,Tianshi Zheng,Yixia Li,Tianqing Fang,Yufei Li,Yisen Gao,Haoyu Huang,Zhongwei Xie,Hong Ting Tsang,Zihao Wang,Lihui Liu,Jeff Pan,Yangqiu Song
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: partially observable Markov, Markov decision processes, simulator latent state, observable Markov decision, observable Markov
Comments: 40 pages
Abstract:Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at this https URL.
89. 【2605.30876】dMoE: dLLMs with Learnable Block Experts
Link: https://arxiv.org/abs/2605.30876
Authors: Sicheng Feng,Zigeng Chen,Gongfan Fang,Xinyin Ma,Xinchao Wang
Categories: Computation and Language (cs.CL)
Keywords: Diffusion Large Language, Large Language Models, Diffusion Large, Large Language, naturally supporting parallel
Comments: Work in progress. Code is available at: https://github.com/fscdc/dMoE
Abstract:Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: this https URL
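The block-level routing idea in this abstract can be contrasted with token-level routing in a short Python sketch. The abstract does not say how token-level expert distributions are aggregated, so the simple mean used here is an assumption, and the function names `block_route` and `token_route_union` are hypothetical.

```python
from typing import List


def block_route(token_probs: List[List[float]], top_k: int) -> List[int]:
    """Pick one shared top-k expert set for a whole block of tokens.

    Assumption: the block-level distribution is the mean of the
    token-level router distributions.
    """
    num_experts = len(token_probs[0])
    block = [sum(tok[e] for tok in token_probs) / len(token_probs)
             for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: block[e], reverse=True)
    return sorted(ranked[:top_k])


def token_route_union(token_probs: List[List[float]], top_k: int) -> List[int]:
    """Experts activated when every token picks its own top-k independently."""
    active = set()
    for tok in token_probs:
        ranked = sorted(range(len(tok)), key=lambda e: tok[e], reverse=True)
        active.update(ranked[:top_k])
    return sorted(active)


if __name__ == "__main__":
    probs = [
        [0.60, 0.30, 0.05, 0.05],
        [0.10, 0.20, 0.60, 0.10],
        [0.50, 0.10, 0.10, 0.30],
    ]
    print(token_route_union(probs, top_k=2))  # experts loaded with per-token routing
    print(block_route(probs, top_k=2))        # experts loaded with block routing
```

With per-token routing the number of distinct experts that must be loaded grows with block size, whereas a shared block-level choice caps it at `top_k`, which is the memory effect the abstract describes.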
90. 【2605.30857】MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning
Link: https://arxiv.org/abs/2605.30857
Authors: Yi Bai, Wenhao Zhang, Yao Chen, Jiao Xue, Zhumin Chen, Pengjie Ren
Categories: Computation and Language (cs.CL)
Keywords: core set, instruction fine-tuning data, large language models, Instruction fine-tuning, Diverse Core Set
Comments:
Click to view abstract
Abstract:Instruction fine-tuning is employed to enhance the instruction-following ability of large language models (LLMs). As the amount of instruction fine-tuning data increases, selecting the optimal core set becomes particularly important. However, ensuring the diversity of the core set remains a significant challenge. Existing methods predominantly distinguish different training data based on the text features themselves, decoupled from LLMs' own understanding and representation of the data. To address this issue, we propose a Model-Aware Diverse Core Set Selection method, which distinguishes data features based on the neural activation states during LLM inference. This approach serves as an efficient instantiation of coverage-based selection using model-intrinsic activation features to ensure the diversity in the core set. We extensively evaluate our method on six benchmarks that cover five distinct tasks. In our method, the core set selected by the 3B-parameter LLM performs effectively when utilized to fine-tune larger models with 7B, 8B, and 13B parameters. Experimental results on the Alpaca-GPT4 dataset, which comprises 52K instruction-response pairs, show that the core set, sized at 15% of the original dataset and selected by Llama-3.2-3B-Instruct, achieves an average improvement of 2.5% when fine-tuning four larger base models compared with training on the full dataset. The experimental results demonstrate that our method enhances model performance on multiple downstream tasks while reducing data requirements.
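Coverage-based selection, which the abstract names as the underlying principle, can be sketched with the classic greedy maximum-coverage heuristic. This is only an assumed stand-in: the abstract does not say how MADS derives or scores activation features, so the per-sample sets of "activated neuron ids" and the function `greedy_coverage` are hypothetical.

```python
from typing import Dict, List, Set


def greedy_coverage(features: Dict[str, Set[int]], budget: int) -> List[str]:
    """Greedily pick samples whose features cover the most not-yet-covered ids."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(features)
    while remaining and len(chosen) < budget:
        # Ties are broken alphabetically because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda k: len(remaining[k] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered; stop early
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical activation features: sample id -> set of strongly activated neuron ids.
activations = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {1, 2},
    "d": {5},
}
core_set = greedy_coverage(activations, budget=2)
print(core_set)
```

Sample `c` is never selected here because everything it activates is already covered by `a`, which is the sense in which coverage encourages diversity in the selected core set.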
91. 【2605.30852】Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism
Link: https://arxiv.org/abs/2605.30852
Authors: Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei
Categories: Computation and Language (cs.CL)
Keywords: low-concurrency LLM inference, accelerates low-concurrency LLM, inference by employing, Speculative Pipeline Decoding, low-concurrency LLM
Comments:
Click to view abstract
Abstract:Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows the LLM to process $n$ tokens in parallel to accelerate decoding. To continuously fill the pipeline in single-sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step, which yields bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at this https URL
92. 【2605.30848】LLM Anonymization Against Agentic Re-Identification
Link: https://arxiv.org/abs/2605.30848
Authors: Ziwen Li, Jianing Wen, Tianshi Li
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: web search change, carry downstream analytic, weak contextual cues, LLMs with web
Comments: 32 pages, 7 figures
Click to view abstract
Abstract:Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become cross-referenceable evidence for re-identification, yet those same details also carry downstream analytic value of the text. Existing defenses either remove explicit identifiers, perturb text for formal privacy, or test rewritten text against non-web inference models, leaving underexplored the operating region between resistance to agentic web-search re-identification and utility retention. We introduce AURA (Anonymization with Utility-Retention Adaptation), an LLM-powered mask-reconstruct framework that decouples privacy localization from utility-preserving reconstruction and selects candidates with adversarial privacy and utility-retention checks. We evaluate AURA on real-user interview transcripts using re-identification attacks carried out by web-search agents, along with a utility evaluation based on interviewee-profile facts, codebook facts, and the joint contextual utility grid. Our results show that AURA improves the privacy-utility frontier by using adaptive privacy scope to strengthen resistance to agentic re-identification and using a mask-reconstruct anonymization method to better preserve contextual utility under fixed privacy scope.
93. 【2605.30844】Fine-Tuning Improves Information Conveyance in Language Models
Link: https://arxiv.org/abs/2605.30844
Authors: Yuwei Cheng, Weiyi Tian, Haifeng Xu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Keywords: existing analyses overlook, analyses overlook output, entire generation rollout, overlook output length, key confounder
Comments:
Click to view abstract
Abstract:Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where "canopy" represents the space of all possible rollouts, making $\mathrm{CE}^\star$ naturally quantify the effective size of the generation space. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$ -- indeed, we show that it equals the total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $\rho(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $\rho(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at this https URL.
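The joint entropy $H(N, Y_{1:N}\mid X)$ in this abstract splits, by the standard chain rule, into a length term and a sequence-given-length term: $H(N, Y_{1:N}\mid X) = H(N\mid X) + H(Y_{1:N}\mid N, X)$. The Python sketch below checks that identity on a toy rollout distribution for a single fixed prompt. The distribution and function names are hypothetical, and this is not the paper's Canopy Entropy estimator.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Rollout = Tuple[str, ...]


def entropy_bits(probs: Iterable[float]) -> float:
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def length_and_conditional_entropy(rollouts: Dict[Rollout, float]) -> Tuple[float, float]:
    """Return (H(N | X), H(Y | N, X)) for a distribution over full rollouts."""
    by_len = defaultdict(list)
    for seq, p in rollouts.items():
        by_len[len(seq)].append(p)
    h_len = entropy_bits(sum(ps) for ps in by_len.values())
    h_seq_given_len = sum(
        sum(ps) * entropy_bits(p / sum(ps) for p in ps) for ps in by_len.values()
    )
    return h_len, h_seq_given_len


# Hypothetical distribution over complete rollouts for one prompt X.
toy_rollouts = {
    ("yes",): 0.5,
    ("no",): 0.25,
    ("it", "depends"): 0.125,
    ("not", "sure"): 0.125,
}

joint = entropy_bits(toy_rollouts.values())  # H(N, Y | X)
h_len, h_seq_given_len = length_and_conditional_entropy(toy_rollouts)
print(joint, h_len + h_seq_given_len)
```

Because the length $N$ is determined by the rollout itself, the joint entropy here is simply the entropy over complete rollouts, and the two printed values agree up to floating-point error.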
94. 【2605.30833】Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation
Link: https://arxiv.org/abs/2605.30833
Authors: Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: On-policy distillation transfers, transfers reasoning capabilities, Supervision Fidelity Decay, On-policy distillation, distillation transfers reasoning
Comments:
Click to view abstract
Abstract:On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, Lookahead Group Reward evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, Lookahead Group Reward improves mean@8 by 2.57 points over OPD for a 7B student, with gains increasing for longer generations and reaching +4.92 points on AIME-26 at 39k tokens.
95. 【2605.30826】Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage
Link: https://arxiv.org/abs/2605.30826
Authors: Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: corpus-convention correctness depends, plausible biomedical mentions, span boundaries, entity granularity, modern LLMs
Comments:
Click to view abstract
Abstract:Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.
96. 【2605.30813】Incremental BPE Tokenization
Link: https://arxiv.org/abs/2605.30813
Authors: Shenghu Jiang, Ruihao Gong
Categories: Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
Keywords: Byte Pair Encoding, Pair Encoding, incremental Byte Pair, Byte Pair, BPE
Comments: Accepted to ICML 2026 (Spotlight)
Click to view abstract
Abstract:We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: this https URL
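For context, the "standard BPE merge procedure defined by a fixed set of merge rules" that this paper's algorithm reproduces incrementally can be written naively as below. This is a from-scratch baseline sketch, not the paper's incremental algorithm. It works on characters rather than bytes for readability, and the merge table is a hypothetical toy.

```python
from typing import List, Tuple


def bpe_tokenize(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Naive BPE: repeatedly apply the highest-priority merge rule present."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    tokens = list(text)
    while len(tokens) > 1:
        # Find the adjacent pair with the best (lowest) rank.
        best_rank, best_pair = None, None
        for left, right in zip(tokens, tokens[1:]):
            r = rank.get((left, right))
            if r is not None and (best_rank is None or r < best_rank):
                best_rank, best_pair = r, (left, right)
        if best_pair is None:
            break  # no merge rule applies any more
        # Merge every non-overlapping occurrence of that pair, left to right.
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge rules, highest priority first.
toy_merges = [("a", "b"), ("ab", "c"), ("b", "c")]

# Re-tokenizing every prefix from scratch, as a streaming consumer would have to.
for end in range(1, len("abcbc") + 1):
    print("abcbc"[:end], bpe_tokenize("abcbc"[:end], toy_merges))
```

The prefix loop shows why streaming is awkward for the naive procedure: the last token changes as more input arrives (`b` becomes `bc`), and each prefix is re-tokenized from scratch, which is the cost an incremental algorithm is meant to avoid.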
97. 【2605.30804】Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit
Link: https://arxiv.org/abs/2605.30804
Authors: Jiwoo Choi, Seonwoo Ahn, Tongxin Zhang, Seohyon Jung
Categories: Computation and Language (cs.CL)
Keywords: audit six large, large language models, East Asian, Chinese, English
Comments:
Click to view abstract
Abstract:We audit six large language models (LLMs) for gender stereotyping across English, Korean, Chinese, and Japanese. Three were developed primarily for English-language use (Claude, GPT, Gemini) and three for East Asian use (DeepSeek, Syn-Pro, HyperCLOVA X). We adopt the HEXACO-100 personality inventory and anchor each model against a cross-cultural human dataset spanning 48 countries to ask not whether LLMs are biased, but how far their gender attributions drift from the populations they are deployed among. Our findings show that their stereotyping spans a range roughly 2.5 times wider than the entire cross-country range found in humans, and the effect can compound across languages. One English-centric model, prompted in Korean, reached 5 times the local baseline, even when the prompt stated the candidate had already been hired, which often dampens human stereotyping. To characterize such behaviors without ranking them, we introduce a four-pattern framework -- concordance, suppression, reorganization, and amplification -- across 24 (model x language) cells. Item-level analysis reveals that translation does not just rescale stereotypes, but changes the attributes tied to it, hiding significant rearrangement under the surface while appearing well-calibrated. Our results ultimately suggest that no single debiasing pipeline is likely to address bias evenly across linguistic boundaries.
98. 【2605.30790】On the impact of retrieved content representations in RAG Pipelines
Link: https://arxiv.org/abs/2605.30790
Authors: Jonathan J Ross, Bevan Koopman, Anton van der Vegt, Guido Zuccon
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: RAG pipelines inherit, RAG pipelines, language model input, pipelines inherit retrieval, inherit retrieval components
Comments: 23 pages, 15 figures, submitted to ACL May 2026 ARR
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.
99. 【2605.30788】XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks
Link: https://arxiv.org/abs/2605.30788
Authors: Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: detect cross-lingual gaps, large language models, introduce a set, set of synthetic, abilities of large
Comments: 8+37 pages
Click to view abstract
Abstract:We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
100. 【2605.30771】Eywa: Provenance-Grounded Long-Term Memory for AI Agents
Link: https://arxiv.org/abs/2605.30771
Authors: Resham Joshi
Categories: Computation and Language (cs.CL)
Keywords: agents that persist, persist across sessions, retrieved context, evidence, memory
Comments: 29 pages, 3 figures, 16 tables. Benchmark artifacts available at [this https URL](https://eywa.to/research)
Click to view abstract
Abstract:AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt path, making failures difficult to diagnose: a wrong answer may come from missing evidence, unsupported extraction, stale state, retrieval loss, or answer-model behavior. We present Eywa, a provenance-grounded memory architecture built around evidence before belief. Eywa stores immutable source evidence before deriving canonical facts, validates extracted memories against typed signals and source support, and retrieves bounded memory context through a deterministic multi-route read path with zero LLM calls inside retrieval. Retrieved context is returned separately from answer instructions, allowing the same memory substrate to be evaluated across frontier, budget, and local answer models. Under a frozen, artifact-recorded retrieval configuration, Eywa reaches 90.19% judge accuracy on the LoCoMo C1-C4 split with Claude Sonnet 4.6 write and QA roles. On LongMemEval-S, it reaches 88.2% retrieval-sufficiency accuracy. On BEAM, a 700-question technical-memory stress benchmark, it reaches 81.45% mean nugget score and 85.29% pass@score = 0.5. Full per-question artifacts, including questions, gold answers, model answers, retrieved context, and labels, are published at this https URL.
101. 【2605.30758】Pairwise Reference Alignment as a Model-Level Ordinal Observable
Link: https://arxiv.org/abs/2605.30758
Authors: Mujing Li
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Pairwise preference data, pairwise reference alignment, reward modeling, data is widely, language-model evaluation
Comments:
Click to view abstract
Abstract:Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses? We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^+,y^-)$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions. This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.
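The alignment observable described in this abstract, the probability that $S_M(x,y^+) > S_M(x,y^-)$ under $P_{\mathrm{pair}}$, has an obvious plug-in estimator: the fraction of sampled triples on which the model's ordering agrees with the reference. The sketch below pairs it with a Hoeffding half-width, which is one standard concentration bound for i.i.d. bounded samples. The abstract does not say which bound or centering the note uses, so `hoeffding_halfwidth`, the `2A - 1` centering, and the length-based `toy_score` are assumptions for illustration only.

```python
import math
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (prompt x, preferred y+, rejected y-)


def alignment_estimate(triples: List[Triple], score: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the model scores y+ strictly above y- (ties count as misses)."""
    agree = sum(1 for x, y_pos, y_neg in triples if score(x, y_pos) > score(x, y_neg))
    return agree / len(triples)


def hoeffding_halfwidth(n: int, delta: float) -> float:
    """With probability >= 1 - delta, |estimate - true value| <= this, for n i.i.d. pairs."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def toy_score(x: str, y: str) -> float:
    """Hypothetical scoring function: longer responses score higher."""
    return float(len(y))


toy_triples = [
    ("q1", "a long answer", "short"),
    ("q2", "ok", "a rambling reply"),
    ("q3", "detailed reply", "no"),
    ("q4", "fine answer", "meh"),
]

a_hat = alignment_estimate(toy_triples, toy_score)
centered = 2.0 * a_hat - 1.0  # assumed centering: 0 means chance-level agreement
print(a_hat, centered, hoeffding_halfwidth(len(toy_triples), delta=0.05))
```

With only four pairs the half-width is far too wide to be informative. The point of the sketch is the shape of the estimator, and that its uncertainty shrinks as $1/\sqrt{n}$ under independent sampling.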
102. 【2605.30753】Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation
Link: https://arxiv.org/abs/2605.30753
Authors: Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu, Dong Li, Emad Barsoum
Categories: Computation and Language (cs.CL)
Keywords: Diffusion-based large language, large language models, inference remains latency-heavy, parallel text generation, Diffusion-based large
Comments:
Click to view abstract
Abstract:Diffusion-based large language models (dLLMs) support parallel text generation via iterative denoising, yet inference remains latency-heavy because many steps are spent on redundant refinement and repeated remasking of tokens whose final values are already determined. Prior acceleration methods mainly depend on step-local confidence heuristics or fixed schedules, which are sensitive to prompt and task variation and ignore strong positional effects within a sequence. We cast diffusion decoding as a dynamic control problem and show that token-wise denoising trajectories provide the key signal for reliable control. We propose a trace-aware decoding framework with two components. First, Temporal-Spatial Parallel Decoding (TSPD) uses a lightweight temporal-spatial controller that consumes per-token trajectory features, including confidence, entropy, and momentum, together with token position, to decide when a token has converged and can be safely fixed. Second, we introduce Confidence Extrapolation (CE), a training-free state-space module that forecasts future logit trends with uncertainty to support proactive decisions, including safe look-ahead and targeted stabilization when trajectories are oscillatory or underconfident. Together, TSPD and CE reduce unnecessary denoising iterations while preserving output quality, and they compose cleanly with system optimizations such as KV caching.
103. 【2605.30736】OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning
Link: https://arxiv.org/abs/2605.30736
Authors: Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen, Xile Ma, Yi Shi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: practical deployment question, large language models, raises a practical, incoming request, rapid development
Comments: 6 pages, 1 table. Technical report
Click to view abstract
Abstract:The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.
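The hybrid protocol in this abstract (full-information offline fitting of one ridge regressor per arm, then bandit updates to the selected arm only) maps naturally onto textbook LinUCB. The pure-Python sketch below is an illustration under stated assumptions: the two-dimensional prompt features, the arm names `cheap` and `strong`, the reward values, and `alpha` are all hypothetical, and nothing here is taken from OrcaRouter's actual code.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: List[Vector], v: Vector) -> Vector:
    return [dot(row, v) for row in m]


class LinUCBArm:
    """One candidate model: ridge-regression statistics plus an upper confidence bound."""

    def __init__(self, dim: int, ridge: float = 1.0) -> None:
        # Inverse of A = ridge * I, maintained incrementally.
        self.a_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def update(self, x: Vector, reward: float) -> None:
        # Sherman-Morrison rank-one update of A^-1 after A <- A + x x^T.
        ax = matvec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
            self.b[i] += reward * x[i]

    def ucb(self, x: Vector, alpha: float) -> float:
        theta = matvec(self.a_inv, self.b)  # ridge-regression weights
        bonus = math.sqrt(max(dot(x, matvec(self.a_inv, x)), 0.0))
        return dot(theta, x) + alpha * bonus


def route(arms: Dict[str, LinUCBArm], x: Vector, alpha: float = 0.5) -> str:
    """Pick the arm with the highest upper confidence bound for this request."""
    return max(arms, key=lambda name: arms[name].ucb(x, alpha))


# Offline phase: every arm sees every routing prompt (full-information feedback).
prompts = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical features: "easy" and "hard"
reward_matrix = {"cheap": [1.0, 0.0], "strong": [0.6, 0.9]}  # hypothetical rewards
arms = {name: LinUCBArm(dim=2) for name in reward_matrix}
for name, rewards in reward_matrix.items():
    for x, r in zip(prompts, rewards):
        arms[name].update(x, r)

print(route(arms, [1.0, 0.0]), route(arms, [0.0, 1.0]))
```

At deployment time the same `update` method would be called only on the arm that `route` selected, once its reward is observed, which mirrors the bandit-feedback stage the abstract describes.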
104. 【2605.30727】MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Link: https://arxiv.org/abs/2605.30727
Authors: Alexander Gurung, Spandana Gella, Alexandre Drouin, Issam H. Laradji, Perouz Taslakian, Rafael Pardinas
Categories: Computation and Language (cs.CL)
Keywords: Deep research, agent external queries, research agents increasingly, external queries, Deep research agents
Comments:
Click to view abstract
Abstract:Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information from its local context. This risk is amplified by the mosaic effect, where individual queries may appear harmless but become revealing in aggregate. We introduce MosaicLeaks, a benchmark of 1,001 multi-hop deep research tasks that chain private enterprise documents and a public web corpus, forcing agents to make external queries that depend on local information. We evaluate leakage with an adversary LLM that observes only the agent's external queries and attempts to infer private information at three levels: the agent's research intent, answers to specific private questions and verifiable claims about the enterprise documents. We find that models across families and sizes frequently leak at all three levels, that zero-shot privacy prompting reduces but does not eliminate leakage and that reinforcement learning for task performance alone worsens leakage. To address this, we propose Privacy-Aware Deep Research (PA-DR), an RL framework that combines situational rewards for task success with a learned privacy classifier to provide dense credit assignment over both per-query and mosaic-level leakage. Training Qwen3-4B-Instruct with PA-DR improves accuracy from 48.7% to 58.7% and reduces answer and full-information leakage from 34.0% to 9.9%.
105. 【2605.30723】Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents
Link: https://arxiv.org/abs/2605.30723
Authors: Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li
Categories: Computation and Language (cs.CL)
Keywords: increasingly retrieve externally, retrieve externally curated, externally curated skills-procedural, curated skills-procedural instructions, skills-procedural instructions retrieved
Comments:
Click to view abstract
Abstract:LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model-dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose Model-Aware Skill Alignment (MASA), a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task-specific skills using hill climbing and UCB-driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.
106. 【2605.30717】Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models
Link: https://arxiv.org/abs/2605.30717
Authors: Zhiwen You, Nafiseh Nikeghbal, Jana Diesner
Categories: Computation and Language (cs.CL)
Keywords: produce gendered language, gendered language, neutral prompts, Language, language and stereotypes
Comments:
Click to view abstract
Abstract:Language models (LMs) can produce gendered language and stereotypes even when given neutral prompts. Most prior work on gender bias in LMs primarily examines gender through a binary lens (feminine vs. masculine), with limited attention to gender-neutral forms, such as they/them pronouns or neutrally phrased job titles. How gender-related signals are encoded in the internal representations of LMs remains an open question. In this work, we study gender-specific neurons in LMs across three categories: feminine, masculine, and gender-neutral. We propose a neuron-level intervention method to identify neurons that are strongly tied to each gender category. We then test these neurons through controlled generation, showing that activating or masking gender-related neurons can steer a sentence toward a target gender form while preserving its original meaning. To evaluate the effectiveness of our gender-intervention approach, we curate two datasets with controlled sentences labeled across all three gender categories and validate the data quality through human evaluation. Experiments on two open-source LMs show that gender-specific neurons are not evenly distributed across model layers; instead, they concentrate heavily in the earliest layers with smaller contributions from later layers. Compared to existing methods, our method achieves more precise gender control, with less leakage into non-target gender categories and stable output quality through two evaluation criteria. Overall, our work examines how gender is encoded in LMs and provides a simple yet effective approach toward controlled gender intervention for both neuron intervention evaluation and gender bias mitigation. Code and datasets are available at: this https URL
107. 【2605.30712】ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents
Link: https://arxiv.org/abs/2605.30712
Authors: Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Categories: Computation and Language (cs.CL)
Keywords: Large language model, shown strong capabilities, reuse successful strategies, Large language, agents have shown
Comments:
Click to view abstract
Abstract:Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.
108. 【2605.30711】SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs
Link: https://arxiv.org/abs/2605.30711
Authors: Sijia Wang, Dhanajit Brahma, Ricardo Henao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: newly extracted facts, principled write-side control, Spherical Adaptive Gate, merged with existing, existing memories
Comments:
Abstract:Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.
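The three-way routing described in the SAGE abstract (ADD, NOOP, or hand off to an LLM merge step) can be illustrated with a minimal sketch. This is not the authors' implementation: the kernel-style score, the fixed `kappa`, and the fixed thresholds `low` and `high` are hypothetical stand-ins for the paper's von Mises-Fisher density estimator and adaptive threshold.

```python
import math

def _unit(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def redundancy_score(candidate, memory, kappa=10.0):
    """Highest unnormalized vMF-style kernel exp(kappa * (cos - 1)) against stored embeddings.

    Values near 1 mean a stored memory points in almost the same direction;
    values near 0 mean the candidate is far from everything stored.
    """
    if not memory:
        return 0.0
    c = _unit(candidate)
    best = 0.0
    for stored in memory:
        cos = sum(a * b for a, b in zip(c, _unit(stored)))
        best = max(best, math.exp(kappa * (cos - 1.0)))
    return best

def route(candidate, memory, low=0.05, high=0.8):
    """Assumed fixed thresholds; the paper uses an adaptive one instead."""
    score = redundancy_score(candidate, memory)
    if score < low:
        return "ADD"      # clearly novel
    if score > high:
        return "NOOP"     # clearly redundant
    return "MERGE"        # uncertain: defer to the (expensive) LLM merge step

memory = [[1.0, 0.0], [0.0, 1.0]]
decisions = [route(c, memory) for c in ([1.0, 0.0], [-1.0, 0.0], [2.0, 1.0])]
```

Only the middle band reaches the LLM, which is the source of the write-time savings the abstract reports.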
109. 【2605.30693】Triaging Threats to Specialized Guardrails
Link: https://arxiv.org/abs/2605.30693
Authors: Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu, Sicong Jiang, Zihan Wang, Minglai Yang, Zhe Zhao, Muhao Chen
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: deploying Large Language, Large Language Models, Building robust safety, Large Language, diverse real-world applications
Comments:
Abstract:Building robust safety guardrails is essential for deploying Large Language Models across diverse real-world applications. However, this goal remains challenging because safety risks span heterogeneous threat domains, while existing datasets cover only fragmented risk subsets and rely on inconsistent taxonomies. Consequently, it remains unclear whether current guardrails can generalize beyond narrow evaluation settings. To better understand the robustness of guardrail models, we first introduce GuardZoo, a unified human-annotated benchmark with 32,460 samples covering 15 distinct unsafe categories. Evaluation on GuardZoo reveals that monolithic guardrails suffer from task interference: different threat domains require distinct decision boundaries that are difficult to compress into a single model. We therefore propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. Experiments show that RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports flexible modular expansion to emerging threats.
110. 【2605.30690】ElasticMem: Latent Memory as a Learnable Resource for LLM Agents
Link: https://arxiv.org/abs/2605.30690
Authors: Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You
Subjects: Computation and Language (cs.CL)
Keywords: reuse past experience, Long-term memory, personalize responses, extended interactions, past experience
Comments:
Abstract:Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at this https URL.
111. 【2605.30685】How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language
Link: https://arxiv.org/abs/2605.30685
Authors: Madeleine I. G. Daepp, Isaac Slaughter
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: people globally, Abstract, countries, globally, empirically characterize differences
Comments:
Abstract:AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.
112. 【2605.30675】Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty
Link: https://arxiv.org/abs/2605.30675
Authors: Kyle Moore, Jesse Roberts, Daryl Watson, William Ward, Grayson Heyboer
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: model behavioral analysis, large language model, language model behavioral, Uncertainty Quantification, behavioral analysis
Comments:
Abstract:Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the accuracy of uncertainty judgments to task efficacy. In this work, we investigate the relatively underexplored question of how similar large language model uncertainty is to human uncertainty. We investigate the presence and strength of human-similar uncertainty signals, deemed uncertainty alignment, in large language model overt behavior and internal activation patterns. We identify whether the models show evidence of simultaneous alignment and calibration on a variety of datasets covering both multiple choice and open ended factual recall. And we characterize the effect of instruct fine-tuning on each of these facets.
113. 【2605.30673】TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
Link: https://arxiv.org/abs/2605.30673
Authors: Yeil Jeong, Youngjin Yoo, Seobin Sohn, Hyejin Han, Jinseo Lee, Scott Howard, Unggi Lee
Subjects: Computation and Language (cs.CL)
Keywords: observable teaching practices, Classroom videos, signals are rarely, rarely organized, organized in forms
Comments:
Abstract:Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal teaching observation in classroom videos. TeachObs includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes. Seven researchers annotated each scene with 39 binary observation codes, covering 20 visual codes, such as gesture, board work, pointing, and visual materials, and 19 nonvisual codes, such as instruction, monitoring, questioning, feedback, and reflection. Gold segment labels are constructed using reliability- and prevalence-aware rules based on Krippendorff's alpha. In addition to segment-level labels, three expert raters produced lesson-level ratings and qualitative evaluations of instructional design, instructional delivery, learner response, learning materials, and lesson closure across the 30 lessons, with rater coverage detailed in the body. Using these two human reference layers, we evaluate five vision-capable frontier LLMs across three tracks - text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol - and find that no single model consistently outperforms others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that model evaluations over-rate procedurally clear lessons relative to expert raters. TeachObs therefore supports both fine-grained annotation benchmarking and whole-lesson evaluation, showing where AI systems can assist classroom video analysis and where expert judgment remains necessary across varied subjects, classroom formats, and annotation difficulty levels.
114. 【2605.30668】CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation
Link: https://arxiv.org/abs/2605.30668
Authors: Sijin Sun, Liangbin Zhao, Jiaxiang Cai, Ming Deng, Mingyu Luo, Xiuju Fu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Dialogue topic segmentation, human-AI collaborative applications, requires identifying heterogeneous, identifying heterogeneous boundary, Dialogue topic
Comments: 8 pages with appendix. Under review
Abstract:Dialogue topic segmentation is critical in many human-AI collaborative applications and requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. While CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and a pseudo-label setting with automatically induced boundaries, it performs enhanced boundary prediction without LLM calls during inference. Across five benchmarks, it improves $P_k$ and $W_d$ particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.
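For readers unfamiliar with the $P_k$ metric cited in this abstract, here is a minimal sketch of the standard windowed definition (lower is better). It is a generic illustration, not the paper's evaluation code; the per-utterance segment-id input format and the default window of half the mean reference segment length are assumptions.

```python
def pk(reference, hypothesis, k=None):
    """P_k segmentation error between two per-utterance segment-id sequences.

    A window of width k slides over the dialogue; an error is counted whenever
    the reference and the hypothesis disagree on whether the two window ends
    fall in the same segment.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    n = len(reference)
    if k is None:
        # Assumed default: half the average reference segment length.
        num_segments = 1 + sum(
            1 for i in range(1, n) if reference[i] != reference[i - 1]
        )
        k = max(1, n // (2 * num_segments))
    windows = n - k
    if windows <= 0:
        raise ValueError("sequence too short for window size k")
    errors = 0
    for i in range(windows):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / windows

gold = [0, 0, 0, 1, 1, 1]       # one topic shift after the third utterance
perfect = pk(gold, gold)         # identical segmentation
missed = pk(gold, [0] * 6)       # hypothesis predicts no boundary at all
```

The abstract reports changes in points, so its figures are presumably this fraction scaled by 100.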
115. 【2605.30654】EUDAIMONIA: Evaluating Undesirable Dynamics in AI
Link: https://arxiv.org/abs/2605.30654
Authors: Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: Large language models, traditional safety evaluations, Large language, emotional disclosure, partners for companionship
Comments:
Abstract:Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.
116. 【2605.30653】Counterfactual Graph for Multi-Agent LLM Calibration
Link: https://arxiv.org/abs/2605.30653
Authors: Jiatan Huang, Mingchen Li, Ziming Li, Sunjae Kwon, Hong Yu, Chuxu Zhang
Subjects: Computation and Language (cs.CL)
Keywords: Multi-agent LLM systems, systems often treat, panel give, LLM systems, CAGE-CAL
Comments:
Abstract:Multi-agent LLM systems often treat agreement as evidence: when many agents in a panel give the same answer, that answer is assumed to be more reliable. We show that this assumption can fail after agents communicate. Communication can induce correlated failures and false consensus, so the same vote share may reflect reliable agreement in one topology but over-confidence in another. We propose CAGE-CAL, a counterfactual agent-graph calibration framework for multi-agent LLMs. For each query, CAGE-CAL compares an observed post-communication agent graph with a matched counterfactual no-communication graph, capturing both pairwise failure correlations and group-level dependencies. Rather than simply counting how many agents agree, CAGE-CAL estimates the counterfactual shift between observed and no-communication dependence, and calibrates confidence accordingly. Across five benchmarks, CAGE-CAL improves reliability discrimination with competitive ECE, and its calibrated confidence further improves topology selection over the best fixed-topology strategy.
117. 【2605.30646】Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Link: https://arxiv.org/abs/2605.30646
Authors: Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal, Junaid Qadir
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Natural Language Inference, Large, clinical applications
Comments: 14 pages, 5 figures
Abstract:Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation ($\Delta C$), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.
118. 【2605.30641】COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
Link: https://arxiv.org/abs/2605.30641
Authors: Arya Fayyazi, Mehdi Kamal, Massoud Pedram
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, amplify societal biases, reveal and amplify, amplify societal
Comments: Proceedings of ICML 2026
Abstract:Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.
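The third stage named in the COFT abstract relies on split-conformal calibration. The sketch below shows plain, single-branch split conformal prediction over token candidates, as a rough illustration of how a user-chosen risk level `alpha` turns into a candidate token set. It does not reproduce COFT's dual-branch variant or its logit fusion; the nonconformity score `1 - p` and all function names are assumptions made for this example.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Finite-sample split-conformal quantile of the calibration scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; if that rank
    exceeds n, every candidate must be kept, so the threshold is infinite.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def candidate_set(token_probs, threshold):
    """Keep tokens whose nonconformity score (1 - probability) is within the threshold."""
    return {tok for tok, p in token_probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration scores: 1 - p(correct token) on held-out steps.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
q_hat = conformal_threshold(calibration, alpha=0.25)
step_set = candidate_set({"a": 0.5, "b": 0.45, "c": 0.05}, q_hat)
```

Under exchangeability of calibration and test steps, a set built this way contains the correct token with probability at least `1 - alpha` on average, which is the kind of marginal guarantee the abstract refers to.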
119. 【2605.30640】CSULoRA: Closest Safe Update Low-Rank Adaptation
Link: https://arxiv.org/abs/2605.30640
Authors: Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Low-rank adaptation, large language models, large language, small amounts, Low-rank
Comments: 10 pages, 3 figures
Abstract:Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned models. Existing safety-preserving LoRA methods often rely on hard interventions such as projection, pruning, thresholding, or additional training objectives. While these methods can suppress unsafe update directions, they may also remove task-relevant information or require extra tuning. We introduce CSULoRA, a post-hoc method for correcting trained LoRA adapters through closest safe update estimation. CSULoRA estimates a safety-aligned subspace from the weight displacement between a safety-aligned model and its corresponding base checkpoint. It then decomposes each LoRA update into fully aligned, partially aligned, and off-subspace components. Instead of discarding components outside the estimated safety subspace, CSULoRA solves a closed-form penalized minimum-change problem that preserves the fully aligned component while smoothly attenuating potentially unsafe directions according to their relative energy. In adversarial fine-tuning experiments, CSULoRA substantially reduces attack success rate while preserving most of the utility gains obtained from standard LoRA fine-tuning.
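The idea of attenuating, rather than discarding, the off-subspace part of an update can be shown on a toy problem. The sketch below is an assumption-laden simplification, not CSULoRA itself: the safety subspace is a single hypothetical direction, the penalty weight `lam` is fixed instead of depending on relative energy, and the objective solved is the generic one $\min_x \lVert x - u \rVert^2 + \lambda \lVert x_\perp \rVert^2$, whose closed form keeps the aligned component and scales the orthogonal component by $1/(1+\lambda)$.

```python
import math

def attenuate_off_direction(update, safe_direction, lam):
    """Closed-form minimizer of ||x - update||^2 + lam * ||x_perp||^2.

    x_perp is the part of x orthogonal to `safe_direction`. The aligned part
    of the update is preserved; the orthogonal part shrinks by 1 / (1 + lam).
    """
    norm = math.sqrt(sum(s * s for s in safe_direction))
    direction = [s / norm for s in safe_direction]
    coeff = sum(u * d for u, d in zip(update, direction))
    aligned = [coeff * d for d in direction]
    off = [u - a for u, a in zip(update, aligned)]
    return [a + o / (1.0 + lam) for a, o in zip(aligned, off)]

# Toy update with an aligned part (3) and an off-direction part (4).
corrected = attenuate_off_direction([3.0, 4.0], [1.0, 0.0], lam=1.0)
unchanged = attenuate_off_direction([3.0, 4.0], [1.0, 0.0], lam=0.0)
```

With `lam = 0` the update is returned as is, and as `lam` grows the result approaches a hard projection, which is the kind of intervention the abstract contrasts against.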
120. 【2605.30628】The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability
Link: https://arxiv.org/abs/2605.30628
Authors: Mikhail L. Arbuzov, Lee Mosbacker, Sisong Bei, Ziwei Dong, Dmitri Kalaev, Alexey Shvets
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Universal LLM reliability, Universal LLM, guarantee bounded residual, bounded residual error, knowledge sources
Comments: 25 pages, no figures
Abstract:Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe. They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem. We formalize this transition with two propositions and one corollary. Proposition 1 is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary 1 is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Proposition 2 is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates. The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.
121. 【2605.30611】Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
Link: https://arxiv.org/abs/2605.30611
Authors: Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: complex research ideas, communicating complex research, producing publication-quality illustrations, publication-quality illustrations remains, research ideas
Comments: 24 pages, 11 figures
Abstract:Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at this https URL.
122. 【2605.30608】Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures
Link: https://arxiv.org/abs/2605.30608
Authors: Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg
Subjects: Computation and Language (cs.CL)
Keywords: Learning a shared, shared representation, central to co-speech, remains challenging, communicative intent
Comments:
Abstract:Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion capturing physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches on text to gesture and gesture to text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.
123. 【2605.30604】An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations
Link: https://arxiv.org/abs/2605.30604
Authors: George Fatouros, Georgios Makridis, George Kousiouris, John Soldatos, Dimosthenis Kyriazis
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: enforces organization-level scope, cybersecurity workflows lack, Regulated cybersecurity workflows, regulated security operations, locally deployable
Comments: 8 pages, 3 figures
Abstract:Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert-driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization-scoped LLM agent runtime architecture for financial cybersecurity. The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first-class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human-in-the-loop (HITL) gates, and append-only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric-level pass criteria for architecture readiness, security-policy enforcement, evidence traceability, output quality, and operational observability.
124. 【2605.30599】AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis
Link: https://arxiv.org/abs/2605.30599
Authors: Saeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder, Nazli Goharian
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: continuously evolving, unlearning, Medical, Abstract, evolving
Comments:
Abstract:Medical knowledge is continuously evolving. This creates a need to update or selectively forget information encoded in already-trained medical LLMs. Machine unlearning aims to remove the influence of specific training data from a model without full retraining. Yet, existing unlearning benchmarks rely on synthetic or small-scale general data, leaving clinical unlearning understudied. We introduce AMNESIA, the first large-scale, open source benchmark for medical unlearning, with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. AMNESIA includes both factual questions testing direct recall and reasoning questions testing clinical inference. We use it to evaluate four widely used unlearning methods at both random patient and disease-level, and introduce a new metric for detecting leakage of medical terminology. We show that unlearning individual patients erodes knowledge of others with the same condition, calling for methods that can better separate patients from shared clinical knowledge.
125. 【2605.30590】Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
Link: https://arxiv.org/abs/2605.30590
Authors: Matt Turk
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: behave radically differently, Causal Sensitivity Score, patient inputs change, Consensus Match Score, rubrics yet behave
Comments: Accepted to RLEval @ ACM CAIS 2026 (Workshop on Methods and RL Environments for Evaluating AI Agents) and selected for an invited talk based on reviewer ratings. 4-page short paper + appendix
Abstract:Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.
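To make the {0, 0.5, 1.0} scoring scale in this abstract concrete, here is a minimal aggregation sketch. The abstract does not spell out how per-case scores are combined, so the plain per-family mean reported as a percentage, the family labels, and the example scores are all assumptions for illustration only.

```python
from collections import defaultdict

ALLOWED_SCORES = {0.0, 0.5, 1.0}  # scale quoted in the abstract

def css_by_family(case_scores):
    """Average per-case scores within each intervention family, as a percentage.

    `case_scores` is an iterable of (family, score) pairs, where score says
    whether the recommendation moved in the expected direction (1.0),
    partially (0.5), or not at all (0.0).
    """
    buckets = defaultdict(list)
    for family, score in case_scores:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score {score!r} is not on the 0 / 0.5 / 1.0 scale")
        buckets[family].append(score)
    return {fam: 100.0 * sum(vals) / len(vals) for fam, vals in buckets.items()}

# Hypothetical scores for two families; "D" stands for surgery-status changes.
example = [("A", 1.0), ("A", 0.5), ("D", 0.0), ("D", 0.5)]
summary = css_by_family(example)
```

A per-family breakdown like this is what makes a weakness on one intervention type visible even when the overall average looks acceptable.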
126. 【2605.30589】ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
Link: https://arxiv.org/abs/2605.30589
Authors: Nazarii Shportun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: law spans thousands, carry high stakes, immigration law spans, lack legal representation, USCIS Policy Manual
Comments: 12 pages, 4 tables. Dataset (17,058 QA pairs), fine-tuned model, and code are publicly released
Abstract:U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community QA -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.
127. 【2605.30582】AI for Monitoring and Classifying Data Used in Research Literature
Link: https://arxiv.org/abs/2605.30582
Authors: Rafael Macalaba, Aivin V. Solatorio
Categories: Computation and Language (cs.CL)
Keywords: Semantic Scholar track, Scholar track citations, Google Scholar, Semantic Scholar, comparable infrastructure exists
Comments:
Abstract:While platforms like Google Scholar and Semantic Scholar track citations for academic papers, no comparable infrastructure exists for monitoring dataset usage in research literature, leaving the landscape of data use largely opaque. Addressing this gap is critical for transparency, reproducibility, and monitoring of impact, yet progress is hindered by inconsistent citation practices, scarce labeled data, and ambiguous references to datasets in the wild. Traditional NLP approaches struggle with these challenges, motivating the shift toward more adaptive, semantically rich models. Building on prior work using LLMs for data mention detection and synthetic data for bootstrapping training, this paper presents an updated methodology for scalable dataset monitoring. We introduce a multitask GLiNER-based framework that jointly performs dataset mention extraction, relation identification, and usage-context classification. To address label scarcity, the pipeline leverages synthetic data generation to produce training examples and LLM-based revalidation to filter incorrect mentions and enforce labeling consistency, together improving reliability, coverage, and output consistency across the training pipeline. This work advances the development of open-source tools for monitoring data use in research literature, contributing to the broader goal of generalizable, unconstrained dataset citation tracking.
128. 【2605.30580】Speculative Decoding Across Languages
Link: https://arxiv.org/abs/2605.30580
Authors: Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: drafting multiple tokens, crucial component, drafting multiple, multiple tokens, tokens and verifying
Comments: 10 pages, 11 figures, submitted to ACL ARR May 2026
Abstract:Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective. We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation. We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.
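The draft-then-verify loop described in this abstract can be sketched as follows. This is a minimal greedy illustration, not the authors' code: `toy_target` and `toy_draft` are hypothetical next-token functions standing in for the large model and the small draft model, and verification is simulated with a Python loop where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1) Draft k tokens with the cheap model.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify the proposal against the target model.
        #    Counted as a single round (one batched pass in a real system).
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and end the round.
                accepted.append(expected)
                break
        else:
            # Every drafted token matched, so the target adds one more token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(toy_target, toy_draft, [0], max_new=12)
print(tokens, passes)
```

In this greedy setting the output matches plain decoding from the target model, so draft quality only affects how many verification rounds are needed. That is why a draft model with weak multilingual ability slows generation without changing the text.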
129. 【2605.30574】Probing the Prompt KV Cache: Where It Becomes Dispensable
Link: https://arxiv.org/abs/2605.30574
Authors: Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah
Categories: Computation and Language (cs.CL)
Keywords: compression schemes empirically, schemes empirically demonstrate, cache compression schemes, Prior KV cache, dropping or summarising
Comments:
Abstract:Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when and what kind of redundancy: at which layers, after how many decoding steps, and in what form can the prompt span KV cache be replaced without breaking the task. A controlled splice intervention swept over layer cutoff and decoding steps shows this redundancy is about form (chat template scaffolding) rather than content. Replacing the upper layer prompt span KV cache with KV cache from a chat template scaffold whose user content is a neutral filler recovers near clean accuracy, while zeroing the same slots collapses accuracy. The dissociation replicates across the Qwen3, Gemma 3, and Llama 3 families on multiple datasets.
130. 【2605.30568】Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge
Link: https://arxiv.org/abs/2605.30568
Authors: Zijie Wang, Eduardo Blanco
Categories: Computation and Language (cs.CL)
Keywords: rubric-based methods rely, scalable alternative, rely on human-annotated, human-annotated data, reference answers
Comments:
Abstract:LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained evaluation rubrics without any human annotation. Our training-free method generates rubrics at dataset-specific and instance-specific granularities, achieving performance competitive with existing methods across four benchmarks. We further present a method that iteratively fine-tunes a rubric generator model via meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation. Notably, a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation, showing the effectiveness of our fine-tuning strategy.
131. 【2605.30557】Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Link: https://arxiv.org/abs/2605.30557
Authors: Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: deployed in real-world, real-world environments, fundamental capability, capability for vision-language, Spatial reasoning
Comments: Website: https://zhangyuejoslin.github.io/spatialuncertain/
Abstract:Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
132. 【2605.30545】Refining Word-Based Grammatical Error Annotation for L2 Korean
Link: https://arxiv.org/abs/2605.30545
Authors: Jungyeul Park, Kyungtae Lim, Wonjun Oh, Benjamin Nguyen, Zihao Huang, Mengyang Qiu, Jayoung Song
Categories: Computation and Language (cs.CL)
Keywords: Korean, presents a structural, structural mismatch, Korean GEC, K-GEC
Comments:
Abstract:Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they encode grammatical relations that must be represented in correction and evaluation. This paper refines word-based grammatical error annotation for L2 Korean by addressing three connected problems in existing resources: surface target realization, Korean-specific edit annotation, and single-reference evaluation. We reconstruct target sentences from the National Institute of Korean Language (NIKL) L2 corpus under morphologically constrained realization rules and convert its morpheme-level annotations into word-level m2 edits. We then define a Korean ERRANT-style annotation scheme that preserves the MRU core while distinguishing functional morpheme errors, spelling errors, word boundary errors, and word order errors. We also augment the KoLLA corpus with an additional reference correction, yielding a multi-reference evaluation setting for Korean GEC. Empirical validation shows that the refined NIKL targets yield lower perplexity, the converted m2 files achieve higher agreement with source-target edit representations, and the refined resources improve KoBART-based correction under the same model setting. Multi-reference KoLLA evaluation further reduces the penalty imposed on valid corrections that diverge from a single reference, especially for neural and prompted GEC systems. These results show that Korean GEC evaluation depends not only on correction models, but also on reference data and edit annotations that reflect Korean morphology, spacing, and correction variability.
133. 【2605.30529】Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages
Link: https://arxiv.org/abs/2605.30529
Authors: David Rey-Blanco, Roberto Cruz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Sentence-embedding models, semantic search, search are overwhelmingly, overwhelmingly developed, developed and evaluated
Comments: 24 pages, 12 figures, 6 tables
Abstract:Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.
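The retrieval metrics quoted in this abstract (MRR, R@3, R@5) can be computed along the lines of the sketch below. It assumes the common textbook definitions with a set of relevant codes per query; the abstract does not state the paper's exact definitions, and the code identifiers (`c1`, `c2`, ...) are made-up placeholders rather than real ICD-10 codes.

```python
from typing import Hashable, List, Sequence, Set, Tuple

Run = Tuple[Sequence[Hashable], Set[Hashable]]  # (ranked results, relevant set)


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def evaluate(runs: List[Run], k: int) -> Tuple[float, float]:
    """Return (MRR, mean Recall@k) averaged over all queries."""
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
    return mrr, recall


# Hypothetical ranked outputs for three queries.
runs: List[Run] = [
    (["c1", "c2", "c3"], {"c1"}),        # relevant code ranked first
    (["c4", "c5", "c6"], {"c5", "c9"}),  # one of two relevant codes at rank 2
    (["c7", "c8", "c9"], {"c0"}),        # relevant code not retrieved
]
mrr, recall_at_2 = evaluate(runs, k=2)
print(mrr, recall_at_2)
```

MRR rewards placing the first relevant code near the top, while Recall@k measures coverage within a fixed cutoff. A reranker can therefore move the two metrics by different amounts, as in the per-language results reported above.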
134. 【2605.30526】Measuring, Localizing, and Ablating Alignment Signatures in LLMs
Link: https://arxiv.org/abs/2605.30526
Authors: Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür, Nick Feamster
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: remains poorly understood, representations remains poorly, poorly understood, Post-training Alignment Signature, internal representations remains
Comments:
Abstract:Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies AI-like stylistic regularities and whether these regularities have a localized internal signature. To this end, we compare human text, base-model generations, and aligned-model generations under matched human-source prefixes. Aligned generations show lower human-corpus affinity and higher AI-detection rates than base generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible AI-like text. We then introduce PASTA (Post-training Alignment Signature Targeted Ablation), a training-free method that estimates a post-training alignment signature from aligned-base residual contrasts and ablates the corresponding direction during decoding. Across 11 aligned models and 6 AI detectors, PASTA lowers the detection rate for most aligned models; this effect transfers well across detectors and is not reproduced by random directions. Qualitative analysis suggests that PASTA generations remain relevant and coherent while exhibiting greater stylistic variation. Together, these results show that AI-like stylistic effects of post-training can be measured, localized, and causally tested through activation ablation.
135. 【2605.30523】Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Link: https://arxiv.org/abs/2605.30523
Authors: Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Keywords: Recent work describes, lack exact characterizations, Recent work, existing results lack, results lack exact
Comments:
Abstract:Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.
136. 【2605.30521】Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
Link: https://arxiv.org/abs/2605.30521
Authors: David Gros, Adam Gleave
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, frequently process untrusted, adversarial pressure, process untrusted inputs
Comments:
Abstract:Large language models must frequently process untrusted inputs, such as judging an answer from another model or running tasks like spam and harm classifiers while under adversarial pressure. These inputs are often string-formatted directly into a prompt template, leaving systems fragile to manipulation. Current LLM specs from major providers like OpenAI distinguish trustworthiness along an Instruction Hierarchy, from System messages (most trusted) to Tool Results (least trusted). A possible natural mitigation is to wrap untrusted content in a mock tool call as a quarantine. We explore this hypothesis with an automated redteaming search over static attack strings across seven models and three LLM-as-a-Judge tasks. Counter to our hypothesis, tool-wrapping does not broadly improve robustness. On a binary evaluation task (GSM8K grading) it typically increases attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent, with no tested model reliably helped, and several showing inversion. We recommend evaluating this limitation in deployed systems, and longer-term, pursuing stronger Instruction Hierarchy training or new untrusted-input primitives.
137. 【2605.30514】MAAT: Multi-phase Adapter-Aware Targeted Unlearning
Link: https://arxiv.org/abs/2605.30514
Authors: Suryash Yagnik, Shubham Gaur, Saksham Thakur, Vinija Jain, Aman Chadha, Amitava Das
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Machine unlearning evaluation, Machine unlearning, structurally skewed, Why-type questions, MUSE
Comments: 16 pages, 4 figures, 10 tables
Abstract:Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. less than or equal to 2% for others) and gradient dilution over 40.1-token answer spans. We present MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.
138. 【2605.30504】Auditing LLM Benchmarks with Item Response Theory
Link: https://arxiv.org/abs/2605.30504
Authors: Sander Land, Daniel M. Bikel
Categories: Computation and Language (cs.CL)
Keywords: LLM benchmark labels, LLM benchmark, Item Response Theory-based, frozen at release, release and silently
Comments:
Abstract:LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
139. 【2605.30501】Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs
Link: https://arxiv.org/abs/2605.30501
Authors: Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen, Runcong Zhao
Categories: Computation and Language (cs.CL)
Keywords: embeds statistical signatures, signatures in AI-generated, AI-generated text, Watermarking embeds statistical, watermarks trivially fail
Comments:
Abstract:Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers. We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution with up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (below the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on the long sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
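The averaging argument in this abstract can be illustrated with a toy simulation. The sketch below is not WASH itself. It assumes a simple green-list style watermark (a logit boost `delta` on a random half of the vocabulary) and, as a strong simplification, that every provider perturbs the same underlying distribution; the vocabulary size, `delta`, and the number of providers are arbitrary illustrative choices.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def watermarked(logits: List[float], rng: random.Random,
                delta: float = 2.0) -> List[float]:
    """Boost a random 'green' half of the vocabulary by delta (toy scheme)."""
    green = set(rng.sample(range(len(logits)), len(logits) // 2))
    boosted = [x + delta if i in green else x for i, x in enumerate(logits)]
    return softmax(boosted)


def l1(p: List[float], q: List[float]) -> float:
    """L1 distance between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


rng = random.Random(0)
base_logits = [rng.gauss(0.0, 1.0) for _ in range(50)]
clean = softmax(base_logits)

# Five hypothetical providers, each with an independent green list.
providers = [watermarked(base_logits, rng) for _ in range(5)]

# Linear ensemble: average the output probabilities token by token.
ensemble = [sum(col) / len(providers) for col in zip(*providers)]

single_dists = [l1(p, clean) for p in providers]
ensemble_dist = l1(ensemble, clean)
print(sum(single_dists) / len(single_dists), ensemble_dist)
```

By convexity of the L1 norm, the ensemble is never further from the clean distribution than the average single provider. With independent green lists it is typically much closer, which is the intuition behind the cancellation the authors describe.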
140. 【2605.30497】CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law
Link: https://arxiv.org/abs/2605.30497
Authors: Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead, Vered Shwartz
Categories: Computation and Language (cs.CL)
Keywords: potentially undermines justice, RAG-based legal assistants, LLM hallucinations remain, growing in popularity, undermines justice
Comments:
Abstract:RAG-based legal assistants have been growing in popularity, but LLM hallucinations remain a key issue and potentially undermine justice. While benchmarks have been developed to evaluate progress, many rely on synthetic queries rather than realistic legal scenarios. Moreover, Canadian law remains underrepresented in existing evaluations. To address this gap, we introduce CanLegalRAGBench, a Canadian legal QA benchmark based on realistic queries and expert-annotated answers grounded in case law. Our evaluation shows that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source models. However, it also reveals the limitation of automatic evaluations that penalize systems for retrieving alternative relevant documents. We also find that generated answers often diverge from gold responses, either with hallucinations or by producing overly detailed or irrelevant content, with 8-29% of claims not being supported by the retrieved documents. We hope this benchmark will help drive continued progress in addressing limitations of legal RAG systems.
141. 【2605.30487】Configurable Reward Model for Balanced Safety Alignment
Link: https://arxiv.org/abs/2605.30487
Authors: Zhengping Jiang, Mehran Khodabandeh, Akash Bharadwaj, Manik Bhandari, Mayur Srungarapu, Anqi Liu, Benjamin Van Durme, Li Chen
Categories: Computation and Language (cs.CL)
Keywords: Aligning large language, large language models, Aligning large, Safety Reward Model, rapidly evolving safety
Comments:
Abstract:Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications. We introduce the Configurable Safety Reward Model (CSRM), which is jointly optimized for calibrated safety compliance and reward modeling. Our approach is supported by configuration-targeted data augmentation that enforces instruction adherence while preserving relative severity structure. The resulting RM is sensitive to fine-grained safety configurations and conversational nuances, substantially improving generalization to previously unseen safety configurations. CSRM achieves state-of-the-art performance on recent configurable safety benchmarks, including CoSApien (94.6% F1) and DynaBench (75.8% F1), without requiring additional human annotation. When used for downstream safety alignment, CSRM yields LLMs with a significantly improved helpfulness-safety tradeoff compared to existing baselines.
142. 【2605.30481】When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models
Link: https://arxiv.org/abs/2605.30481
Authors: Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana, Syed Ishtiaque Ahmed
Categories: Computation and Language (cs.CL)
Keywords: Large language models, cross-lingual knowledge interfaces, Large language, knowledge interfaces, perspective coverage
Comments: Submitted to ARR
Abstract:Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as global narrative dominance in Bangla, a low-resource cultural context. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla-English question-answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.
143. 【2605.30478】Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback
Link: https://arxiv.org/abs/2605.30478
Authors: Egor Skopin, Evgeny Kotelnikov
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: programmatically checkable signals, trains language models, Reinforcement learning, enabling direct optimization, Python code generation
Comments: Accepted for AINL-2026 conference
Abstract:Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations (unit-test-only rewards, static-analysis-only shaping via the Ruff linter, and a combined reward), we compare group-based policy optimization variants (GRPO and GSPO) and evaluate both functional correctness and behavioral diagnostics. In our experimental setting, RLVR improves pass@1 on MBPP test by up to 13 percentage points under the proposed combined reward configuration. However, we find that reward shaping can induce systematic behavioral shifts: using only static-analysis penalties may bias the policy toward shorter completions that reduce lint errors without reliably improving functional correctness. In contrast, combined rewards mitigate this degeneration and yield more stable trade-offs between correctness and style constraints. Overall, our results highlight that RLVR effectiveness for code generation is highly sensitive to reward design and optimization granularity, and that diagnostics beyond pass@1, including generation length, Ruff severity profiles, and execution error types, are useful for identifying failure modes.
144. 【2605.30472】Your Multimodal Speech Model Says I Have a Face for Radio
Link: https://arxiv.org/abs/2605.30472
Authors: Maya K. Nachesa, Vlad Niculae, Vagrant Gautam
Categories: Computation and Language (cs.CL)
Keywords: increasingly building multi, language tasks, researchers are increasingly, building multi, large neural models
Comments:
Abstract:As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to audio-visual data for noise mitigation and multimodal subtitling. While performance and bias have been studied extensively in the single-modality regime, it is unknown how new modalities affect this, even though they produce biases in humans. We therefore propose the first bias evaluation of multimodal speech recognition, where we create videos pairing different faces with the same audio, and measure changes in speech transcription accuracy. We find large quality-of-service differences across mWhisper-Flamingo and Gemini models, with drops of up to 4.05 word error rate points, across self-declared gender, ethnicity, and their intersection. Our findings point to a priority for developers to evaluate, fix, and communicate such limitations, as providing more signals through additional modalities is not necessarily better, and may even lead to biased outcomes.
145. 【2605.30465】Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study
Link: https://arxiv.org/abs/2605.30465
Authors: Shahana Akter, Yatharth Vohra, Ankita Shukla, Souvika Sarkar
Categories: Computation and Language (cs.CL)
Keywords: Multi-label topic classification, labeled training data, challenging task, Multi-label topic, labeled training
Comments: 15 pages, 1 figure, ACL format. This paper proposes a KG-augmented zero-shot multi-label topic classification framework and evaluates multiple strategies
Abstract:Multi-label topic classification without labeled training data is a challenging task, especially when documents contain complex relational information. We present a zero-shot multi-label topic classification framework and systematically investigate how per-article knowledge graph augmentation affects its performance. The base framework classifies topics in documents without labeled training data and has four variants: article-only classification, keyword-enhanced classification, and self-consistency decoding variants of both. Then, we augment each base variant with a per-article knowledge graph. This graph is extracted from the input document through a pipeline similar to KGGen based on subject-predicate-object triples. We test all eight methods, four base and four graph-augmented, on fifteen LLMs and eight multi-label datasets across different domains. For the base framework, keyword-enhanced classification (AK) is the best-performing method, and six out of fifteen LLMs surpass the sentence-encoder baseline. Graph augmentation has positive and negative impacts on small and large models, respectively. This shows that larger models already contain enough relational information from pretraining. Furthermore, the self-consistency decoding variant does not show performance improvements in any experiment while increasing computation costs about fivefold.
146. 【2605.30459】Can LLM Teams Play What? Where? When?
Link: https://arxiv.org/abs/2605.30459
Authors: Anastasia Kotelnikova, Viktor Byzov, Maria Dolzhenkova, Evgeny Kotelnikov
Categories: Computation and Language (cs.CL)
Keywords: Large language models, coordinated hypothesis testing, tasks requiring indirect, Large language, requiring indirect reasoning
Comments: Accepted for Dialogue-2026 conference
Abstract:Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team-based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning. We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales). To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025. Using six recent large-scale open models, we show that team-based strategies outperform single-model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy, and approaches human team performance on questions with available human statistics. Analysis of inter-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self-preference bias; access to peer rationales improves captain judgments. Overall, LLM teams function primarily as answer selection and error-filtering mechanisms rather than generators of novel solutions. Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi-agent systems.
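As a rough illustration of the simplest of the three team strategies (Voting), the sketch below picks the most frequent answer among team members. The normalization step and the first-seen tie-break are assumptions made for this example; the paper's actual aggregation rule is not given in the abstract.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent normalized answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(normalize(a) for a in answers)
    # max() returns the first maximal key in iteration order,
    # and Counter iterates in insertion (first-seen) order.
    return max(counts, key=counts.get)


team_answers = ["Paris", " paris", "Rome", "Lyon"]
final_answer = majority_vote(team_answers)
```

The Silent Team and Talkative Team strategies replace this fixed rule with a captain model that reads the answers (and, in the latter case, the rationales) before choosing.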
147. 【2605.30448】Bounded Behavioral Indistinguishability for Black-Box LLM Distillation
Link: https://arxiv.org/abs/2605.30448
Authors: Munawar Hasan
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: output-matching problem, Black-box LLM distillation, considered successful, responses are semantically, semantically similar
Abstract:Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(\epsilon,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $\epsilon$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
148. 【2605.30443】Cross-Lingual Steering for Figurative Language Generation
Link: https://arxiv.org/abs/2605.30443
Authors: Linfeng Liu, Tiffany Zhan, Louie Hong Yao, Saptarshi Ghosh, Tianyu Jiang
Categories: Computation and Language (cs.CL)
Keywords: large language models, internal signals driving, Multilingual large language, models can generate, Multilingual large
Comments: 40 pages, 7 figures
Abstract:Multilingual large language models can generate figurative language, but whether the internal signals driving this behavior are language-specific or reusable across languages is unclear. Using activation steering as a probe, we estimate a direction for a figurative category from figurative--literal activation differences in one language and apply it during generation. Across five figurative categories, six languages, and four multilingual LLMs, these directions steer reliably within their own language, most robustly for metaphor and simile. More importantly, they transfer across languages: a direction learned in one increases the target behavior when applied to another, with German among the most receptive targets. Going further, directions assembled from other languages can match or even surpass a target language's own native direction, while removing this shared component weakens native steering. Together, these results provide direct evidence of a reusable but target-dependent cross-lingual signal for figurative generation.
149. 【2605.30434】LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Link: https://arxiv.org/abs/2605.30434
Authors: Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Keywords: leaving agents' ability, long horizons untested, track evolving analytical, evolving analytical context, inherently iterative
Comments: Ongoing work
Abstract:Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at this https URL.
150. 【2605.30415】Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology
Link: https://arxiv.org/abs/2605.30415
Authors: Francesco De Bernardis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Phase, historical cosmology, small language model, reshapes explanatory behavior, explanatory
Comments: 17 pages, 3 figures
Abstract:We investigate how domain adaptation reshapes explanatory behavior in language models using historical cosmology as a controlled setting. In Phase 1, we train a small language model from scratch on a pre-Copernican corpus from which explicit heliocentric references were removed, and evaluate whether Earth-motion or heliocentric continuations nevertheless emerge. In Phase 2, we fine-tune a larger pretrained model using QLoRA on the same corpus in order to study how adaptation modifies explanatory framing and cosmological stance. Model outputs are evaluated using an LLM-as-judge framework that labels both cosmological stance (geocentric, heliocentric, or ambiguous) and explanatory frame (premodern versus modern). In the constrained setting of Phase 1, the smaller models occasionally generate local Earth-motion continuations, but these remain globally unstable and insufficient to support coherent cosmological reasoning. In Phase 2, fine-tuning induces a large and statistically significant shift toward premodern explanatory framing, while the conditional cosmological stance distributions remain comparatively stable within those frames. As a result, increases in geocentric outputs arise primarily from redistribution over explanatory regimes rather than from direct modification of stance. These results suggest that domain adaptation may primarily reshape the linguistic frameworks from which continuations are generated, with changes in stance emerging secondarily from those shifts.
151. 【2605.30407】Exploring Autonomous Agentic Data Engineering for Model Specialization
Link: https://arxiv.org/abs/2605.30407
Authors: Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, high-quality domain-specific data, demonstrated strong performance, Language Models
Comments: Work in progress
Abstract:Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at this https URL.
152. 【2605.30400】Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow
Link: https://arxiv.org/abs/2605.30400
Authors: Ahmed Abdeen Hamed, Luis M. Rocha
Categories: Computation and Language (cs.CL)
Keywords: evaluate ChatGPT ability, generate disease-centric biomedical, disease-centric biomedical associations, generate disease-centric, disease-centric biomedical
Comments: Main Manuscript and Supplementary Information. Both are equally important
Abstract:We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models. To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.
153. 【2605.30391】Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate
Link: https://arxiv.org/abs/2605.30391
Authors: Tom Pecher
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Argumentative Theory, isolated individual cognition, Human reasoning, collective adversarial discourse, operate socially
Comments: Master's thesis
Abstract:Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.
154. 【2605.30965】ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
Link: https://arxiv.org/abs/2605.30965
Authors: Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: including sound effects, Recent advancements, text-guided audio generation, yielded promising results, diverse domains
Comments: Accepted to ACL 2026 main conference. Code is available at [this https URL](https://github.com/jjunak-yun/ImmersiveTTS)
Abstract:Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.
155. 【2605.30743】A Padding Method for Enhanced Encoding of Inorganic Structures with Varying Chemical Compositions
Link: https://arxiv.org/abs/2605.30743
Authors: Thang Dang, Haderbache Amir, Tzanakakis Alexandros, Yoshimoto Yuta
Categories: Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Keywords: expansive chemical compositions, inorganic, inorganic materials, structural landscape, generative models remains
Abstract:Designing novel inorganic materials through generative models remains an important challenge for material science, driven by the complexity and diversity of inorganic structures across expansive chemical compositions and structural landscapes. The vast combinatorial space of inorganic compounds demands innovative, AI-driven approaches to overcome limitations in generative accuracy and efficiency. To address this, we introduce a novel method that redefines the encoding and generation of inorganic materials by utilizing a domain-specific symmetry-aware representation. Our approach not only refines the representation of intricate inorganic structures but also contributes to the field of material discovery by enhancing the precision and stability of generated candidates. Central to our methodology is a novel padding technique that exploits crystal symmetry information to enhance the encoding process. By integrating Wyckoff position length-aware padding into an encoder architecture, we achieve a more robust, informed representation of inorganic materials. This symmetry-driven enhancement helps deep learning models generate stable, previously unexplored inorganic structures with superior accuracy and computational efficiency. Furthermore, we introduce an end-to-end system that leverages machine learning potential models to generate novel (including those unseen in the training data) and stable inorganic materials, from initial data to validated output. This pipeline integrates advanced generative models with stability analysis, marking a significant leap forward in the automated exploration and design of next-generation inorganic materials. Our method improved reconstruction accuracy by 5.3% on proton conductor data and generated 63.5% more novel stable inorganic materials than the baseline model on the perov-5 dataset.
156. 【2605.30457】Extracting accent features in spoken Brazilian Portuguese without sociolinguistic labels
Link: https://arxiv.org/abs/2605.30457
Authors: Pedro H. L. Leite, Pedro Benevenuto Valadares, Luiz W. P. Biscainho
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Keywords: Brazilian Portuguese, classification in Brazilian, Regional accent classification, Portuguese, reliable labeling
Comments: This work was submitted to the XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026)
Abstract:Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.
Information Retrieval
1. 【2605.31575】SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics
Link: https://arxiv.org/abs/2605.31575
Authors: Eric Liang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Scalable information retrieval, information retrieval testing, collections remain expensive, Scalable information, human-judged test collections
Abstract:Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections through a separation of latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries. In the local simulation study, generation remained close to linear at roughly 12K to 14K documents per second, estimated Zipf slopes stayed near 0.86 in absolute value, and increasing cross-topic distractor text reduced BM25 nDCG@10 from 1.00 at 2% distractors to 0.43 at 36% distractors. These results show that lightweight synthetic corpora can expose retrieval-system scaling and failure modes before costly collection construction begins.
2. 【2605.31555】Effects of Vertex Merging Splitting on Large Coauthorship Networks: A Counterfactual Analysis
Link: https://arxiv.org/abs/2605.31555
Authors: Jinseok Kim
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Keywords: Researchers analyze coauthorship, Researchers analyze, network data remains, distorting network properties, coauthorship networks
Comments: 12 pages, 3 figures, 2 tables, ComplexNetworks2025
Abstract:Researchers analyze coauthorship networks, but author name ambiguity in their network data remains a significant challenge as it can change the number of vertices, distorting network properties. Although many scholars use straightforward heuristics for author name disambiguation using author's forename initials, these techniques can skew our understanding of network properties by merging or splitting vertices, raising concerns about the reliability and validity of these methods. This study investigates how different levels of vertex merging and splitting errors that are induced by name ambiguity impact network measures, using three large coauthorship networks with highly accurate algorithmic author name disambiguation. As a counterfactual scenario, two initial-based disambiguation methods widely used in coauthorship network research were applied to these datasets. Nine coauthorship network metrics were computed while varying randomly the numbers of merged or split vertices. Results show that initial-based disambiguation generates coauthorship networks with specific network properties underestimated, leading to the discovery of coauthorship networks that are smaller and more closely connected than they genuinely are. In contrast, other network metric values increase, making authors appear more collaborative and embedded within less fragmented research communities than they are. The study emphasizes the importance of careful disambiguation of vertex names in analyzing coauthorship networks for rigorous and valid findings.
3. 【2605.31506】Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy
Link: https://arxiv.org/abs/2605.31506
Authors: Michael R. DeMarco
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Retrieval-Augmented Generation, current industry standard, current industry, Expert Blindness Effect, real-world facts
Comments: 15 pages, 7 tables. Preliminary findings; Experiment 3 identified as future work
Abstract:Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content based on how closely it sounds like the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence in favor of lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the proportion of verified atomic claims relative to total token count. Using the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis to filter content before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Implementing Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. While full statistical validation across n=50 queries remains future work due to constraints on corpus-benchmark alignment, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.
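The abstract defines factual density as verified atomic claims relative to token count and removes the length confound with Z-score normalization inside length bins. The sketch below shows one way that could look; the fixed-width binning, the `bin_width` value, and all function names are assumptions for illustration and do not describe the paper's pipeline.

```python
from collections import defaultdict
from statistics import mean, pstdev


def factual_density(verified_claims: int, token_count: int) -> float:
    """Proportion of verified atomic claims relative to total token count."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return verified_claims / token_count


def binned_z_scores(docs: list[tuple[int, int]], bin_width: int = 100) -> list[float]:
    """Z-score each document's density against documents of similar length.

    docs holds (verified_claims, token_count) pairs. Documents are grouped into
    fixed-width token-count bins (an assumed binning scheme), and densities are
    standardized within each bin so that length alone does not drive the score.
    """
    bins = defaultdict(list)
    for index, (claims, tokens) in enumerate(docs):
        bins[tokens // bin_width].append((index, factual_density(claims, tokens)))

    scores = [0.0] * len(docs)
    for members in bins.values():
        values = [density for _, density in members]
        mu, sigma = mean(values), pstdev(values)
        for index, density in members:
            # A bin with a single document or identical densities gets a neutral score.
            scores[index] = (density - mu) / sigma if sigma > 0 else 0.0
    return scores


# Two short and two long documents: raw density favors the short ones,
# but within each length bin the denser document ranks higher.
example_docs = [(5, 50), (10, 50), (10, 500), (20, 500)]
example_scores = binned_z_scores(example_docs)
```

With this normalization, the denser long document scores as well as the denser short one, even though its raw density is five times lower.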
4. 【2605.31414】Beyond Instance-Level Alignment and Uniformity: Semantic Factor Learning for Collaborative Filtering
Link: https://arxiv.org/abs/2605.31414
Authors: Yajie Yu, Chenzhong Bin, Zhoubo Xu, Zhixin Zeng, Tongxin Xu, Cihan Xia, Jiafeng Wu
Categories: Information Retrieval (cs.IR)
Keywords: Collaborative filtering, recommender systems, Semantic, Semantic Factor, semantic factors
Comments: Accepted by KDD 2026
Abstract:Collaborative filtering (CF) is widely used in recommender systems (RecSys) due to its simplicity and efficiency. However, existing CF methods follow an instance-level learning paradigm. During the instance learning stage, a large number of uninteracted user-item instances, whose items may actually interest the user, are incorrectly treated as true negative samples, which severely limits the generalization and scalability of models. Moreover, mainstream graph convolutional networks (GCNs) inherently suffer from high computational cost and over-smoothing issues, which limit their ability to capture higher-order connectivity and lead to poor generalization under sparse supervision signals. To address the above limitations, we propose Semantic Factor enhanced Alignment and Uniformity (SaFeAU), a novel framework that augments interacted instances with semantic factors, thereby mitigating false negative labeling and enabling matrix factorization (MF) to capture high-order CF signals without graph neighborhood aggregation. Specifically, SaFeAU consists of three tightly coupled components. First, Semantic Factor Routing (SFR) disentangles item representations into independent and global semantic factors. Building on these factors, Semantic Factor Matching (SFM) identifies uninteracted items, which share the same semantic factors with interacted ones, as potential positive pairs for enriching sparse supervision signals. Finally, Semantic Pairs Alignment (SPA) aligns both observed and potential positive pairs while promoting uniformity of user and item representations. Extensive experiments on four sparse real-world datasets show that SaFeAU consistently outperforms GCN-based and MF-based state-of-the-art CF methods in both recommendation accuracy and computational efficiency, confirming the effectiveness of the proposed semantic enhanced learning paradigm.
5. 【2605.31377】DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval
Link: https://arxiv.org/abs/2605.31377
Authors: Siyuan Qi, Xinyuan Wang, Yingxuan Yang, Haochuan Guo, Jianghao Lin, Weiwen Liu, Yong Yu, Weinan Zhang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: short-horizon inference loops, high inference cost, Agentic Retrieval-Augmented Generation, Retrieval-Augmented Generation improves, Retrieval-Augmented Generation
Abstract:Agentic Retrieval-Augmented Generation improves retrieval by integrating planning, tool use, and iterative reasoning, but existing agentic RAG methods often couple semantic expansion with retrieval decisions in short-horizon inference loops, leading to high inference cost and limited suitability for time-sensitive news retrieval. We propose DynaTree, a two-stage framework for efficient and adaptive news retrieval. In the offline stage, DynaTree uses coordinated agents to construct a reusable retrieval tree that materializes the semantic space of a query topic. In the online stage, DynaTree performs lightweight daily subtree selection over a time-localized evaluation proxy, without further agentic reasoning, tree modification, or retraining. Experiments on a multi-day Syft news benchmark and multiple BEIR datasets show that DynaTree achieves strong recall and ranking performance, consistently outperforming standard RAG and prior agentic baselines. We further deploy DynaTree in the Syft production system and evaluate it through online A/B testing from Jan. 28 to Feb. 6, 2026. The dynamically adapted variant improves survival rate from 0.32-0.53 to 0.59-0.73 over a fixed offline-selected subtree and outperforms existing production recallers on every evaluation day. These results show that persistent, structure-aware semantic expansion can translate offline agentic reasoning into practical improvements in coverage, freshness, and relevance for real-world news retrieval.
6. 【2605.31295】Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation
Link: https://arxiv.org/abs/2605.31295
Authors: Ioannis Prokopiou,Pantelis Vikatos,Maximos Kaliakatsos-Papakostas,Theodoros Giannakopoulos,Themos Stafylakis
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: complex symbolic sequences, significant gap remains, Multitrack Music Transformer, Transformer-based architectures, symbolic sequences
Comments: Accepted at EUSIPCO 2026 (34th European Signal Processing Conference), 5 pages, 2 figures
Abstract:Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
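The two geometric steps named in this abstract, a Difference-in-Means direction and Gram-Schmidt orthogonalization of a second direction against the first, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the toy 3-dimensional "activations", the helper names (`diff_mean`, `orthogonalize`, `steer`), and the steering strengths are hypothetical and are not taken from the paper or from the Multitrack Music Transformer code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def diff_mean(pos, neg):
    """Difference-in-means direction: mean(pos activations) - mean(neg activations)."""
    dim = len(pos[0])
    mean_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def orthogonalize(v, against):
    """One Gram-Schmidt step: remove from v its component along `against`."""
    scale = dot(v, against) / dot(against, against)
    return [a - scale * b for a, b in zip(v, against)]

def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state (inference-time steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical toy activations for two attributes.
pitch_dir = diff_mean([[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]],
                      [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
duration_dir = diff_mean([[1.0, 2.0, 1.0]], [[0.0, 0.0, 0.0]])

# Decouple the second direction from the first before combining them.
duration_orth = orthogonalize(duration_dir, pitch_dir)

hidden = [0.5, -0.5, 1.0]
steered = steer(steer(hidden, pitch_dir, 1.5), duration_orth, 2.0)
```

In this toy setup the raw duration direction has a non-zero inner product with the pitch direction, so adding it naively would also move the hidden state along the pitch axis. After orthogonalization, the shift measured along `pitch_dir` depends only on the pitch strength.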
7. 【2605.31291】Contextual Scalarisation Thompson Sampling for multi-objective decisions in public media
Link: https://arxiv.org/abs/2605.31291
Authors: Théo Maëtz,Luc Guillet,Andrea Cavallaro
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Recommender systems, operate under multiple, systems may operate, Scalarisation Thompson Sampler, Recommender
Comments: 15 pages, 3 figures, 3 tables. Submitted-manuscript version of a paper accepted at ICPR 2026. The Version of Record will be published in the Springer Lecture Notes in Computer Science series; DOI will be added when available
Abstract:Recommender systems may operate under multiple, competing objectives. For example, audience reach, cultural values, public service mandate, and operational constraints must be balanced in editorial decisions of public service media. Existing approaches relying on fixed combinations of objectives or Pareto-based optimisation do not adapt to changing priorities across situations. In this paper, we propose Contextual Scalarisation Thompson Sampler (CSTS), a multi-objective contextual bandit method that learns to weight objectives as a function of the observed context. We evaluate CSTS on real programming data from Radio Télévision Suisse, the Swiss national broadcaster, showing improved contextual relevance and better alignment with expert curation practices compared to fixed weight and standard contextual bandit approaches.
8. 【2605.31171】MIMO: Multilingual Information Retrieval via Monolingual Objectives
Link: https://arxiv.org/abs/2605.31171
Authors: Youngjoon Jang,Seongtae Hong,Heuiseok Lim
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Multilingual Information Retrieval, reflects real-world search, real-world search environments, Multilingual Information, Information Retrieval
Abstract:Multilingual Information Retrieval (MLIR) reflects real-world search environments in which queries and relevant documents may appear in different languages within a mixed-language corpus. However, existing embedding models are primarily optimized for Multi-Monolingual retrieval and their performance often degrades in MLIR settings. Moreover, directly applying conventional contrastive learning to MLIR can exacerbate language clustering and expose a trade-off between cross-lingual alignment and embedding uniformity. To address these limitations, we propose MIMO: Multilingual Information Retrieval via Monolingual Objectives, a two-stage framework that uses a stable English semantic space from a high-performing teacher model as an anchor. MIMO first initializes the student model's cross-lingual alignment through knowledge distillation, and then jointly optimizes distillation and cross-lingual contrastive learning to improve retrieval discrimination while preserving alignment. Extensive experiments show that MIMO consistently outperforms existing cross-lingual training baselines across various MLIR and Multi-Monolingual benchmarks. MIMO also remains competitive with off-the-shelf models of similar or larger parameter scales. Furthermore, our cross-lingual Alignment-Uniformity analysis clarifies the distinct roles of the two loss components and shows that their combination yields a favorable trade-off between alignment and uniformity.
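For readers unfamiliar with the alignment and uniformity quantities mentioned above, the sketch below computes the commonly used definitions on unit-normalized embeddings: alignment as the mean squared distance between positive pairs, and uniformity as the log of the mean Gaussian potential over all pairs. Whether MIMO uses exactly these forms is an assumption; the function names and the temperature `t=2.0` are illustrative choices, not values from the paper.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance between positive pairs (lower = better aligned)."""
    return sum(sq_dist(normalize(x), normalize(y)) for x, y in pairs) / len(pairs)

def uniformity(embeddings, t=2.0):
    """Log mean Gaussian potential over all distinct pairs (lower = more uniform)."""
    unit = [normalize(e) for e in embeddings]
    potentials = [math.exp(-t * sq_dist(u, v)) for u, v in combinations(unit, 2)]
    return math.log(sum(potentials) / len(potentials))

# Hypothetical 2-d embeddings: one set spread around the circle, one clustered.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
```

A cross-lingual version of this analysis would presumably take a query and its translation as the positive pair, so that language clustering shows up as poor alignment; that pairing is also an assumption here.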
9. 【2605.31100】Vector Linking via Cross-Model Local Isometric Consistency
Link: https://arxiv.org/abs/2605.31100
Authors: Ziying Chen,Yang Cao,He Sun,Beining Yang,Tianjian Yang
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: partially overlapping datasets, embedding clouds produced, cross-model object correspondences, overlapping datasets, clouds produced
Comments: Accepted at ICML 2026
Abstract:We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. Code is available at this https URL.
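The reference-based idea described above can be sketched in a heavily simplified form: represent each vector by its distances to a sampled set of paired anchors, normalize away a global scale factor, match signatures across the two spaces, and turn the per-view votes into a Beta-Bernoulli posterior mean. This is not the authors' algorithm. The hashing step is replaced by brute-force nearest-signature search, the iterative bootstrapping of new anchors is omitted, and all names (`signature`, `link_posteriors`) and parameters (`views`, `k`, the uniform prior) are hypothetical.

```python
import math
import random

def dist(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def signature(vec, anchors):
    """Distances to anchors, divided by their mean so a global scale factor cancels."""
    d = [dist(vec, a) for a in anchors]
    mean = sum(d) / len(d)
    return [x / mean for x in d]

def link_posteriors(space_a, space_b, anchor_pairs, views=20, k=3,
                    prior=(1.0, 1.0), seed=0):
    """For each index i in space_a, return (best j in space_b, posterior mean)."""
    rng = random.Random(seed)
    votes = {i: {} for i in range(len(space_a))}
    for _ in range(views):
        sample = rng.sample(anchor_pairs, k)          # one "view" = one anchor subset
        anchors_a = [space_a[i] for i, _ in sample]
        anchors_b = [space_b[j] for _, j in sample]
        sigs_b = [signature(v, anchors_b) for v in space_b]
        for i, v in enumerate(space_a):
            sig = signature(v, anchors_a)
            j = min(range(len(space_b)), key=lambda c: dist(sig, sigs_b[c]))
            votes[i][j] = votes[i].get(j, 0) + 1
    alpha, beta = prior
    result = {}
    for i, counts in votes.items():
        j, wins = max(counts.items(), key=lambda kv: kv[1])
        # Beta(alpha, beta) prior updated with `wins` successes out of `views` trials.
        result[i] = (j, (alpha + wins) / (alpha + beta + views))
    return result

# Toy data: space_b is a rotated, rescaled, shuffled copy of space_a.
points = [(0.1, 0.3), (1.7, 0.2), (0.9, 1.9), (2.3, 1.1), (0.4, 2.6), (3.1, 2.8)]
perm = [3, 0, 5, 1, 4, 2]                       # space_b[j] comes from points[perm[j]]
space_a = [list(p) for p in points]
space_b = [[-2.0 * points[p][1], 2.0 * points[p][0]] for p in perm]
seeds = [(i, perm.index(i)) for i in range(4)]  # four known anchor pairs
links = link_posteriors(space_a, space_b, seeds)
```

The toy transformation is an exact similarity, so every view agrees. With real encoders only short-range distances are approximately preserved, which is presumably why the paper aggregates evidence across many views rather than trusting a single one.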
10. 【2605.31086】Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory
Link: https://arxiv.org/abs/2605.31086
Authors: Han Zhang,Zihao Tang,Xin Yu,Xiao Liu,Yeyun Gong,Haizhen Huang,Yan Lu,Weiwei Deng,Feng Sun,Qi Zhang,Hanfang Yang
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, underlying personas tend, long-term semantic consistency, lack long-term semantic
Abstract:In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios, interactions between users and assistants involve more diverse, heterogeneous data streams, such as documents and emails. These shortcomings significantly limit the realism and effectiveness of current evaluations. To address these limitations, we introduce RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory). Driven by meticulously crafted user profiles and a novel LOOP (pLan-rOllout-evOlve-Prune) module, we construct realistic dialogues across diverse interaction scenarios that exhibit dynamic temporal evolution and long-term coherence. Crucially, these dialogues are deeply integrated with heterogeneous external sources synchronized with the user's temporal event trajectory. The resulting benchmark encompasses challenging question-answer pairs spanning seven inquiry types, with each question mapping to at least one of 27 critical memory characteristics that we identify as essential yet underexplored in current research. Comprehensive experiments across full-context models, retrieval-augmented generation (RAG) methods, and representative memory frameworks reveal that contemporary approaches still expose critical weaknesses in complex, real-world settings, particularly in resolving multi-source aggregation and real-world contextual reasoning.
11. 【2605.31064】Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA
Link: https://arxiv.org/abs/2605.31064
Authors: Hao Chen,Xing Tang,Qirui Liu,Weijie Shi,Shiwei Li,Fuyuan Lyu,Weihong Luo,Xiku Du,Xiuqiang He
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, financial question answering, significantly advanced online
Comments: Accepted by KDD 2026 ADS track
Abstract:Large Language Models (LLMs) have significantly advanced online data services, particularly in the domain of financial question answering (FinQA). However, such systems remain susceptible to numerical reasoning hallucinations, which critically undermine reliability in high-stakes financial applications. Although retrieval-augmented generation (RAG) has been widely adopted to ground responses in external knowledge, it introduces three persistent challenges: noise sensitivity, calculation fragility, and an auditability crisis. Existing model-centric approaches, which primarily focus on optimizing either the retriever or generator in isolation, still struggle to address these issues in an integrated manner. In this work, we pioneer a data-centric paradigm and propose a novel framework, the Data-centric Reasoning Compiler (DCRC). The framework operates through three cohesive phases: (1) adversarial data construction, which synthesizes training examples with controlled noise to teach robustness; (2) multi-stage training that cultivates a Data-centric Structuring Agent (DSA) capable of explicit evidence auditing and program synthesis; and (3) a compile-and-execute inference process, where the DSA transforms user queries and retrieved documents into verifiable, executable reasoning programs. This data-driven framework ensures faithful numerical reasoning by design. We conduct extensive experiments on established offline benchmarks and further validate our framework through deployment in a real-world online financial QA system.
12. 【2605.31003】Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance
Link: https://arxiv.org/abs/2605.31003
Authors: Jiarui Che,Yifei Chen,Zhixing Tian,Chenyang Wang,Ziguang Cheng
Categories: Information Retrieval (cs.IR)
Keywords: e-commerce search systems, user query matches, query matches candidate, matches candidate products, search systems
Comments: 11 pages, 2 figures, 2 tables. Submitted to CIKM 2026
Abstract:Search relevance modeling is a core task in e-commerce search systems, assessing how well a user query matches candidate products. Rather than relying on a single holistic matching signal, relevance judgment often requires structured reasoning over query understanding, product understanding, and facet-level matching. With large language models (LLMs), this process is increasingly formulated as chain-of-thought (CoT) reasoning and optimized with reinforcement learning (RL). However, existing RL methods mainly rely on outcome-level rewards and treat the entire reasoning chain as a single optimization unit. This makes it difficult to distinguish faulty reasoning steps from correct intermediate ones, leading to misaligned credit assignment. Although process-reward methods provide denser supervision, they often treat reasoning steps independently and ignore dependency-driven error propagation, making responsibility attribution difficult and limiting the optimization of structured relevance reasoning. We propose Graph-GRPO, a graph-structured extension of GRPO for multi-component relevance reasoning. Graph-GRPO constructs a relevance reasoning dependency graph, where CoT steps are modeled as nodes and their logical dependencies as edges. It propagates outcome-level rewards over the graph to derive step-level credit signals, enabling more accurate fine-grained credit assignment. We further introduce a main-loss-driven controller that adaptively adjusts edge-wise credit-propagation coefficients. Together with CoT random masking for supervised policy initialization and graph-node-based multi-head distillation, we build a trainable and deployable framework for generative relevance modeling. Extensive offline evaluations and online A/B tests on a leading e-commerce platform demonstrate that the Graph-GRPO-based framework improves relevance classification metrics and key engagement metrics.
13. 【2605.30966】Reading Between the Citations: A Typed Claim Network for Scientific Literature
Link: https://arxiv.org/abs/2605.30966
Authors: Ning Ding,Sergio J. Rodríguez Méndez,Pouya G. Omran
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Knowledge graphs, legal opinions, encode the topology, graphs over corpora, scholarly inter-referencing documents
Abstract:Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community-level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross-document reference is reified as a typed claim, carrying source, target, claim text, and a four-class stance label grounded in the citation-intent literature. We give a construction pipeline applicable to any corpus of scholarly inter-referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated-stance summarisation, and topological analytics. Head-to-head evaluation against standard Retrieval-Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.
14. 【2605.30917】Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search
Link: https://arxiv.org/abs/2605.30917
Authors: Gyu-Hwung Cho(1 and 2),Youngjune Lee(1),Kiyoon Jeong(1),Siyoung Lee(1),Sanggyu Han(1),Hervé Dejean(3),Stéphane Clinchant(3),Seung-won Hwang(2) ((1) NAVER Corp., Republic of Korea, (2) Seoul National University, Republic of Korea, (3) Naver Labs Europe, France)
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: gained increasing attention, enterprise PDFs continue, large-scale visual-document corpora, lexically indexes visual, neural query encoding
Comments: 12 pages, 5 figures, 12 tables, preprint
Abstract:As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at this https URL.
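The two metrics quoted in this abstract, NDCG@5 and R@5, have standard definitions that a short sketch can make concrete. The version below uses linear gain (`rel / log2(rank + 1)`) and computes the ideal ranking from the relevance labels of the returned list only; both are simplifying assumptions, and the paper's exact evaluation protocol may differ. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k):
    """DCG@k divided by the DCG@k of the ideally sorted list (0.0 if nothing is relevant)."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

Under these definitions, a reported gain of "+13.8pp" in NDCG@5 means the averaged score rose by 0.138 on the 0 to 1 scale, that is, by 13.8 percentage points rather than by 13.8 percent relative.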
15. 【2605.30790】On the impact of retrieved content representations in RAG Pipelines
Link: https://arxiv.org/abs/2605.30790
Authors: Jonathan J Ross,Bevan Koopman,Anton van der Vegt,Guido Zuccon
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: RAG pipelines inherit, RAG pipelines, language model input, pipelines inherit retrieval, inherit retrieval components
Comments: 23 pages, 15 figures, submitted to ACL May 2026 ARR
Abstract:Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.
16. 【2605.30772】FOSTER: First-order Dataset Distillation for Text-based Sequential Recommendation
Link: https://arxiv.org/abs/2605.30772
Authors: Hung Vinh Tran,Tong Chen,Xinyi Gao,Junliang Yu,Julien Monteil,Hongzhi Yin
Categories: Information Retrieval (cs.IR)
Keywords: sequential recommender systems, greatly improving recommendation, improving recommendation accuracy, text-based sequential recommendation, Text-based sequential recommender
Abstract:Text-based sequential recommender systems, while greatly improving recommendation accuracy by incorporating item contexts, are undeniably more expensive to train. By condensing a large dataset into a compact set of synthetic samples for model training, dataset distillation offers a promising solution. However, its adoption in text-based sequential recommendation is non-trivial given the large pool of discrete items. This challenge is further compounded by language model-based item encoding, which makes bi-level optimization commonly used in dataset distillation prohibitively expensive. To this end, we propose First-order dataset distillation for Text-based Sequential Recommendation (FOSTER), which facilitates effectiveness and efficiency via three novel components: (1) stochastic item subset sampling that replaces costly full-corpus embedding extraction at each distillation step; (2) first-order optimization with trajectory-anchored parameter reset to avoid expensive bi-level gradient computation; and (3) regularization that explicitly promotes co-occurrence between semantically similar items in the synthetic sequences. Extensive experiments on three benchmarks show that FOSTER consistently outperforms existing dataset distillation and coreset selection baselines, approximating full-dataset performance using as few as 20 synthetic interaction sequences.
17. 【2605.30729】SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching
Link: https://arxiv.org/abs/2605.30729
Authors: Inwon Kang,Kavitha Srinivas,Nandana Mihindukulasooriya,Sola Shirai,Parikshit Ram,Horst Samulowitz,Oshani Seneviratne
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: heterogeneous data sources, integrating heterogeneous data, Pre-trained Language Models, fundamental step, step in integrating
Comments: Accepted to KDD 26 Research Track
Abstract:Schema matching is a fundamental step in integrating heterogeneous data sources. While Pre-trained Language Models (PLMs) have revolutionized this task by capturing linguistic semantics, they typically process tabular data as serialized text sequences of standalone column descriptions. This serialization discards critical structural information -- specifically, the row-level co-occurrences, i.e. the relational context -- forcing models to rely solely on column header semantics or standalone distributions. To bridge this gap, we propose SemStruct, a framework that joins the semantic power of frozen PLMs with the structural inductive bias of Graph Neural Networks (GNNs). We model the table as a heterogeneous graph where columns and values are nodes connected by rows, allowing the GNN to propagate disambiguating context across the structure. Unlike other state-of-the-art methods that require proprietary LLM access and fine-tuning of language models, SemStruct keeps the language model frozen and trains only a lightweight structural encoder. Extensive experiments on the Valentine and SOTAB-SM benchmarks demonstrate that SemStruct achieves state-of-the-art performance, outperforming fully fine-tuned baselines on complex, semantically joinable datasets. Furthermore, our ablation studies reveal that row representations serve primarily as topological conduits rather than semantic entities, validating the necessity of explicit structural modeling in schema matching.
18. 【2605.30604】An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations
Link: https://arxiv.org/abs/2605.30604
Authors: George Fatouros,Georgios Makridis,George Kousiouris,John Soldatos,Dimosthenis Kyriazis
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: enforces organization-level scope, cybersecurity workflows lack, Regulated cybersecurity workflows, regulated security operations, locally deployable
Comments: 8 pages, 3 figures
Abstract:Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert-driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization-scoped LLM agent runtime architecture for financial cybersecurity. The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first-class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human-in-the-loop (HITL) gates, and append-only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric-level pass criteria for architecture readiness, security-policy enforcement, evidence traceability, output quality, and operational observability.
19. 【2605.30407】Exploring Autonomous Agentic Data Engineering for Model Specialization
Link: https://arxiv.org/abs/2605.30407
Authors: Yujie Luo,Xiangyuan Ru,Jingsheng Zheng,Jingjing Wang,Yuqi Zhu,Jintian Zhang,Runnan Fang,Kewei Xu,Ye Liu,Zheng Wei,Jiang Bian,Zang Li,Shumin Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, high-quality domain-specific data, demonstrated strong performance, Language Models
Comments: Work in progress
Abstract:Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at this https URL.
20. 【2605.28918】When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
Link: https://arxiv.org/abs/2605.28918
Authors: Youting Wang,Yuan Tang,Bowen Liu,Xuan Liu,Dingyan Shang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: LLM-generated reward shaping, one-shot generation, LLM-generated reward, structured reinforcement-learning tasks, reward shaping
Abstract:For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than one-shot generation. We study PPO-trained agents using MiniGrid as core evaluation and MuJoCo as boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7% with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing the taxonomy prompt is a major mechanism and dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.
Computer Vision
1. 【2605.31604】Representation Forcing for Bottleneck-Free Unified Multimodal Models
Link: https://arxiv.org/abs/2605.31604
Authors: Yuqing Wang,Zhijie Lin,Ceyuan Yang,Yang Zhao,Fei Xiao,Hao He,Qi Zhao,Zihan Ding,Fuyun Wang,Shuai Wang,Youliang Zhang,Haoqi Fan,Xihui Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: aim to handle, separately pretrained VAE, Unified multimodal models, Unified multimodal, generation
Comments: Project page: [this https URL](https://yuqingwang1029.github.io/RepresentationForcing)
Abstract:Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
2. 【2605.31603】Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models
Link: https://arxiv.org/abs/2605.31603
Authors: Jiazheng Xing,Hangjie Yuan,Lingling Cai,Xinyu Liu,Yujie Wei,Fei Du,Hai Ci,Tao Feng,Jiasheng Tang,Weihua Chen,Fan Wang,Yong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Connector-based video unified, limiting achievable visual, instruction-grounded video synthesis, Connector-based video, computationally prohibitive
Comments: Project page ( [this https URL](https://jiazheng-xing.github.io/nexus-lumos-home/) ) and Code ( [this https URL](https://github.com/alibaba-damo-academy/Lumos-Custom/) ) are available
Abstract:Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at this https URL.
3. 【2605.31598】Linear Scaling Video VLMs for Long Video Understanding
Link: https://arxiv.org/abs/2605.31598
Authors: Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: encoders still rely, rely on spatiotemporal, latency to grow, grow quadratically, Video vision-language models
Comments:
Abstract:Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
4. 【2605.31597】SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models
Link: https://arxiv.org/abs/2605.31597
Authors: Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains challenging due, Measuring structured object, limited part-level supervision, inconsistent evaluation protocols, models remains challenging
Comments: Project page: [this https URL](https://genintel.github.io/SOCO/)
Abstract:Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
5. 【2605.31596】KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems
Link: https://arxiv.org/abs/2605.31596
Authors: Alireza Kheirandish, Jihoon Hong, Sara Fridovich-Keil
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: shown promising performance, computational imaging, OOD, OOD detection, shown promising
Comments: CVPR 2026
Abstract:Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems. Our code can be found at this https URL.
6. 【2605.31595】Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction
Link: https://arxiv.org/abs/2605.31595
Authors: Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: monocular video remains, computer vision, remains a fundamental, fundamental challenge, challenge in computer
Comments: Project page: see [this https URL](https://cvlab-kaist.github.io/C4G)
Abstract:Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
7. 【2605.31591】CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
Link: https://arxiv.org/abs/2605.31591
Authors: Nurjahan Sultana, Moi Hoon Yap, Xinqi Fan, Wenqi Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: AI-based skin cancer, severe performance drop, skin cancer screening, cancer screening suffer, expert dermoscopic
Comments: Accepted by CVPR 2026
Abstract:Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) images, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically "edited" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation "bakes" the clinical reasoning into the student's weights. On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. Implementation code is available at: this https URL
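Editor's sketch: the abstract above mentions a FiLM modulator driven by concept probabilities. As a rough illustration of what feature-wise linear modulation does in general (not the authors' implementation), the snippet below scales and shifts each feature channel using values predicted from a concept vector. All function names, shapes, and weights here are hypothetical.

```python
# Minimal FiLM (feature-wise linear modulation) sketch in pure Python.
# Every name, shape, and weight is a made-up illustration, not taken from the paper.

def linear(weights, bias, vec):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(features, concepts, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel using the concept vector."""
    gamma = linear(w_gamma, b_gamma, concepts)  # per-channel scale
    beta = linear(w_beta, b_beta, concepts)     # per-channel shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

def mse(a, b):
    """Mean squared error, e.g. between student and teacher features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

features = [1.0, 2.0, 3.0]   # hypothetical visual features (3 channels)
concepts = [0.5, 0.5]        # hypothetical concept probabilities
w_gamma = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
b_gamma = [1.0, 0.0, 0.0]
w_beta = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
b_beta = [0.0, 0.0, 0.0]

edited = film(features, concepts, w_gamma, b_gamma, w_beta, b_beta)
student = [1.0, 2.0, 3.0]    # hypothetical image-only student output
distill_loss = mse(student, edited)
```

In this toy setup the student would be trained to drive `distill_loss` toward zero, which loosely mirrors the idea of reproducing the edited representation instead of only the final predictions.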
8. 【2605.31590】TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation
Link: https://arxiv.org/abs/2605.31590
Authors: Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: faces challenging questions, generation faces challenging, faces challenging, challenging questions, questions when generating
Comments: 17 pages, 13 figures
Abstract:Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.
9. 【2605.31589】Recognizing Co-Speech Gestures in-the-Wild
Link: https://arxiv.org/abs/2605.31589
Authors: Sindhu B Hegde, K R Prajwal, Andrew Zisserman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: humans naturally gesture, specific spoken words, sparse subset, movements are visually, visually depictive
Comments:
Abstract:While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.
10. 【2605.31577】SurGe: Improved Surface Geometry in Point Maps
Link: https://arxiv.org/abs/2605.31577
Authors: Karim Knaebel, Gonzalo Martin Garcia, Christian Schmidt, Ilya Fradlin, Lucas Nunes, Daan de Geus, Bastian Leibe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reconstruction methods predict, Recent feedforward, methods predict point, reconstruction methods, predict point maps
Comments: Project page at [this https URL](https://vision.rwth-aachen.de/surge)
Abstract:Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
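Editor's sketch: the abstract above describes a normal metric based on the surface orientation induced by neighboring 3D predictions. One plausible way to derive such normals, assumed here and not confirmed by the abstract, is to cross the horizontal and vertical finite differences of the point map and compare directions by angle. The helper names and the toy 2x2 point maps are hypothetical.

```python
import math

# Assumed construction: normals from finite differences of a point map.
# The paper's exact metric may differ; this only illustrates the general idea.

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def normal_at(points, i, j):
    """Unit normal from right and down neighbors of pixel (i, j)."""
    dx = sub(points[i][j + 1], points[i][j])
    dy = sub(points[i + 1][j], points[i][j])
    return normalize(cross(dx, dy))

def angle_deg(n1, n2):
    """Angle between two unit normals, in degrees."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n1, n2))))
    return math.degrees(math.acos(dot))

# Toy point maps: a fronto-parallel plane (z = 1) and a tilted plane (z = x).
flat = [[(float(j), float(i), 1.0) for j in range(2)] for i in range(2)]
tilted = [[(float(j), float(i), float(j)) for j in range(2)] for i in range(2)]

error = angle_deg(normal_at(flat, 0, 0), normal_at(tilted, 0, 0))
```

Because the two toy planes differ by a 45-degree tilt, `error` comes out at 45 degrees, even though the points themselves could be close in absolute position.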
11. 【2605.31576】Joint Multi-Camera LiDAR Extrinsic Calibration via Learned Pairwise Initialization and Geometric Refinement
Link: https://arxiv.org/abs/2605.31576
Authors: Aziz Al-Najjar, Marzieh Amini, James R. Green, Felix Kwamena
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: camera-LiDAR pair independently, learning-based camera-LiDAR calibration, ignoring the rigid, learning-based camera-LiDAR, camera-LiDAR pair
Comments: Accepted at the CVPR 2026 Workshop URVI: Unified Robotic Vision with Cross-Modal Sensing and Alignment
Abstract:Most learning-based camera-LiDAR calibration methods treat each camera-LiDAR pair independently, ignoring the rigid geometric coupling in multi-camera platforms. As a result, per-camera estimates may be individually accurate yet inconsistent at the system level. We present a two-stage framework for joint multi-camera LiDAR extrinsic calibration that combines learned pairwise matching with geometric refinement. First, CMRNext is applied independently to each camera to produce initial extrinsic estimates and dense 2D-3D correspondences. These predictions are then jointly refined through a multi-frame bundle adjustment with reprojection, per-camera prior, and relative-pose prior terms. This approach converts pairwise predictions into a globally consistent multi-camera calibration. Experiments on KITTI (in-domain for CMRNext) and Walkley (out-of-domain) datasets show improved per-camera accuracy and inter-camera consistency. On KITTI, the method achieves 0.89 cm translation error and 0.038 rotation error. On Walkley, it reduces translation error from 108.6 cm to 3.1 cm, highlighting the benefit of explicit multi-camera coupling when single-camera predictions are less reliable.
12. 【2605.31572】nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving
Link: https://arxiv.org/abs/2605.31572
Authors: Zhiyu Huang, Johnson Liu, Rui Song, Zewei Zhou, Ruining Yang, Yun Zhang, Tianhui Cai, Hanyin Zhang, Mingxuan Gao, Valeria Xu, Jiali Chen, Yishan Shen, Yiluan Guo, Tony (Xuewei) Qi, Jiaqi Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: infer agent interactions, apply commonsense knowledge, understand spatial relations, make safe decisions, Reasoning
Comments:
Abstract:Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.
13. 【2605.31557】EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision
Link: https://arxiv.org/abs/2605.31557
Authors: Rosario Forte, Giuseppe Lando, Antonino Furnari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Continuous episodic memory, autonomous agents operating, Continuous episodic, real-world environments, operating in dynamic
Comments:
Abstract:Continuous episodic memory is a core capability for autonomous agents operating in dynamic, real-world environments, yet current streaming video benchmarks provide limited tools for diagnosing what models remember and for how long. We introduce EGOSTREAM, a diagnostic benchmark for streaming episodic memory evaluation in egocentric vision. EGOSTREAM organizes 2,250 curated questions along seven cognitive dimensions: detail, spatial, temporal, event, social, causal, and prospective memory. We introduce the Answer Validity Window (AVW), which specifies the temporal span over which an answer remains valid as the observed scene evolves. This allows us to expand the questions into 8,528 recall-conditioned evaluations, enabling controlled testing from instant to ultra-long-term recall while separating genuine model forgetting from natural world-state changes. We rigorously establish baseline performance through a unified streaming MLLM framework that compares several state-of-the-art memory-management mechanisms, covering sliding windows, attention sinks, KV-cache pruning, merging, and offloading. Experiments within a unified Qwen3-VL backbone reveal that comparable aggregate accuracies mask starkly different memory profiles. For instance, token pruning preserves fine-grained details and temporal structure significantly better than token merging, while quantized offloading rescues ultra-long-term recall. Ultimately, all mechanisms operate well below real-time (1s per frame), and top-performing methods plateau at about 45% accuracy, exposing critical gaps in current architectures. EGOSTREAM provides the diagnostic testbed needed to close these gaps.
14. 【2605.31556】Vision-Language Models Suppress Female Representations Under Ambiguous Input
Link: https://arxiv.org/abs/2605.31556
Authors: Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Keywords: Alignment teaches vision-language, expressing demographic biases, avoid expressing demographic, Alignment teaches, teaches vision-language models
Comments: 16 pages, 12 figures, 1 table
Abstract:Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when prompting with ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter: male signal amplifies end-to-end, while female signal peaks mid-network and is suppressed before generation. A color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.
15. 【2605.31551】SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation
Link: https://arxiv.org/abs/2605.31551
Authors: Parthsarthi Rawat
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Skeletal Tracking Challenge, FIFA Skeletal Tracking, Tracking Challenge, Skeletal Tracking, FIFA Skeletal
Comments: CVPR 2026 SoccerNet FIFA Skeleton Tracking Light Challenge, Rank 6
Abstract:We present our approach to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. Our method finetunes SMPLest-X (ViT-H, 687 M parameters) via a stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. Against the FIFA baseline score of 1.053 on the validation set, SMART achieves 0.647, a 38.6% improvement; on the held-out test set, SMART scores 0.593 (Global MPJPE: 0.324 m, Local MPJPE: 0.054 m).
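Editor's note: the 38.6% figure quoted above can be reproduced from the two validation scores, assuming the challenge score is an error-like quantity where lower is better (the abstract reports MPJPE components, which suggests this, but does not state it outright). The variable names are illustrative.

```python
# Relative improvement of SMART over the FIFA baseline on the validation set.
# Assumes a lower score is better; variable names are illustrative only.
baseline_score = 1.053
smart_score = 0.647

improvement_pct = (baseline_score - smart_score) / baseline_score * 100
print(f"{improvement_pct:.1f}%")  # prints 38.6%
```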
16. 【2605.31539】Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography
Link: https://arxiv.org/abs/2605.31539
Authors: Ashok Choudhary, Chris Varghese, Leo Y. Li-Han, Frank G. Lee, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Keywords: Postoperative pancreatic fistula, increasing morbidity, hospital stay, healthcare costs, Postoperative pancreatic
Comments:
Abstract:Postoperative pancreatic fistula (POPF) is a serious complication after pancreatic resection, increasing morbidity, hospital stay, and healthcare costs. We present an automatic, end-to-end deep learning pipeline, from pancreatic segmentation to classification, for preoperative POPF risk estimation and stratification using preoperative CT scans. A data set with auto-segmented pancreas volumes and surgical outcomes was used to evaluate multiple architectures, including a custom lightweight 3D CNN baseline (CNN3D), R(2+1)D ResNet-18, and ResNet-MC3-18 models. Evaluation across multiple 3D architectures demonstrated promising predictive performance. This approach offers a clinically valuable tool and a methodological benchmark for pancreas-specific CT classification, supporting improved preoperative decision-making in pancreatic surgery.
17. 【2605.31535】RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
Link: https://arxiv.org/abs/2605.31535
Authors: Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: multi-network system designs, turning self-supervised NVS, view synthesis, remains challenging, challenging to scale
Comments: Project page: [this https URL](https://compvis.github.io/rayder)
Abstract:Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: this https URL
18. 【2605.31534】Feature-Optimized Vision for Adaptive 3D Scene Reconstruction
Link: https://arxiv.org/abs/2605.31534
Authors: Eric Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Three-dimensional scene reconstruction, Three-dimensional scene, depends on local, visually discriminative, discriminative and geometrically
Comments:
Abstract:Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thresholds and uniform feature budgets are easy to deploy, but they can waste computation on repeated texture, low-parallax regions, or unstable points. This paper proposes an adaptive feature-optimized vision front end for 3D reconstruction. The method scores candidate features by texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage, then allocates a per-view feature budget to maximize useful tracks under a fixed reconstruction pipeline. A small synthetic multi-view prototype evaluates four selection policies across corridor, facade, object-table, and cluttered scenes. Compared with random, texture-only, and uniform-grid baselines, the adaptive policy obtains the best quality-aware completeness and the lowest aggregate reconstruction RMSE while preserving broad image coverage. The result is not a replacement for modern learned matching or neural reconstruction systems; it is a modular front-end policy that can make classical and learned 3D pipelines more deliberate about which visual evidence they spend compute on.
19. 【2605.31529】SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
Link: https://arxiv.org/abs/2605.31529
Authors: Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: True video intelligence, video intelligence demands, Strategic Video Intelligence, True video, video intelligence
Comments:
Abstract:True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.
20. 【2605.31513】Personalize Your Large Vision-language Models With In-context Prompt Tuning
Link: https://arxiv.org/abs/2605.31513
Authors: Yanshu Li, Jiaqian Li, Kuai Yu, Xi Xiao, Dongfang Liu, Tianyang Wang, Ruixiang Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large vision-language models, demonstrated strong general, strong general multimodal, general multimodal capability, Large vision-language
Comments: 27 pages, 10 figures, 5 tables
Abstract:Large vision-language models (LVLMs) have demonstrated strong general multimodal capability and are increasingly deployed in downstream systems. This trend has driven growing interest in LVLM personalization, which aims to enable models to quickly and effectively learn out-of-distribution multimodal concepts to meet user-specific needs. However, many existing methods rely on inference-time training, which reduces efficiency. They also struggle to maintain accuracy in complex multi-image, multi-concept settings. These limitations restrict the broader deployment of LVLM-based systems. Therefore, this paper proposes in-context prompt tuning (ICPT). Specifically, ICPT employs a lightweight projection module capable of operating in complex scenarios to extract fine-grained visual semantics from multiple reference images, seamlessly transforming these features alongside identity-label mappings into continuous prompts. To maximize computational efficiency, this module adaptively determines the prompt length based on the intrinsic visual complexity of each concept. Crucially, to overcome the environmental biases and cross-concept interference prevalent in real-world applications, we introduce two novel geometric regularizations. These constraints refine prompt representations by decoupling key identities from transient environmental states and separating concepts to avoid semantic confusion. Extensive experiments show that ICPT achieves state-of-the-art personalization accuracy across diverse tasks and LVLM backbones.
21. 【2605.31508】Internalizing Temporal Consistency in Video Object-Centric Learning without Explicit Regularization
Link: https://arxiv.org/abs/2605.31508
Authors: Rongzhen Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video Object-Centric Learning, Object-Centric Learning, video OCL methods, textit, aims to represent
Comments: 14 pages
Abstract:Video Object-Centric Learning (OCL) aims to represent objects as slot vectors and maintain their consistency across frames. Slot-Slot Contrastive (SSC) loss has become the cornerstone for state-of-the-art (SOTA) video OCL methods. While highly effective, SSC relies on one-to-one object correspondence across frames and introduces an extra loss. Following Occam's Razor, we propose a paradigm shift: temporal consistency is better enforced as an implicit model design rather than an explicit loss. To elegantly exclude SSC (xSSC), we introduce two quasi-zero-overhead synergistic mechanisms: (i) Chrono-Channel Decomposition (CCD) structurally disentangles slot representations along the channel dimension into static and dynamic sub-spaces, serving as an empirically unified information bottleneck; (ii) Cross-Temporal Reconstruction (CTR) stochastically reconstructs target features of either the current or previous time step by fusing current slots' static channels and target slots' dynamic channels, using a single standard OCL decoder with minor training adaptation. Thereby, the slot sets inherently learn temporal consistency by minimizing the standard reconstruction error alone. Extensive experiments show that integrating xSSC into leading baselines not only improves training efficiency but also establishes new SOTAs on video object discovery and recognition tasks. Furthermore, our PCA and gradient analyses confirm that objects' time-invariant semantics and time-variant kinematics are encoded into the proposed sub-spaces. Our source code, model checkpoints and training logs are provided on this https URL.
22. 【2605.31503】How can embedding models bind concepts?
Link: https://arxiv.org/abs/2605.31503
Authors: Arnas Uselis, Darina Koishigarina, Seong Joon Oh
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Humans easily determine, Humans easily, easily determine, determine which color, color belongs
Comments: ICML 2026
Abstract:Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at this https URL.
23. 【2605.31487】Enhancing Computer Vision Model Generalization in Warehouse Facilities: A Case Study on Anomaly Detection in Vertical Material Handling Systems
Link: https://arxiv.org/abs/2605.31487
Authors: Ruiliang Liu, Tina Dongxu Li, Joshua Migdal, Ken Meszaros, Trevor Dardik
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deploying computer vision, Facilities traditionally requires, Deploying computer, Warehouse Facilities traditionally, computer vision models
Comments: 6 pages, 10 figures. Accepted at IEEE International Conference on Mechatronics and Automation (ICMA) 2026
Abstract:Deploying computer vision models in Warehouse Facilities traditionally requires extensive resources for camera mounting, image collection, annotation, training, and deployment - a process often needing repetition in each new environment due to camera mounting constraints and environmental variability. This paper explores an innovative approach to streamline this process by conducting the standard procedure solely in a laboratory setting, focusing on vertical material handling systems and anomaly detection in forks of the systems. Through extensive experimentation, we have found that combining optimal camera placement, strategic image triggering, careful model selection and model ensemble enables effective generalization from laboratory conditions to diverse warehouse facilities environments, potentially transforming warehouse automation implementation by simplifying warehouse facilities deployment to just camera mounting, image collection, and model deployment, thereby saving significant resources and time typically spent on image annotation and model retraining. This is an experimental research study and not a production deployment.
24. 【2605.31466】VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching
Link: https://arxiv.org/abs/2605.31466
Authors: Tuan Duc Ngo, Chuang Gan, Evangelos Kalogerakis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single RGB image, RGB image remains, image remains challenging, inferring hidden structures, single RGB
Abstract:Reconstructing the complete geometry of a scene from a single RGB image remains challenging - especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.
25. 【2605.31457】VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning
Link: https://arxiv.org/abs/2605.31457
Authors: Hengbo Xu, Shengjie Jin, Yanbiao Ma, Zhiwu Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large multimodal models, visual, inference-time overhead, real-world deployment, reasoning
Comments: Accepted at ICML 2026
Abstract:With the rapid advancement of large multimodal models (LMMs), inference-time overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming the required visual evidence remains static during reasoning. However, we empirically show that visual evidence is strongly step-dependent: only a sparse subset of visual tokens is critical at each decoding step, and the critical set evolves across reasoning. Furthermore, we identify a coupled bottleneck where redundant visual context can steer the model toward query-irrelevant regions, lengthening the reasoning trace. Guided by these insights, we propose VisionPulse, a step-wise visual token pruning framework during reasoning. VisionPulse computes a lightweight visual attention mass to estimate the step-wise retention budget by exploiting its strong positive correlation with LMMs' effective visual token usage and retain only the most critical tokens under this budget. By enforcing visual sparsity during reasoning, VisionPulse filters redundant visual context while preserving relevant visual evidence, shortening reasoning traces naturally. Extensive experiments show that VisionPulse only retains 5% of visual tokens per step with reasoning traces shortened by 11.2%, while keeping accuracy almost unchanged.
26. 【2605.31437】Astra: a generalizable report generation foundation model for 3D computed tomography
Link: https://arxiv.org/abs/2605.31437
Authors: Zhuhao Wang, Fang Chen, Chaohui Yu, Zihan Li, Yuchao Zheng, Jing Wang, Xuan Yang, Jia Guo, Zhenlu Yang, Xingju Zheng, Yihua Sun, Haojie Han, Xiaoxiao Qin, Zhan Feng, Wenbo Xiao, Chao Zhu, Yuehua Li, Shipeng Zhang, Hao Luo, Yunsong Peng, Fan Wang, Hongen Liao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interpretation requires radiologists, making reporting time-consuming, slices per examination, highly expertise-dependent, interpretation requires
Abstract:CT interpretation requires radiologists to review hundreds of volumetric slices per examination, making reporting time-consuming and highly expertise-dependent. Automated CT report generation offers a promising route to improving clinical efficiency, yet the field still lacks a generalizable CT report generation foundation model that supports multi-region reporting and remains robust across external real-world cohorts. Intrinsic inconsistencies in reporting style and diagnostic terminology across cohorts make naive joint training prone to noisy textual supervision, thereby limiting model generalizability. Here we present Astra, a generalizable CT report generation foundation model trained on 90,678 thoracoabdominal CT-report pairs (CTRgDB) with 353,671 abnormalities spanning eight organ systems. By harmonizing report style and further refining diagnostic consistency via reinforcement learning, Astra achieves style-consistent and diagnostically accurate report generation across diverse anatomical regions and institutions. Evaluating on CTRgDB and six external cohorts, Astra achieves state-of-the-art performance with a 44.1% average improvement in fine-grained diagnostic metrics (P < 0.001). In real-world clinical workflows, Astra assistance accelerates chest report drafting by 29.6% and improves abdominal report completeness by 11.3% (P < 0.001). Furthermore, Astra also demonstrates broad utility as a foundation for CT AI development, improving downstream diagnostic performance and scaling vision-language pretrain through high-quality report synthesis. Overall, Astra serves as a broadly accessible clinical assistant and a pivotal infrastructure for the next generation of AI-powered healthcare.
27. 【2605.31429】YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models
Link: https://arxiv.org/abs/2605.31429
Authors: Ting Chen, Geng Li, Guohao Chen, Yu Hu, Guan Huang, Mai Chen, Langsheng Lei, Jun Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, visually degraded model, Large Vision-Language, Vision-Language Models, standard model
Comments: 21 pages, 11 figures
Abstract:Contrastive decoding (CD) seeks to mitigate hallucinations in Large Vision-Language Models (LVLMs) by contrasting the output distributions of a standard model and a visually degraded model. However, existing training-free CD methods suffer from sub-optimal degraded branches: completely dropping visual tokens is too extreme and induces language hallucinations, while corrupting input images offers coarse control over visual evidence and suffers from high inference latency due to requiring two full forward passes. To address these dilemmas, we propose YARD, a training-free Y-Architecture Register Decoding framework. Motivated by the observation that reliable text-to-vision grounding predominantly emerges in the middle decoder layers, YARD constructs the degraded branch internally by sharing shallow-layer computations and branching exactly at this critical stage. For the degraded branch, YARD replaces patch-level visual tokens with register tokens, which preserve global image semantics but lack fine-grained local evidence. This image-aware yet locally under-grounded design provides a faithful contrastive signal without extreme modality mismatch, while the Y-architecture strictly avoids a costly second forward pass. Extensive experiments on generative and discriminative hallucination benchmarks demonstrate that YARD consistently achieves state-of-the-art hallucination mitigation across multiple LVLMs, alongside a significant reduction in inference latency.
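For readers unfamiliar with contrastive decoding, the general idea named in the abstract above can be sketched as follows. This is a minimal illustration of the commonly used formulation (amplify the standard branch, subtract the degraded branch, and mask implausible tokens), not YARD's implementation. The function names and the `alpha` and `beta` parameters are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_decode(std_logits, degraded_logits, alpha=1.0, beta=0.1):
    """Return next-token probabilities after contrasting two branches.

    std_logits:      logits from the standard model (full visual input)
    degraded_logits: logits from the visually degraded branch
    alpha:           contrast strength (hypothetical default)
    beta:            plausibility cutoff relative to the top standard
                     probability (hypothetical default)
    """
    if len(std_logits) != len(degraded_logits):
        raise ValueError("both branches must cover the same vocabulary")
    p_std = softmax(std_logits)
    cutoff = beta * max(p_std)
    # Tokens the standard model already finds implausible are masked out,
    # so the subtraction cannot promote them.
    contrast = [
        (1 + alpha) * s - alpha * d if p >= cutoff else float("-inf")
        for s, d, p in zip(std_logits, degraded_logits, p_std)
    ]
    return softmax(contrast)
```

In this toy setting, a token that both branches score equally gets no boost, while a token that only the standard branch favours is pushed up.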
28. 【2605.31419】Triangle Splatting SLAM
Link: https://arxiv.org/abs/2605.31419
Authors: Nicholas Fry(1 and 2), Eric Dexheimer(2), Kirill Mazur(2), Paul H. J. Kelly(1 and 2), Andrew J. Davison(2) ((1) Software Performance Optimisation Group, Imperial College London, (2) Department of Computing, Imperial College London)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: dense RGB-D SLAM, RGB-D SLAM system, RGB-D SLAM, dense RGB-D, Gaussian Splatting
Comments: 26 pages, 11 figures
Abstract:We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.
29. 【2605.31400】FSM-Net: An Efficient Frequency-Spatial Network for Real-World Deblurring
Link: https://arxiv.org/abs/2605.31400
Authors: Vinh-Thuan Ly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: balance existing methods, Efficient Real-World Deblurring, image deblurring demands, Frequency-Spatial Multi-branch Network, Real-world image deblurring
Comments: Accepted to NTIRE Workshop at CVPR 2026. Project page: [this https URL](https://efficient-deblurring-fsmnet.vercel.app)
Abstract:Real-world image deblurring demands both high-fidelity restoration and computational efficiency, a balance existing methods often struggle to achieve. In this paper, we propose FSM-Net (Frequency-Spatial Multi-branch Network), a highly efficient solution that secured 2nd place in the NTIRE 2026 Challenge on Efficient Real-World Deblurring. FSM-Net pioneers a dual-domain approach: a novel Frequency Attention module explicitly recovers high-frequency structural details via FFT, while a Cross-Gated Vision E-Branchformer at the bottleneck captures global dependencies with linear complexity. To ensure robust convergence, we employ a progressive curriculum training strategy guided by a composite loss function (Multi-Scale Charbonnier, Structural Edge, and Frequency). Evaluated on the RSBlur benchmark, FSM-Net achieves an outstanding 33.144 dB PSNR with only 4.94M parameters and 159.35 GMACs (at 1920x1200 resolution). By effectively pushing the Pareto frontier of efficiency and quality, FSM-Net establishes a strong baseline for resource-constrained image restoration.
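As background for two terms in the abstract above, here is a small sketch of a single-scale Charbonnier loss and of PSNR in their usual textbook forms. It does not reproduce FSM-Net's multi-scale composite loss. The function names, the flat-list image representation, and the `eps` default are assumptions for illustration only.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2), a smooth L1 variant."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(
        math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)
    ) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_val]."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

With pixel values scaled to [0, 1], a mean squared error of 0.01 corresponds to 20 dB.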
30. 【2605.31376】LiftNav: Path Planning via Semantic Lifting in TSDF-Guided Gaussian Splatting
Link: https://arxiv.org/abs/2605.31376
Authors: Hannah Schieber, Dominik Frischmann, Victor Schaack, Angela P. Schoellig, Daniel Roth
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: unknown indoor environments, indoor environments require, Autonomous robots, object-level understanding, reliable collision avoidance
Abstract:Autonomous robots in unknown indoor environments require both reliable collision avoidance and object-level understanding. Classical representations such as TSDF support safe planning but lack semantics, while photorealistic methods like Gaussian Splatting (GS) provide rich appearance yet suffer from soft geometry, limiting precise obstacle avoidance. We present LiftNav, a hybrid navigation framework built on GSFusion's TSDF+GS dual map, augmented with a real-time pipeline of YOLO-based detection, TSDF-based 3D lifting, and B-spline trajectory optimization. This design enables flexible semantic navigation without dense 3D embeddings. We further introduce a hinge-loss-based collision penalty that improves trajectory smoothness and safety. We evaluate our approach in a simulation using the Replica dataset. Compared against a state-of-the-art radiance field baseline we show a 100% feasibility rate and shorter trajectories.
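The abstract above does not spell out its collision penalty, so the following is only a generic sketch of what a hinge-style penalty on obstacle distance can look like. Trajectory samples closer to an obstacle than a safety margin are penalised linearly, and samples farther away cost nothing. The function name, the `safe_margin` default, and the assumption that distances come from a distance field such as a TSDF are all hypothetical.

```python
def hinge_collision_penalty(distances, safe_margin=0.3):
    """Sum of hinge penalties max(0, safe_margin - d) over trajectory samples.

    distances:   distance (in metres) from each sampled trajectory point to
                 the nearest obstacle, e.g. looked up in a distance field
    safe_margin: clearance below which a sample starts to be penalised
                 (hypothetical default)
    """
    return sum(max(0.0, safe_margin - d) for d in distances)
```

Because the penalty is exactly zero outside the margin, an optimiser is only pushed away from obstacles where the trajectory actually comes too close.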
31. 【2605.31369】A Unifying View of Variational Generative Wasserstein Flows
Link: https://arxiv.org/abs/2605.31369
Authors: Paul Caucheteux, Clément Bonet, Anna Korba
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern generative models, Wasserstein gradient flows, geometric principles, viewed as minimizing, algorithmic and geometric
Comments: Accepted as a spotlight at ICML2026
Abstract:Many modern generative models can be viewed as minimizing divergences between probability distributions, yet they rely on different algorithmic and geometric principles. Wasserstein gradient flows provide a continuous-time formulation for optimizing over distributions, and can be approximated through their implicit discretization via the Jordan-Kinderlehrer-Otto (JKO) scheme. In this work, we present a unified theoretical framework for generative modeling based on Wasserstein gradient flows, which we refer to as Generative Wasserstein Flows (GWF). We show that a broad class of existing methods can be derived as instances of parametric JKO schemes for $f$-divergence objectives, and we establish equivalences between several recently proposed algorithms. We extend this framework beyond f-divergence to Integral Probability Metrics and squared Maximum Mean Discrepancy, deriving new JKO-based generative algorithms, and clarifying their connections with GANs. We study empirically the impact of the JKO regularization for a wide set of objectives. Finally, we analyze parametric Wasserstein flows, where the dynamics are restricted to distributions induced by parametrized maps.
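For reference, the JKO scheme mentioned in the abstract above is usually written as the following proximal step in Wasserstein space. The notation is an assumption of this note, not taken from the paper: $\mathcal{F}$ is the objective functional (for example a divergence to the target distribution), $\tau > 0$ the step size, and $W_2$ the 2-Wasserstein distance.

```latex
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\mathbb{R}^d)}
  \; \mathcal{F}(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k)
```

As $\tau \to 0$, the iterates $\rho_k$ approximate the Wasserstein gradient flow of $\mathcal{F}$, which is the sense in which the scheme is an implicit discretization.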
32. 【2605.31351】A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation
Link: https://arxiv.org/abs/2605.31351
Authors: Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visually Impaired Assistance, AI-based Visually Impaired, Impaired Assistance Benchmark, Visually Impaired, Impaired Assistance
Abstract:AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness–Impartiality–Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%; while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It enables positive improvements in diagnostic accuracy and downstream VIA responses more preferred by BLV users. Data and code are available at: this https URL
33. 【2605.31349】FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection
Link: https://arxiv.org/abs/2605.31349
Authors: Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Functionality Based Hateful, confounding rhetorical hate, Hateful meme detection, Based Hateful Memes, rhetorical hate mechanisms
Abstract:Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.
34. 【2605.31336】DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory
Link: https://arxiv.org/abs/2605.31336
Authors: Zhenhao Yang, Xiaoshi Wu, Zhengyao Lv, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Kun Gai, Kwan-Yee K. Wong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: promoted rapid progress, Recent advances, video generative models, generative models, promoted rapid
Comments: Project page is available at [this https URL](https://jeffreyyzh.github.io/DecMem-Page)
Abstract:Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
35. 【2605.31312】Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization
Link: https://arxiv.org/abs/2605.31312
Authors: Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal hallucination remains, Multimodal hallucination, Vision-Language Models, Direct Preference Optimization, visual preference DPO
Comments: ICML 2026
Abstract:Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability-gated regularizer that encourages consistency between multi-image contrastive training and single-image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at this https URL.
36. 【2605.31304】Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance
Link: https://arxiv.org/abs/2605.31304
Authors: Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Jonas Fischer, Robin Hesse
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deep neural networks, Deep neural, learn remains difficult, neural networks, Deep
Comments: Preprint
Abstract:Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.
37. 【2605.31294】TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens
Link: https://arxiv.org/abs/2605.31294
Authors: Qingcheng Zhao, Yifang Pan, Karan Singh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, interaction with language, Conditional Flow Matching, conversational interaction, Recent
Abstract:Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text generation, turn-based text response, speech synthesis, and audio driven facial animation. Based on our insight that audio-tokens produced by current Audio-LLMs carry sufficient information to reconstruct a plausible facial performance, we present TokTalk, a system that directly outputs expressive facial animation in real-time from streaming audio-tokens. We construct a novel audio-token to 3D facial motion dataset, on which TokTalk is trained using a Chunk-based Conditional Flow Matching model. A lightweight adaptation strategy allows our trained model to seamlessly connect to any token-based Audio-LLM at minimal computational overhead. Our chunk-based processing further enables parametric trade-off between latency and facial quality, shown through ablation studies. We further show that the real-time performance of TokTalk is comparable in latency to prior art solutions, and significantly favorable (via a perceptual study) in terms of quality, expressivity and control of the 3D facial performance. We showcase TokTalk's flexibility using a chatbot Avatar, a voice-driven user Avatar, and an animation Director's interface, as diverse audio-visual face applications.
38. 【2605.31292】Authentication of Copy Detection Patterns via Cross-Camera Dual-Synthetic Referencing
Link: https://arxiv.org/abs/2605.31292
Authors: Ivan Oleksiyuk, Roman Chaban, Slava Voloshynovskiy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Copy Detection Patterns, Detection Patterns, Copy Detection, enable cost-effective authentication, physical objects
Comments: To appear in Proc. ICIP2026, September 13-17, 2026, Tampere, Finland
Abstract:Copy Detection Patterns (CDPs) are structures printed on physical objects to enable cost-effective authentication. Verification is achieved by comparing a captured image with the digital template from which the CDP was printed. In practice, printer stochasticity and camera distortions hinder this comparison, limiting robustness against counterfeiting. Prior work addressed camera effects by synthesising reference images in the verification camera domain, but it ignored printing variability. We introduce an enrolment-based cross-camera dual-synthetic referencing framework. Each printed CDP is first captured by a controlled enrolment camera, and a deep-learning-based translator jointly exploits the digital template and the enrolled capture to generate a high-quality reference for the verification image. We provide an information-theoretic justification showing that the dual reference is more informative than template-based references. Experiments on heterogeneous mobile cameras demonstrate improved authentication performance, robustness to machine-learning-based copy attacks, and reliable verification from small CDP regions and on low-end devices.
39. 【2605.31284】SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy
Link: https://arxiv.org/abs/2605.31284
Authors: Suyog Jadhav, Dilip K. Prasad, Krishna Agarwal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: understanding cellular health, energy production, cellular health, metabolic regulation, morphological analysis
Comments: Accepted at PHAROS-AIF-MIH workshop @ CVPR 2026
Abstract:The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production, and metabolic regulation. While foundation models like the Segment Anything Model (SAM) have revolutionized natural image segmentation, their direct application to FM is hindered by a significant domain shift characterized by diffraction-limited resolution, low contrast, and complex overlapping organelle networks. Furthermore, the development of robust models is bottlenecked by a severe lack of high-quality, manually annotated instance segmentation datasets for mitochondria. In this paper, we propose a scalable solution to this data scarcity by finetuning SAM exclusively on synthetically generated FM data. We simulate realistic mitochondria data and emulate the optical properties of fluorescence microscopes to create a large-scale annotated dataset. We evaluate our fine-tuned model on a curated dataset of real, manually annotated FM images. Qualitative and quantitative analyses demonstrate that our synthetically fine-tuned model improves precision and average dice score over strong baselines. This work establishes the potential of simulation-assisted training for FM instance segmentation.
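The Dice score cited in the abstract above is a standard overlap measure between a predicted and a reference mask. A minimal sketch on flat binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, and the paper's exact evaluation protocol (for instance how instances are matched before averaging) is not described in the abstract.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground pixels.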
40. 【2605.31283】Topologically Consistent Multi-view 3D Head Reconstruction via Coarse-Guided Layered Surface Sampling
Link: https://arxiv.org/abs/2605.31283
Authors: Timo Bolkart, Daoye Wang, Prashanth Chandran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semantic Head Estimation, Head Estimation, efficient feed-forward framework, Semantic Head, dense semantic correspondence
Comments: SIGGRAPH Conference Papers 2026
Abstract:We present SHELLS (Semantic Head Estimation via Layered Local Sampling), an efficient feed-forward framework for 3D head reconstruction in dense semantic correspondence from multi-view images. Existing methods typically refine vertices independently via localized feature volumes. This approach couples memory-intensive feature sampling to mesh resolution, which limits scalability for dense topologies (> 10k vertices) and introduces surface noise. In contrast, SHELLS decouples feature extraction from mesh resolution via a hierarchical sampling strategy. We extract multi-view features using a DINOv2 backbone with LoRA adaptation, projectively sample a sparse global feature cloud, and predict an intermediate coarse mesh. This coarse prior guides the construction of layered, surface-aware sampling shells that serve as a discrete search space for the final reconstruction. SHELLS maintains surface consistency while using 88% less inference GPU memory (2.4GB vs. 20GB) than volumetric baselines. It reduces median registration error by 21% to 29% with a 3.5x inference speedup (0.08s vs. 0.29s) for 18k-vertex meshes. Notably, our model is trained exclusively on synthetic data yet generalizes effectively to real-world captures, eliminating the need for the costly, pre-registered multi-view datasets common in prior work.
41. 【2605.31271】DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions
Link: https://arxiv.org/abs/2605.31271
Authors: Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Driving VLA framework, language-action gap limits, Driving VLAs, Driving VLA, language to improve
Comments: arXiv admin note: text overlap with [arXiv:2605.21273](https://arxiv.org/abs/2605.21273)
Abstract:Driving Vision-Language-Action Models (Driving VLAs) aim to use language to improve end-to-end planning, but the language-action gap limits this promise. We propose DriveMA, a Driving VLA framework built on verifiable meta-actions, which summarize future ego motion into compact language-domain intentions and can be constructed from expert trajectories with a trajectory-grounded annotation pipeline and can be verified against generated trajectories through rule-based projection. DriveMA exploits this verifiability with action-centric supervised training and a data-efficient turn-level credit assignment reinforcement learning framework, explicitly aligning high-level decisions with low-level trajectory planning through dense rewards and precise credit assignment. DriveMA sets a new state of the art on the Waymo Open Dataset Vision-based E2E Driving, achieving a Rater Feedback Score of 8.060 with a 2B model and further improving it to 8.079 with a 4B model; it also obtains competitive closed-loop planning performance on NAVSIM. These results show that even a simple meta-action interface can achieve state-of-the-art planning when made verifiable and optimized for language-action alignment. Code, data, and models will be released to facilitate future research.
42. 【2605.31266】Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation
Link: https://arxiv.org/abs/2605.31266
Authors: Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: task enables fine-grained, enables fine-grained control, task enables, spatial layouts, enables fine-grained
Comments: Accepted to ICML 2026; code available at [this https URL](https://github.com/iCVTEAM/DSP)
Abstract:The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few-shot atypical settings. We term this failure as representation fragmentation, arising from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation-driven framework that disentangles semantics from primitives for robust few-shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency-aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5-shot regime over state-of-the-art L2I methods in both visual fidelity and alignment across diverse atypical domains. The source code is publicly available at this https URL.
43. 【2605.31251】ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
Link: https://arxiv.org/abs/2605.31251
Authors: Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Multimodal large language, shown strong potential, remains underexplored due, Multimodal large, geo-localization remains underexplored
Comments:
Abstract:Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: this https URL
44. 【2605.31246】BadBone: Backdoor Attacks Against Backbone Models in Visual Prompt Learning
Link: https://arxiv.org/abs/2605.31246
Authors: Ziqing Yang, Rui Wen, Xinlei He, Yun Shen, Michael Backes, Yang Zhang
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: attracted ample attention, ample attention due, proven efficacy, machine learning paradigm, Prompt learning
Comments: Accepted by IEEE Transactions on Information Forensics and Security
Abstract:Prompt learning is a new machine learning paradigm that has attracted ample attention due to its simplicity and proven efficacy. Despite its growing adoption, the security vulnerabilities associated with this paradigm remain underexplored. In this work, we take the first step to propose BadBone, a stealthy and adaptive backdoor attack against prompt learning using bi-level optimization. Instead of backdooring the prompt learning process, we aim to compromise a backbone model such that only target downstream tasks employing prompt learning inherit the backdoor vulnerability. Extensive experiments on three different models and three datasets from various domains show that our targeted/untargeted backdoored models achieve high attack performance while maintaining utility on both pre-training and downstream tasks. Moreover, we evaluate our approach against six state-of-the-art model-level defenses, including Neural Cleanse, ABS, MNTD, NAD, CLP, and D-BR. The results demonstrate that these defenses are largely ineffective against our backdoored models and thus leave the effective defense as an important direction for future work.
45. 【2605.31229】Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval
Link: https://arxiv.org/abs/2605.31229
Authors: Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: remains critically underexplored, tasks remains critically, retrieval tasks remains, continually updating, critically underexplored
Comments:
Abstract:While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model this http URL achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlight the unique challenges of CMR and encourage further research in this direction.
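The abstract mentions adapters selected through prototype-based routing without giving details. The sketch below shows one common way such routing can work, under loudly stated assumptions: each adapter is represented by a single prototype vector, similarity is cosine, and the top-k adapters are chosen. The names `route` and `prototypes`, the domain labels, and the similarity measure are all hypothetical; this is not the authors' DAR implementation, and the step that combines the selected adapters is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query, prototypes, top_k=1):
    """Return the names of the top_k adapters whose prototypes best match the query."""
    ranked = sorted(
        prototypes,
        key=lambda name: cosine(query, prototypes[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical per-domain prototypes (for example, mean embeddings of each domain).
prototypes = {
    "birds": [1.0, 0.0, 0.0],
    "cars": [0.0, 1.0, 0.0],
    "sketches": [0.0, 0.0, 1.0],
}
selected = route([0.9, 0.3, 0.1], prototypes, top_k=2)
print(selected)
```

Routing at inference time means no task identity has to be supplied with the query, which matters for retrieval, where queries arrive without domain labels.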
46. 【2605.31227】HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding
Link: https://arxiv.org/abs/2605.31227
Authors: Andrea Zenotto, Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Procedural activities follow, follow well-defined structures, activities follow well-defined, Procedural activities, repairing a car
Comments: Technical report for the Ego4D Goal Step - Step Grounding challenge at CVPR 2026, derived from [arXiv:2505.12911](https://arxiv.org/abs/2505.12911)
Abstract:Procedural activities follow well-defined structures: whether we consider a cooking recipe or a mechanic repairing a car, these activities naturally decompose in a hierarchy of steps and sub-steps. Traditional approaches for step grounding require extensive annotations and scale poorly. Instead, we argue that such hierarchical structure can emerge naturally from uncurated videos of human activities through recurring patterns of co-occurring actions and activities. Our approach builds on HiERO, a weakly-supervised representation learning approach that maps close in the feature space actions that are functionally related to each other, leveraging only fine-grained action-level narrations. In this feature space, procedure steps can be detected by a simple clustering, with no additional task-specific fine-tuning. For the Ego4D Step Grounding challenge, we augment this approach by ensuring fine and coarse level agreement in step assignments, enforcing strict temporal monotonicity of the grounded steps and post-processing the detected steps to reduce the impact of noisy predictions. We call this approach HiERO-StepG and it achieves 56.27 % on the R@1 (IoU = 0.3) metric on the global leaderboard at submission time, ranking second while being completely zero-shot and not requiring procedure-specific annotations. Project page: this https URL.
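The R@1 (IoU = 0.3) metric quoted above can be illustrated with a short sketch. It assumes one top-ranked predicted segment per query and segments given as (start, end) in seconds; the helper names and the toy data are illustrative, and the official challenge evaluation may differ in details.

```python
def temporal_iou(a, b):
    """IoU of two time segments, each given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, ground_truths, iou_threshold=0.3):
    """Fraction of queries whose top-1 segment overlaps the ground truth enough."""
    hits = sum(
        1
        for pred, gt in zip(top1_preds, ground_truths)
        if temporal_iou(pred, gt) >= iou_threshold
    )
    return hits / len(ground_truths)


# Toy data: three step queries with their top-ranked predicted segments.
preds = [(0.0, 10.0), (20.0, 30.0), (50.0, 55.0)]
gts = [(5.0, 15.0), (40.0, 45.0), (50.0, 60.0)]
score = recall_at_1(preds, gts)
print(score)  # queries 1 and 3 pass the 0.3 threshold, query 2 does not
```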
47. 【2605.31219】Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks
Link: https://arxiv.org/abs/2605.31219
Authors: Ei Hmue Khine, Yao Li, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Boying Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: severe security threat, current methodologies suffer, black-box adversarial attacks, adversarial attacks present, decision-based black-box adversarial
Comments: 14 pages, 9 figures, 7 tables. Submitted to IEEE Transactions on Information Forensics and Security. The source code is available at [this https URL](https://github.com/eihmuekhine/Latent-Geometric-Chords)
Abstract:While decision-based black-box adversarial attacks present a severe security threat, current methodologies suffer from fundamental limitations. Pixel-wise attacks frequently introduce unnatural, high-frequency visual artifacts, while latent-space frameworks are confined by the limited search space of low-dimensional manifolds and inherent reconstruction flaws. To resolve these limitations, we propose Latent Geometric Chords (LGC) for Query-Efficient Decision-Based Adversarial Attacks alongside a variant, LGC-H. At its core, LGC navigates decision boundaries by executing a curvature-aware geometric search within a compressed semantic manifold. To guarantee high visual fidelity and circumvent dimensionality bottlenecks, we introduce a Residual-based Adversarial Generation (RAG) mechanism. RAG isolates semantic perturbations as geometric chords and superimposes them directly onto the original source image. RAG substantially resolves baseline reconstruction flaws and effectively doubles the permissible search space dimensions. Experimental results demonstrate that LGC achieves robust cross-dataset transferability and substantially outperforms state-of-the-art baselines. Notably, our method, LGC, minimizes perturbation magnitudes while achieving state-of-the-art visual fidelity--with a Structural Similarity Index Measure (SSIM) exceeding 0.99 and a Learned Perceptual Image Patch Similarity (LPIPS) below 0.01 at 5000 queries--and sustaining high attack success rates under stringent perceptual constraints, successfully compromising adversarially trained robust models. The source code is available at: this https URL.
48. 【2605.31217】TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation
Link: https://arxiv.org/abs/2605.31217
Authors: Abid Ali, Arunkumar Rathinam, Djamila Aouada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: process individual frames, estimation methods predominantly, methods predominantly process, predominantly process individual, image sequence acquired
Comments: 13 pages paper with 3 figures in total
Abstract:Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in an image sequence acquired during spacecraft manoeuvres. Few temporal approaches require full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen ViT vision transformer, combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds less than 5% parameters to the frozen backbone. On SPADES dataset, TALON reduces the pose error by 50% over the prior state-of-the-art, and on SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. Zero-shot cross-domain evaluation from sim-to-real on SPARK real data reduces pose error by 4.7x, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.
49. 【2605.31215】Fixed-Point Masked Generative Modeling
Link: https://arxiv.org/abs/2605.31215
Authors: Andrea Miele, Yiming Qin, Alba Carballo-Castro, Justin Deschenaux, Pascal Frossard
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Masked Generative Models, achieve strong performance, require full-sequence bidirectional, full-sequence bidirectional transformers, Generative Models
Comments:
Abstract:Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but they still allocate a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make it more effective for masked generation, we first introduce a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps and, second, three-state reuse (3SR) which warm-starts the solver using the previous solution by treating differently unchanged, still-masked, and newly revealed tokens respectively. Together, these components define our complete training-to-inference framework for fixed-point masked generation, CoFRe. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the quality and cost trade-off. On OpenWebText, CoFRe reduces parameters by 38.8%, training time by 11.5%, and VRAM by 16.9%, while improving generative perplexity from 830.8 to 101.8 at a budget of 96 transformer-block forward passes, compared to MDLM. In ImageNette, CoFRe reduces training time by 48.6% and VRAM by 50.7%, while improving FID in all sample budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.
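To make the idea of a fixed-point solver with warm starting concrete, here is a deliberately tiny sketch. It is a scalar toy, not the paper's solver: the map `math.cos` merely stands in for a shared layer, plain iteration stands in for whatever solver the authors use, and the per-token treatment of 3SR is not modeled. It only illustrates why starting from a previous solution can reduce the number of iterations, and why the iteration count (the effective depth) adapts to the input.

```python
import math


def fixed_point(f, z0, tol=1e-8, max_iters=100):
    """Iterate z <- f(z) until successive iterates differ by less than tol.

    Returns (solution, number_of_iterations).
    """
    z = z0
    for i in range(1, max_iters + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iters


def shifted(z):
    """A slightly perturbed map, standing in for the next denoising step."""
    return math.cos(z) + 0.01


# Step 1: solve z = cos(z) from a cold start.
cold_solution, cold_iters = fixed_point(math.cos, 0.0)

# Step 2: solve the perturbed problem, once cold and once warm-started.
_, cold_iters_shifted = fixed_point(shifted, 0.0)
warm_solution, warm_iters = fixed_point(shifted, cold_solution)

print(cold_iters_shifted, warm_iters)
```

Because neighboring steps have nearby solutions, the warm-started solve begins much closer to its fixed point and stops earlier.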
50. 【2605.31212】Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education
Link: https://arxiv.org/abs/2605.31212
Authors: Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: educational content creation, support educational content, content creation, intended to teach, systems are increasingly
Comments:
Abstract:AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.
51. 【2605.31204】Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
Link: https://arxiv.org/abs/2605.31204
Authors: Johannes Schusterbauer, Jannik Wiese, Nick Stracke, Timy Phan, Björn Ommer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate weather forecasts, Accurate weather, Accurate, extreme weather conditions
Comments: CVPR 2026, Project Page: [this https URL](https://compvis.github.io/weather-rf/)
Abstract:Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce FREUD, a Frame-wise Encoder and United Decoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty via ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by model and test-time scaling. Code available here: this https URL
52. 【2605.31196】Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
Link: https://arxiv.org/abs/2605.31196
Authors: Jun Wang, Xiaohao Xu, Xiaonan Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: Safe human, robot collaboration requires, safely separated, collaboration requires, Safe
Comments: 31 pages, 9 figures
Abstract:Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.
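Macro-F1, the headline metric above, averages per-class F1 scores so that rare classes count as much as common ones. The sketch below shows why a monitor that always answers "safe" scores poorly on it. The three label names are hypothetical shorthand for the states described in the abstract, and the toy data is invented for illustration.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["safe", "colliding", "imminent"]  # hypothetical class names
y_true = ["safe", "safe", "safe", "safe", "colliding", "imminent"]
y_pred = ["safe"] * 6  # a monitor that never raises an alarm

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
print(accuracy, score)
```

In this toy case accuracy is about 0.67 while Macro-F1 is about 0.27, which is the kind of gap that makes Macro-F1 a more demanding measure for safety-critical, imbalanced classes.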
53. 【2605.31192】The Regularizing Power of Language-Training Deepfake Detectors
Link: https://arxiv.org/abs/2605.31192
Authors: Benedikt Hopf, Zongwei Wu, Radu Timofte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: advent of Multimodal-LLMs, Recently, artifacts, describable artifacts typically, artifacts typically generalize
Comments:
Abstract:Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.
54. 【2605.31191】Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Link: https://arxiv.org/abs/2605.31191
Authors: Umut Onur Yasar
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: relationships modulate knowledge, ResNet-based image classification, capacity relationships modulate, modulate knowledge distillation, effectiveness in ResNet-based
Comments: 9 pages, 2 figures, 5 tables. Code available at [this https URL](https://github.com/umutonuryasar/kd-capacity-gap)
Abstract:We investigate how teacher-student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet-based image classification on CIFAR-10. Across three teacher-student pairs -- R50-R18, R34-R18, and R50-R34 -- we compare Logit-KD and Feature-KD under controlled, reproducible conditions (3 seeds, mean+/-std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher-student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50-R34 Feature-KD versus +0.18pp for R34-R18 Feature-KD and +0.00pp for R34-R18 Logit-KD. Second, implementation correctness critically affects Feature-KD: a gradient clipping bug that excluded projection layers suppressed Feature-KD performance and produced misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs, reaching 95.55% on R50-R34 against a baseline of 95.25%. Third, input-resolution-aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp -- an order of magnitude larger than any KD gain. All code and results are available at this http URL.
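As background for the Logit-KD setting compared above, the following sketch computes the classic temperature-softened distillation loss (KL divergence between teacher and student distributions, scaled by T squared). The abstract does not state the temperature, the loss weighting, or the exact formulation the author used, so the default `temperature=4.0` and the function names here are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


teacher = [5.0, 1.0, -2.0]
student = [3.0, 2.0, 0.0]
loss = logit_kd_loss(student, teacher)
print(loss)
```

In training, this term is usually added to the ordinary cross-entropy on the ground-truth labels with a mixing weight, which is another hyperparameter the abstract leaves unspecified.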
55. 【2605.31187】From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift
Link: https://arxiv.org/abs/2605.31187
Authors: Firas Gabetni, Alexandre Rocchi Henry, Nacim Belkhir, Ziyi Liu, Gianni Franchi
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: reliable vision systems, building reliable vision, Detecting covariate shift, explicitly detecting covariate, Detecting covariate
Comments:
Abstract:Detecting covariate shift is critical for building reliable vision systems. While most prior work focuses on improving robustness to shift, explicitly detecting covariate shift remains underexplored. Existing approaches typically rely on fully supervised training, requiring labeled examples from both original and shifted distributions, which is often impractical. In this paper, we show that covariate shift detection can be effectively addressed with weaker supervision using Positive Unlabeled (PU) learning. However, under covariate shift, in distribution and shifted data overlap significantly, making classical PU methods unstable and sensitive to noise. To overcome this challenge, we introduce Spectral PU Neighborhood Annotation (SPUNA), a geometry aware framework that progressively discovers shifted data by leveraging the local manifold structure of visual features. Extensive experiments show that SPUNA achieves state of the art performance in PU settings and remarkably matches the performances of fully supervised methods. Moreover, our approach transfers robustly across different types of shifts, demonstrating strong generalization capabilities.
56. 【2605.31177】Vanilla ViT for Automotive Point Cloud Semantic Segmentation
Link: https://arxiv.org/abs/2605.31177
Authors: Gilles Puy, Nermin Samet, Alexandre Boulch, Spyros Gidaris, Tuan-Hung VU, Renaud Marlet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Plain Transformers, processing text, offering a unified, multimodal learning, unified backbone
Comments:
Abstract:Plain Transformers have become the de-facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state-of-the-art architectures for point cloud semantic segmentation remain dominated by U-Nets architectures where convolutions are interleaved with local or windowed attentions. In this work, we show how to effectively leverage vanilla, non-hierarchical ViTs for segmentation of large-scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT for Vanilla ViT, matches or exceeds the performance of state-of-the-art methods while maintaining the simplicity of ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at this https URL.
57. 【2605.31174】Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning
Link: https://arxiv.org/abs/2605.31174
Authors: Wenlun Zhang, Jun Yin, Kentaro Yoshioka
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: heterogeneous object distributions, remains challenging due, diverse image degradations, degradations and heterogeneous, hinder the generalization
Comments:
Abstract:Object detection in real-world scenarios remains challenging due to diverse image degradations and heterogeneous object distributions, which significantly hinder the generalization of existing detectors. Conventional approaches, including scene-specific representation learning and end-to-end pipeline design, are inherently limited by their reliance on predefined conditions and lack adaptability to dynamic environments. In this paper, we propose DetAS, an agentic detection framework that formulates object detection as a dynamic decision process. Instead of relying on static pipelines, DetAS leverages a Multimodal Large Language Model (MLLM) as a central agent to adaptively compose detection workflows by selecting from a toolbox of restoration modules and specialized detectors. Specifically, DetAS consists of two key components: Self-Adaptive Image Restoration, which dynamically determines whether and how to enhance images for downstream detection, and Multi-Expertise Detection, which integrates multiple domain-specialized detectors and resolves their predictions through instance-level reasoning. To further improve decision quality under fine-grained conditions, we introduce Self-Evolving Experience Harvesting and extend the framework to DetAS-X, which accumulates node-level decision experience from a small set of annotated data and enables experience-aware reasoning during inference. This mechanism allows the system to progressively refine its decision policy and adapt to diverse real-world scenarios. Extensive experiments on six challenging benchmarks demonstrate that DetAS-X significantly outperforms existing MLLM-based detectors, achieving an average improvement of 28.36% in F1 score, with up to 37.01% gain on DarkFace. These results demonstrate the promise of agentic detection and establish a solid foundation for its application in complex and dynamic environments.
58. 【2605.31162】Guidance for Low-Level Perceptual Editing in Unconditional Diffusion Models
Link: https://arxiv.org/abs/2605.31162
Authors: Shreyansh Modi, Akshat Tomar, Aarush Aggarwal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: powerful generative priors, remains largely unexplored, offer powerful generative, aesthetically enhanced outputs, enhanced outputs remains
Comments: 11 pages, 12 figures, Generative Models for Computer Vision Workshop CVPR 2026
Abstract:Unconditional diffusion models offer powerful generative priors, yet steering them toward aesthetically enhanced outputs remains largely unexplored. We show that h-space patching, the dominant paradigm for training-free diffusion editing, systematically fails for global, low-level transformations required for aesthetic and perceptual refinement. We introduce a novel, generalized framework for image-editing in unconditional diffusion models without explicit training. This inference-time mechanism operates on low-level features by extracting degradation concept vectors and combining bottleneck patching with classifier-free guidance to guide sampling away from the degraded manifold, producing consistently improved images without any model retraining.
59. 【2605.31158】Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models
Link: https://arxiv.org/abs/2605.31158
Authors: Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: virtual scene navigation, real-time game simulation, generate video chunk, user-controlled camera movements, Interactive video world
Comments: 13 pages, 6 figures, 3 tables. Project page: [this https URL](https://2843721358l-del.github.io/Light-Interaction-Project/)
Abstract:Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.
60. 【2605.31153】BIAS-ID: A Framework for Analyzing Transformation Biases in AI-Generated Image Detectors
Link: https://arxiv.org/abs/2605.31153
Authors: Jonas Ricker, Asja Fischer, Erwin Quiring
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reliably distinguishing authentic, urgent research topic, AI-generated imagery online, distinguishing authentic images, harmful AI-generated imagery
Abstract:Given the surge of harmful AI-generated imagery online, reliably distinguishing authentic images from generated ones has become an urgent research topic. While many proposed detection methods perform well under controlled settings, they often collapse when tested on real-world data. A potential root cause is subtle biases in the detectors' training data. As a result, detectors may rely on spurious correlations instead of learning true forensic artifacts. While a recent line of work has identified the problem, there is not yet an established protocol to evaluate how biased a detector actually is. In this work, we therefore take a step back: First, we discuss what it means for a detector to be biased, and how this differs from a lack of robustness. Second, we propose BIAS-ID, a transparent framework for analyzing and quantifying the presence of transformation biases in AI-generated image detectors. We validate our framework by performing an evaluation of six detectors across two datasets, revealing that several state-of-the-art detection methods are strongly affected by biases. Our results highlight the importance of bias-aware evaluation for developing reliable AI-generated image detectors.
61. 【2605.31148】SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
Link: https://arxiv.org/abs/2605.31148
Authors: Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: form cognitive representations, perceive spatial layouts, effortlessly perceive spatial, form cognitive, cognitive representations
Abstract:Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.
62. 【2605.31145】FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization
Link: https://arxiv.org/abs/2605.31145
Authors: Mohammed Asad Karim, Vinay Kumar Verma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: seeks to localize, localize a target, small set, visually grounded ICL, grounded ICL remains
Comments: Accepted at ICML 2026. * Equal Contributions
Abstract:In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
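To make the GRPO step more concrete, here is a minimal sketch of group-relative advantages computed for a group of sampled boxes. Using IoU against the ground-truth box as the reward, the `(x1, y1, x2, y2)` box format, and the function names are all assumptions for illustration; the abstract does not specify the reward used in the paper.

```python
from statistics import mean, pstdev

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: three boxes sampled for the same query image.
target = (10, 10, 50, 50)
samples = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120)]
rewards = [iou(s, target) for s in samples]      # assumed reward: IoU
advantages = group_relative_advantages(rewards)
```

Samples that localize better than their group average receive positive advantages and the others negative ones, so no separate value network is needed to provide a baseline.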
63. 【2605.31137】PolSAR Image Classification using a Hybrid Complex-Valued Network (HybridCVNet)
Link: https://arxiv.org/abs/2605.31137
Authors: Mohammed Q. Alkhatib
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision tasks, convolutional neural networks, convolutional neural, effectiveness in computer, Recently
Comments: Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)
Abstract:Recently, convolutional neural networks (CNNs) have become popular for image classification due to their effectiveness in computer vision tasks. Now, researchers are exploring the potential of vision transformers (ViTs) in remote sensing and Earth observation. However, traditional Real-Valued networks often overlook important phase information in Complex-Valued (CV) data like polarimetric synthetic aperture radar (PolSAR) data. To address this, new CV deep architectures have emerged. HybridCVNet, a novel hybrid network, blends CV-CNN and CV vision transformer (CV-ViT) techniques. It efficiently combines CV 3D and 2D CNNs as feature extractors, enhancing PolSAR image classification by extracting complementary information and effectively leveraging interdependencies within the data. Experimental results from widely-used PolSAR datasets show HybridCVNet outperforms other methods, achieving an overall accuracy of 97.39% on the Flevoland dataset and showing promise even with just a 1% sampling ratio, with a Kappa value of 0.972 on the San Francisco dataset. Source code is accessible through this https URL
64. 【2605.31124】QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
Link: https://arxiv.org/abs/2605.31124
Authors: Zhizhen Pan, Hesong Wang, Huan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Geometry Grounded, single forward pass, Geometry Grounded Transformer, predicts camera parameters, Grounded Transformer
Comments: Accepted by CVPR 2026. Project page: [this https URL](https://ddsacu.github.io/QVGGT/)
Abstract:Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3x to 4.9x memory reduction and up to 2.8x real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
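The selective mixed-precision idea can be sketched as follows. This is only an illustration under stated assumptions: the sensitivity scores, the choice of 8 bits for fragile blocks, the value of `k`, and the equal block sizes are all hypothetical, and the abstract does not give the paper's actual allocation rule.

```python
def allocate_bits(sensitivity, k, low_bits=4, high_bits=8):
    """Give high_bits to the k most sensitive blocks and low_bits to the rest."""
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    fragile = set(ranked[:k])
    return [high_bits if i in fragile else low_bits
            for i in range(len(sensitivity))]

def weight_memory_reduction(bits, params_per_block, baseline_bits=32):
    """Ratio of baseline weight storage to quantized weight storage."""
    baseline = baseline_bits * sum(params_per_block)
    quantized = sum(b * p for b, p in zip(bits, params_per_block))
    return baseline / quantized

# Hypothetical per-block sensitivity scores (higher = more fragile).
sensitivity = [0.02, 0.31, 0.05, 0.90, 0.04, 0.07]
params = [50_000_000] * 6          # assumed equal-sized blocks
bits = allocate_bits(sensitivity, k=2)
reduction = weight_memory_reduction(bits, params)
```

The ratio counts weight storage only. It ignores quantization scales, activations, and any layers kept in full precision, so it is an idealized figure rather than an end-to-end memory measurement.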
65. 【2605.31116】NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving
Link: https://arxiv.org/abs/2605.31116
Authors: Jiahui Li, Jiawei Sun, Zixiang Ren, Ming Liu, Jiamin Shi, Ruiteng Zhao, Zhiyang Liu, Liying Liu, Zuoguan Wang, Kaidi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: compressing dense image, dense image patch, downstream trajectory generation, image patch tokens, Recent perception-free
Abstract:Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim12. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.
66. 【2605.31115】Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Link: https://arxiv.org/abs/2605.31115
Authors: Hao Zheng, Hu Wang, Tiantian Zheng, Prajjwal Bhattarai, Tuka Alhanai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex bimanual activities, understanding complex bimanual, densely predicting actions, densely predicting, untrimmed videos
Comments: CVPR 2026
Abstract:Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at this https URL.
67. 【2605.31108】Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams
Link: https://arxiv.org/abs/2605.31108
Authors: Jonathan Swinnen, Tinne Tuytelaars
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: time to evolving, domain incremental learning, non-stationary data, incremental learning, adapting models
Abstract:In this work we introduce a novel approach to domain incremental learning, adapting models over time to evolving, non-stationary data. In contrast to other works, we do not attempt to avoid catastrophic forgetting, but rather allow it and exploit it. Our model combines a main task head with a self-supervised masked autoencoder (MAE) head. We then learn domain-specific LoRA adapters during incremental training. Each adapter specializes to its domain, naturally inducing forgetting on other domains in both heads. At inference, we perform online test-time training on the self-supervised MAE head to identify which LoRA best matches the current input, so the model can "remember" the domain again. Our scheme is especially well-suited to real-world streaming data, such as video, where consecutive samples are highly correlated and domain shifts are gradual. We demonstrate our method on domain-incremental action recognition and semantic segmentation tasks.
68. 【2605.31096】iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Link: https://arxiv.org/abs/2605.31096
Authors: Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: phase remains underexplored, multimodal large language, large language models, inference phase remains, enhance fine-grained perception
Comments: Accepted by ICML 2026
Abstract:While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.
69. 【2605.31094】Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation
Link: https://arxiv.org/abs/2605.31094
Authors: Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster, Mehdi Astaraki, Paula Tamara Buzduga, Kerstin Ritter, Benedikt Wiestler, Jan Kirschke, Jonathan Shapey, Tom Vercauteren, Florian Kofler
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: jointly evaluating instance, Panoptic Quality, standard for jointly, jointly evaluating, Quality
Comments: 9 pages, 4 figures
Abstract:The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.
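For readers unfamiliar with the metric, the standard One-to-One PQ can be sketched in a few lines: PQ is the sum of IoUs over matched pairs divided by |TP| + 0.5·|FP| + 0.5·|FN|. The sketch assumes non-overlapping segments on each side and a threshold of at least 0.5, in which case every segment has at most one partner above the threshold and plain filtering yields the matching. The function name and input layout are illustrative and are not the API of the Panoptica package.

```python
def panoptic_quality(pair_iou, n_gt, n_pred, threshold=0.5):
    """PQ for one class under One-to-One matching.

    pair_iou maps (gt_id, pred_id) -> IoU for overlapping segment pairs.
    Assumes threshold >= 0.5 and non-overlapping segments, so each segment
    can appear in at most one pair whose IoU exceeds the threshold.
    """
    matches = {pair: v for pair, v in pair_iou.items() if v > threshold}
    tp = len(matches)
    fp = n_pred - tp          # predicted segments left unmatched
    fn = n_gt - tp            # ground-truth segments left unmatched
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matches.values()) / denom if denom else 0.0

# Hypothetical example: 3 ground-truth and 3 predicted segments.
pairs = {(0, 0): 0.8, (1, 1): 0.6, (2, 1): 0.3, (2, 2): 0.4}
pq = panoptic_quality(pairs, n_gt=3, n_pred=3)
```

Here two pairs clear the threshold, leaving one unmatched prediction and one unmatched ground-truth segment. Lowering the threshold below 0.5 breaks the uniqueness assumption in the docstring, which is the regime where the alternative matching strategies discussed in the abstract become relevant.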
70. 【2605.31093】Cross-Modal Clinical Knowledge Integration for Mammography Report Generation
Link: https://arxiv.org/abs/2605.31093
Authors: Jiayi Zhu, Fuxiang Huang, Yu Xie, Xi Wang, Zhixuan Chen, Yuan Guo, Qingcong Kong, Zhenhui Li, Qiong Luo, Hao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: global health concern, major global health, Breast cancer, mammography screening plays, health concern
Comments: 16 pages, 5 figures
Abstract:Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.
71. 【2605.31090】On Revisiting Entropy for Identifying Mislabeled Images
Link: https://arxiv.org/abs/2605.31090
Authors: Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: memorize erroneous labels, overparameterized models tend, datasets severely degrade, erroneous labels, severely degrade
Comments: ICML 2026
Abstract:Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at this https URL.
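The observation behind this method can be illustrated with a small sketch that tracks the Shannon entropy of a sample's predicted class probabilities across epochs. The abstract does not define SEI, so the summary statistic below is a plain trapezoidal area under the entropy curve chosen for illustration; it is not the paper's signed entropy integral, and the probability histories are made up.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_area(prob_history):
    """Trapezoidal area under the per-epoch entropy curve of one sample."""
    h = [entropy(p) for p in prob_history]
    return sum((h[i] + h[i + 1]) / 2 for i in range(len(h) - 1))

# Hypothetical softmax outputs for one sample over four epochs.
clean = [[0.4, 0.3, 0.3], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05], [0.98, 0.01, 0.01]]
noisy = [[0.4, 0.3, 0.3], [0.4, 0.35, 0.25], [0.45, 0.3, 0.25], [0.4, 0.3, 0.3]]

clean_score = entropy_area(clean)   # entropy falls steadily, small area
noisy_score = entropy_area(noisy)   # entropy stays high, large area
```

A sample whose entropy never drops accumulates a larger area, so ranking samples by such a score is one simple way to surface candidates for relabeling.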
72. 【2605.31080】A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
Link: https://arxiv.org/abs/2605.31080
Authors: Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: Blind and low-vision, visual art descriptions, on-premise vision-language models, audiences remain underserved, vision-language models
Comments: 7 pages, 2 figures, 3 tables. Preprint
Abstract:Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.
73. 【2605.31075】Task-Focused Memorization for Multimodal Agents
Link: https://arxiv.org/abs/2605.31075
Authors: Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: accumulate world knowledge, build coherent experience, achieve continual learning, Long-term memory, coherent experience
Abstract:Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
74. 【2605.31069】Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
Link: https://arxiv.org/abs/2605.31069
Authors: Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Accurately predicting future, Accurately predicting, Large Language Models, long-video event prediction, predicting future events
Abstract:Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.
75. 【2605.31068】HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning
Link: https://arxiv.org/abs/2605.31068
Authors: Md Aminur Hossain, Ayush V. Patel, Sanjay K. Singh, Biplab Banerjee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: hybrid quantum-classical joint-embedding, quantum-classical joint-embedding predictive, joint-embedding predictive architecture, hybrid quantum-classical, quantum-classical joint-embedding
Comments: 19 pages
Abstract:We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA combines four complementary objectives: latent token prediction, cross-modal token alignment, SIGReg-based Gaussian regularization in the fused latent space, and a differentiable SWAP-test-based Fidelity Quantum Similarity (FQS) loss. Unlike pixel reconstruction methods, HQ-JEPA learns semantic representations directly in latent space and uses quantum state-overlap-based similarity as an additional regularization signal. We evaluate the pretrained encoder on GeoBench classification and segmentation tasks under linear probing and fine-tuning settings. Results show that HQ-JEPA achieves competitive and often superior performance over strong self-supervised and remote sensing foundation-model baselines, demonstrating the benefit of integrating predictive self-supervision, cross-modal geometric regularization, and quantum fidelity-based representation learning for remote sensing applications.
76. 【2605.31057】LVSA: Training-Free Sparse Attention for Long Video Diffusion
Link: https://arxiv.org/abs/2605.31057
Authors: Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: cost grows quadratically, long-video diffusion inference, Long Video Sparse, Wan, cost grows
Comments: 10 pages, 5 figures, 4 tables. Code: [this https URL](https://github.com/JiusiServe/LongVideoSparseAttention)
Abstract:Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out-of-memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduce VQeval, a tool properly scoring loopy video failures, which instead are rewarded in state of the art evaluators like VBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.
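To make the idea of a structured window combined with rotating global anchors concrete, here is a minimal sketch of a block-level attention mask. It is not the LVSA pattern itself: the anchor placement rule (every `anchor_stride`-th block, shifted by a step index), the symmetric window, and all names are assumptions for illustration only.

```python
def anchor_blocks(n_blocks, anchor_stride, step):
    """Key blocks that every query block may attend to at this step.

    The offset rotates with `step`, so no fixed grid of blocks is
    always favored (assumed rotation rule).
    """
    if anchor_stride <= 0:
        raise ValueError("anchor_stride must be positive")
    offset = step % anchor_stride
    return set(range(offset, n_blocks, anchor_stride))


def build_block_mask(n_blocks, window, anchor_stride, step):
    """Return mask[q][k] == True where query block q attends to key block k."""
    anchors = anchor_blocks(n_blocks, anchor_stride, step)
    return [
        [abs(q - k) <= window or k in anchors for k in range(n_blocks)]
        for q in range(n_blocks)
    ]


def density(mask):
    """Fraction of block pairs that are actually computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

A block-sparse kernel would skip every block pair where the mask is `False`; the density gives a rough upper bound on the attention compute that remains relative to dense attention.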
77. 【2605.31048】Rethinking Efficient Crack Segmentation with Task-Aligned Structural-Directional Modeling
Link: https://arxiv.org/abs/2605.31048
Authors: Shipeng Liu,Liang Zhao,Dengfeng Chen,Weihua Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: auxiliary enhancement branches, Recent crack segmentation, Recent crack, enhancement branches, methods often follow
Comments:
Abstract:Recent crack segmentation methods often follow generic semantic segmentation designs, using stronger backbones, hybrid CNN-Transformer-Mamba encoders, and auxiliary enhancement branches. Although effective, this raises whether stronger generic feature mixing is the most suitable direction for crack segmentation. We instead formulate crack segmentation as sparse structural recovery. Cracks have limited category-level semantics but strong morphological regularities, being thin, sparse, anisotropic, locally fragmented, and easily confused with textures or shadows. Thus, the key bottleneck lies in preserving weak structural evidence, recovering directional continuity, and suppressing background coupling. We propose RIFT, a compact family of morphology-aligned crack segmentation models. Rather than compressing a complex generic architecture, RIFT is simple by design, preserving local evidence, aggregating cooperative directional continuity, and restoring crack structures through lightweight multi-scale fusion. Experiments on four public benchmarks show that RIFT achieves the best or tied-best results across the 16 main metrics against reproduced representative baselines. RIFT-B gives the strongest overall accuracy, while RIFT-T provides the best deployment efficiency with only 0.47M parameters and high inference speed. Topology-aware evaluation, ablations, transfer experiments, and visualizations further verify that task-aligned simplicity can match or surpass complex hybrid architectures when its inductive bias fits crack morphology. Code: this https URL
78. 【2605.31041】Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?
Link: https://arxiv.org/abs/2605.31041
Authors: Jingtao He,Hongliang Lu,Xiaoyun Qiu,Yixuan Wang,Xinhu Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: demonstrated promising capability, unified multimodal architectures, jointly modeling perception, highlighting the potential, perception and planning
Comments:
Abstract:Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. In this work, we introduce a structured multi-level visual perturbation framework to analyze visual-behavior dependency in VLA-based driving models systematically. The framework organizes controlled visual perturbations along three complementary dimensions: channel-level degradation, information-level disruption, and structure-level modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.
79. 【2605.31039】GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
Link: https://arxiv.org/abs/2605.31039
Authors: Xiangtao Kong,Jixin Zhao,Lingchen Sun,Rongyuan Wu,Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Real-world, Generative Ground Truth, Real-world image restoration, models, LQ-HQ paired dataset
Comments:
Abstract:Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.
80. 【2605.31033】SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation
Link: https://arxiv.org/abs/2605.31033
Authors: Weijia Dou,Hui Li,Jiahao Cui,Lei Zhou,Jingdong Wang,Siyu Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generation models typically, models typically rely, organizes historical context, video generation models, chunk segments
Comments:
Abstract:Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic inconsistency when entities exit the frame or during interactive prompt transitions. To address these limitations, we propose SlotMemory, an object-centric Key-Value memory mechanism for streaming video diffusion. Our approach shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete, reusable semantic slots. By utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, we enable entity-level persistence and prompt-aware retrieval across long horizons. Evaluated on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone, SlotMemory achieves a state-of-the-art quality score of 81.61 and a 22.8 percent relative improvement in dynamic consistency over the strongest existing streaming baseline. Our results demonstrate that structured semantic representation, rather than raw temporal capacity, is the essential primitive for persistent long-form video synthesis. Our codes and checkpoints are available at this https URL.
81. 【2605.31029】PEEK: Picking Essential frames via Efficient Knowledge distillation
Link: https://arxiv.org/abs/2605.31029
Authors: Killian Steunou,Anas Filali Razzouki,Khalil Guetari,Mounîm A. El-Yacoubi,Yannis Tevissen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: making frame selection, Video-language models, limited number, selection a key, key bottleneck
Comments: Supplementary material at [this https URL](https://www.killian-steunou.com/peek/static/pdfs/peek_supplementary.pdf)
Abstract:Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at this https URL.
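The contrast between uniform sampling and score-based frame selection can be sketched in a few lines. This is only an illustration of the selection step under a fixed frame budget: the per-frame relevance scores are taken as given (in the paper they would come from the distilled temporal model), and both function names are hypothetical.

```python
def uniform_indices(n_frames, k):
    """Content-agnostic baseline: the center frame of k equal segments."""
    if not 0 < k <= n_frames:
        raise ValueError("k must be in 1..n_frames")
    return [int((i + 0.5) * n_frames / k) for i in range(k)]


def topk_indices(scores, k):
    """Pick the k highest-scoring frames, returned in temporal order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be in 1..len(scores)")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Returning the selected indices in temporal order keeps the frames in their original sequence when they are handed to a downstream captioning model.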
82. 【2605.31001】Iterative Framework For Data Augmentation Of Segmented Fingerprints
Link: https://arxiv.org/abs/2605.31001
Authors: João Leonardo H. D. Agnol,Wesley Augusto de Bona,Erick Oliveira Rodrigues,Luiz Fernando Puttow Southier,Jefferson Oliva,Marcelo Filipak,Dalcimar Casanova
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: biometrics presents unique, presents unique challenges, unique challenges due, Infant biometrics presents, robust matching systems
Comments:
Abstract:Infant biometrics presents unique challenges due to the physiological differences between infants and adults, compounded by the scarcity of available data for research that limits the development of robust matching systems. This paper proposes a novel data augmentation method that uses iterative techniques to generate diverse variants of segmented fingerprints by inducing errors in a convolutional neural network trained to extract fingerprint ridges and valleys. Experiments on real infant fingerprints demonstrate the method's effectiveness in expanding fingerprint variability, with augmentations exhibiting significant fluctuations in minutiae counts while still retaining visual similarity to the originals. The study also highlights the method's customizable nature for applying varying levels of changes to fingerprint segmentations. Future research includes training segmentation and matching neural networks using datasets augmented by the proposed framework.
83. 【2605.30991】Parallel Tempering Initial Sampling in Inference-Time Reward Alignment
Link: https://arxiv.org/abs/2605.30991
Authors: Myeongjun Oh,Gwangho Kim,Sungyoon Lee
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Sequential Monte Carlo, steers pretrained diffusion, flow-based generative models, satisfy user-specified rewards, Inference-time reward alignment
Comments: 31 pages, 11 figures
Abstract:Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this task by iteratively filtering and propagating multiple particles. However, we show that standard SMC-based methods often suffer from poor performance because they initialize particles from a standard prior, whereas high-reward regions in complex reward landscapes are extremely rare. Further, we show that even recent reward-aware initial sampling approaches remain vulnerable to getting trapped in local modes, as complex reward landscapes are often multi-modal. To overcome these limitations, we propose PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method that couples multiple sampling chains through parallel tempering. PATHS maintains a ladder of reward-tempered chains and periodically performs Metropolis swaps, enabling efficient exploration across flattened reward landscapes, thereby mitigating the mode-trapping issues. Our analysis reveals that this mechanism substantially enhances the finite-budget exploration of rare, high-reward regions that are typically challenging to sample. Experiments on layout-to-image and quantity-aware generation show that PATHS achieves consistent gains in alignment quality, particularly on complex prompts.
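The Metropolis swap at the heart of parallel tempering can be written down generically. The sketch below assumes each chain i targets a density proportional to prior(x) * exp(beta_i * r(x)), which is the textbook reward-tempered form and not necessarily the exact formulation used in PATHS; the function names and the adjacent-rung swap schedule are likewise assumptions.

```python
import math
import random


def swap_accept_prob(beta_i, beta_j, r_i, r_j):
    """Metropolis acceptance probability for exchanging two chain states.

    With targets prior(x) * exp(beta * r(x)), the prior terms cancel and
    the ratio reduces to exp((beta_i - beta_j) * (r_j - r_i)).
    """
    log_ratio = (beta_i - beta_j) * (r_j - r_i)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def swap_sweep(states, rewards, betas, rng):
    """Attempt one swap between each pair of adjacent rungs, in place."""
    for i in range(len(betas) - 1):
        p = swap_accept_prob(betas[i], betas[i + 1], rewards[i], rewards[i + 1])
        if rng.random() < p:
            states[i], states[i + 1] = states[i + 1], states[i]
            rewards[i], rewards[i + 1] = rewards[i + 1], rewards[i]
```

A high-reward state found by a flattened chain (small beta) is always accepted by a sharper chain (large beta), which is how exploration at the flattened rungs feeds the sharper ones; the reverse move is accepted only with exponentially small probability.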
84. 【2605.30987】Benchmarking Single-Step Inpainting Methods for Multi-Object 3D Gaussian Splatting Scenes
Link: https://arxiv.org/abs/2605.30987
Authors: Finn Dröge,Cecilia Curreli,Abhishek Saroha,Daniel Cremers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, scenes face challenges, face challenges, Gaussian, Splatting
Comments: Accepted as an extended abstract to the CVEU Workshop at CVPR 2026
Abstract:The tasks of object removal and inpainting 3D Gaussian Splatting (3DGS) scenes face challenges such as 3D consistency across camera views. In comparing 2D inpainters and their suitability for the 3D domain, we find that reconstruction-based inpainters outperform generative diffusion models in 3D consistency. Integrating these 2D inpainters into different single-step methods for creating and finetuning 3DGS scenes, our results indicate that initializing the scene from scratch produces higher quality results than finetuning the existing scene. Using a state-of-the-art generative 2D inpainter, we create a straightforward baseline to underline the importance of object removal before inpainting in the 3D setting. Since 360° datasets rarely include real-world ground truths, and challenging occlusion scenarios are equally sparse, we introduce a novel multi-object scene with recorded ground truth data and many views with object occlusions.
85. 【2605.30984】Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation
Link: https://arxiv.org/abs/2605.30984
Authors: Tom Maye-Lasserre,Yitong Li,Bailiang Jian,Morteza Ghahremani,Benedikt Wiestler,Christian Wachinger
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: exhibit critically low, critically low pathology, generate fluent radiology-style, fluent radiology-style text, Template Collapse
Comments:
Abstract:Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.
86. 【2605.30983】Can BEV Perception Gracefully Degrade under Sensor Failures?
Link: https://arxiv.org/abs/2605.30983
Authors: Haifa Zhang,Yijing Wang,Haoyu Wang,Zheng Li,Zhiqiang Zuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: current systems exhibit, existing fusion mechanisms, perception in autonomous, autonomous driving, current systems
Comments:
Abstract:Despite the remarkable success of multi-modal bird's-eye view (BEV) perception in autonomous driving, current systems exhibit a critical vulnerability: existing fusion mechanisms are highly brittle to sensor corruptions, often causing catastrophic performance degradation. This vulnerability largely stems from the fact that standard fusion frameworks typically integrate multi-modal representations in a static manner, leading to a precipitous performance collapse under missing or corrupted modalities. In contrast, we show that graceful degradation is achievable through active modality reliability assessment. To this end, we present Grace-BEV, a lightweight and plug-and-play framework that enforces active reliability awareness during multi-modal fusion. Instead of relying on computationally expensive cross-modal interactions, Grace-BEV leverages the aligned BEV space to explicitly assess modality trustworthiness via a TrustGate Router and dynamically recalibrate feature integration using the FailSafe Fusion Block. Furthermore, we devise a Three-Phase Training strategy with Modality Dropout to prevent modality dominance and encourage balanced cross-modal learning under unreliable inputs. Extensive experiments on nuScenes-R and nuScenes-C show that Grace-BEV maintains robust performance across diverse corruption settings. Notably, under catastrophic LiDAR failures where standard baselines collapse to 0.0% mean Average Precision (mAP), Grace-BEV restores performance to as high as 34.7% mAP. Moreover, it improves clean accuracy by up to 1.4%, achieving a strong trade-off between robustness and efficiency.
87. 【2605.30972】BiSegMamba: Efficient Bidirectional Tri-Oriented Mamba for 3D Medical Image Segmentation
Link: https://arxiv.org/abs/2605.30972
Authors: Bakht Zada,Chao Tong,Qile Su,Shuai Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fine boundary preservation, medical image segmentation, boundary preservation, image segmentation requires, medical image
Comments: 10 pages, 7 figures, 5 tables. Code is available at: [this https URL](https://github.com/bakhtzadaabshare/BiSegMamba)
Abstract:Accurate 3D medical image segmentation requires both long-range volumetric context and fine boundary preservation. CNN-based methods have limited global dependency modeling, while Transformer-based models are often computationally expensive for dense 3D inputs. Recent Mamba-based methods provide an efficient alternative, but existing volumetric designs still depend on repeated high-resolution scanning, forward-only sequential modeling, and fixed directional summation, causing high cost, scan-order bias, and suboptimal directional aggregation. We propose BiSegMamba, an efficient bidirectional tri-oriented Mamba network for 3D medical image segmentation. BiSegMamba follows a compact-to-detail design, where a progressive compacting stem (PCS) enables efficient latent-space reasoning while retaining shallow high-resolution features for reconstruction. A multi-scale spatial mixer (MSSM) captures local anatomical patterns in early stages, and the proposed bidirectional tri-oriented Ortho Mamba (Bi-ToOM) block models long-range dependencies from multiple orthogonal views using jointly processed forward and backward scan sequences. Adaptive directional fusion (ADF) learns input-dependent channel-wise weights across scan orientations, replacing fixed summation with orientation-aware fusion. Experiments on a collected carotid CTA dataset and three public benchmarks, BraTS2023, ACDC, and AMOS-CT, show that BiSegMamba generalizes well across vascular, cardiac, brain tumor, and abdominal multi-organ segmentation tasks. Compared with SegMamba-V2, BiSegMamba achieves slightly better performance on BraTS2023 and clear improvements on ACDC and the carotid dataset, while reducing computational cost by up to 77.9% FLOPs, demonstrating a strong accuracy-efficiency balance for general 3D medical image segmentation.
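The difference between fixed summation and input-dependent, channel-wise fusion across scan orientations can be shown with a small sketch. It is not the ADF module itself: the per-orientation logits are taken as given (in a real network a small learned head would produce them from the input), a softmax over orientations is assumed as the normalization, and the names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(features, logits):
    """Fuse features[o][c] over orientations o with channel-wise weights.

    logits[o][c] scores orientation o for channel c; each channel gets
    its own softmax, so the weights for a channel sum to one.
    """
    n_orient, n_ch = len(features), len(features[0])
    fused = []
    for c in range(n_ch):
        w = softmax([logits[o][c] for o in range(n_orient)])
        fused.append(sum(w[o] * features[o][c] for o in range(n_orient)))
    return fused
```

With equal logits every orientation contributes equally, which matches fixed summation up to a constant factor; unequal logits let a single orientation dominate a given channel.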
88. 【2605.30969】Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning
Link: https://arxiv.org/abs/2605.30969
Authors: Zhenwu Shi,Jingyu Gong,Peiwei Wang,Xingzan Wang,Tianwen Qian,Wenxi Li,Yuan Fang,Jiao Xie,Lizhuang Ma,Shaohui Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Text-based human motion, natural language instructions, Text-based human, modify existing motion, existing motion sequences
Comments:
Abstract:Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e., precisely editing target regions) and invariance (i.e., preserving unedited parts). To handle this challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers, (2) a motion preservation mechanism that focuses on subtle variations according to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that OmniME achieves state-of-the-art performance in editing alignment, validating the effectiveness of our unified learning framework. Our source codes and models have been released at: this https URL
89. 【2605.30968】Variational Adapter for Cross-modal Similarity Representation
Link: https://arxiv.org/abs/2605.30968
Authors: WenZhang Wei,Zhipeng Gui,Dehua Peng,Tiandi Ye,Huayi Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: vision-language models lies, core of vision-language, vision-language models, models lies, lies in measuring
Comments: Accepted by the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.
90. 【2605.30942】PRISM: Progressive Reasoning through Iterative Slot Memory for Vision
Link: https://arxiv.org/abs/2605.30942
Authors: Ziyu Wang,Shuangpeng Han,Mengmi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single feed-forward pass, Modern vision models, Modern vision, feed-forward pass, recover missing evidence
Comments:
Abstract:Modern vision models process images in a single feed-forward pass, which limits their ability to recover missing evidence or refine uncertain representations under incomplete observations. Inspired by the iterative nature of human perception, we introduce PRISM (Progressive Reasoning through Iterative Slot Memory), a pyramid vision architecture that reasons over images through iterative refinement. At a high level, PRISM groups visual features into object-centric representations, retrieves relevant patterns from a learned memory, and iteratively refines the representation to resolve ambiguity and recover missing information. This organize-recall-refine process operates recurrently across multiple scales, enabling progressive improvement of visual representations. Across standard vision tasks, including image classification, object detection, and semantic segmentation, PRISM achieves competitive performance while demonstrating improved robustness under incomplete observations such as occlusion. These results suggest that iterative reasoning with structured representations and memory is a promising direction for building more resilient and adaptive vision models. Source code and models will be released.
91. 【2605.30939】IAF-Net: Illumination-Adaptive Fusion for Low-Light Urban Road Segmentation
Link: https://arxiv.org/abs/2605.30939
Authors: Bingtao Wang,Daojie Peng,Fulong Ma,Jun Ma,Liang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semantic road segmentation, Semantic road, methods suffer severe, road segmentation, suffer severe performance
Comments:
Abstract:Semantic road segmentation is important for autonomous driving, but existing methods suffer severe performance degradation under low-light conditions. Many existing multi-modal fusion methods do not explicitly adapt to illumination-dependent changes in modality reliability, which can propagate degraded RGB features into the fused representation at night. We propose IAF-Net (Illumination-Adaptive Fusion Network), an end-to-end framework with illumination-adaptive fusion for robust road segmentation across different lighting conditions. It dynamically adjusts fusion weights of RGB and geometric features via the core Illumination-Adaptive Fusion (IAF) module, and enhances low-light feature selection with a brightness-modulated attention decoder. We also construct two dedicated datasets: nuScenes Nighttime Road Segmentation (nuScenes-NRS) and CARLA Multi-Weather Road Segmentation (CARLA-MWRS). Experiments on nuScenes-NRS show state-of-the-art overall performance among the compared methods, while CARLA-MWRS further validates robustness across adverse weather conditions. Ablation studies on a 40% training subset further highlight the importance of the IAF module, which provides the largest individual gain of 0.70% in MaxF.
92. 【2605.30925】MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance
Link: https://arxiv.org/abs/2605.30925
Authors: Nathan Sala,Ofir Abramovich,Ariel Shamir,Daniel Cohen-Or,Andreas Aristidou,Sigal Raab
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: generation has progressed, recent years, offering an expressive, human-computer interaction, progressed rapidly
Comments: Accepted to SIGGRAPH 2026 conference. Project page: [this https URL](https://natsala13.github.io/multiact.github.io)
Abstract:Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: this https URL.
93. 【2605.30917】Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search
Link: https://arxiv.org/abs/2605.30917
Authors: Gyu-Hwung Cho(1 and 2),Youngjune Lee(1),Kiyoon Jeong(1),Siyoung Lee(1),Sanggyu Han(1),Hervé Dejean(3),Stéphane Clinchant(3),Seung-won Hwang(2) ((1) NAVER Corp., Republic of Korea, (2) Seoul National University, Republic of Korea, (3) Naver Labs Europe, France)
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: gained increasing attention, enterprise PDFs continue, large-scale visual-document corpora, lexically indexes visual, neural query encoding
Comments: 12 pages, 5 figures, 12 tables, preprint
Abstract:As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at this https URL.
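What "inference-free" means at serving time can be illustrated with a toy inverted index: documents carry precomputed sparse vocabulary weights, and a query is only tokenized, never passed through a neural encoder. This is a sketch under assumptions, not V-SPLADE itself: the whitespace tokenizer, the binary query weights, and the example document weights are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(doc_weights):
    """Map token -> [(doc_id, weight)] from offline sparse document vectors."""
    index = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for token, w in weights.items():
            if w > 0.0:
                index[token].append((doc_id, w))
    return index


def search(index, query, top_k=5):
    """Score documents by summing stored weights of the query's tokens.

    The query side needs only tokenization (assumed: lowercase and split
    on whitespace), so no model runs at serving time.
    """
    scores = defaultdict(float)
    for token in set(query.lower().split()):
        for doc_id, w in index.get(token, []):
            scores[doc_id] += w
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

In a learned sparse retriever the document weights would be produced once, offline, by the document encoder; the serving path then reduces to posting-list lookups, which is what lets it run on standard lexical search infrastructure.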
94. 【2605.30915】DiTTo: Scalable Order-aware All-in-One Image Restoration Agent
Link: https://arxiv.org/abs/2605.30915
Authors: Seungho Choi,Jihyong Oh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Real-world images rarely, removed substantially affects, vision-language model schedules, images rarely suffer, Restoration-action Trajectory Dataset
Comments: Please visit our project page at [this https URL](https://cmlab-korea.github.io/DiTTo/)
Abstract:Real-world images rarely suffer from a single degradation, and the order in which degradations are removed substantially affects the final restoration quality, motivating agent-based image restoration (IR), where a vision-language model schedules a pool of pre-built restoration-experts. However, existing training-based agents require $\mathcal{O}((N^{\mathbf{D}})^{2})$ restoration-expert calls per image to construct the Optimal Restoration-action Trajectory Dataset (ORTD), where $N^{\mathbf{D}}$ denotes the number of degradation types in the universe $\mathbf{D}$, and couple agent training to a fixed restoration-expert pool, preventing extension to newly introduced restoration-experts without full retraining. To overcome these efficiency and extensibility bottlenecks, we propose DiTTo, a novel order-aware image restoration agent framework consisting of the DiTTo Simulator and the DiTTo Agent. The DiTTo Simulator combines $\cup$S-IR for single-step restoration-action simulation and AiO-IQA for per-action quality prediction, reducing ORTD construction to $\mathcal{O}(N^{\mathbf{D}})$ simulator calls per image; the DiTTo Agent is trained by SFT on the simulator-generated ORTD, followed by Order-aware Restoration Alignment (ORA) that aligns degradation identification, restoration-action-ordering, and output format along independent axes. This enables plug-and-play scalable extensibility: adding a new restoration-expert requires updating only the lightweight ORA stage. On the MiO-100 evaluation set with up to five concurrent degradations, our DiTTo Agent achieves state-of-the-art multi-degradation restoration quality among previous agent-based IR methods.
95. 【2605.30912】Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR
Link: https://arxiv.org/abs/2605.30912
Authors: Ruina Hu,Chen Wang,Lai Wei,Jionghao Bai,Bin Yu,Weiran Huang,Kai Wang,Yue Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Reinforcement learning, improves vision-language models, outcome rewards derived, optimizing outcome rewards, improves vision-language
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
96. 【2605.30911】What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness
Link: https://arxiv.org/abs/2605.30911
Authors: Yusheng He,Jizhe Zhou,Xia Du,Zheng Lin,Jun Luo,Jiancheng Lv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, key challenges undermining, reliability of Large, Large Vision-Language, key challenges
Comments:
Abstract:Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations; and 5) cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.
97. 【2605.30904】MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging
Link: https://arxiv.org/abs/2605.30904
Authors: Luyuan Zhang,Siyuan Li,Zedong Wang,Qingsong Xie,Cheng Tan,Anna Wang,Yanhao Zhang,Chen Chen,Haonan Lu,Haoqian Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: VAEs offer high-fidelity, continuous VAEs offer, offer high-fidelity reconstruction, VQ-based models enable, discrete VQ-based models
Comments: 11 pages (main text), 7 figures. Preprint. Under review at NeurIPS 2026
Abstract:Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically-organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.
98. 【2605.30894】SteerFace: Debiasing Synthetic Face Generation via Adaptive Residue Perturbation
Link: https://arxiv.org/abs/2605.30894
Authors: Yuxi Mi,Qiuyang Yuan,Jianqing Xu,Yichun Zhou,Xuan Zhao,Jun Wang,Rizen Guo,Shuigeng Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sparked growing interest, legally compliant data, shortage of legally, legally compliant, sparked growing
Comments:
Abstract:The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
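The perturbation step described in the abstract above can be pictured with a small sketch: draw a random direction, remove its component along the unit-norm identity embedding so that it is orthogonal, then rotate the embedding toward that direction by an angle. The fixed angle `theta` stands in for the perturbation strength (which the abstract says is learned adaptively), and every name here is an illustrative assumption rather than SteerFace's implementation.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(embedding, theta, rng):
    """Rotate a unit embedding by angle theta toward a random orthogonal direction."""
    e = normalize(embedding)
    g = [rng.gauss(0.0, 1.0) for _ in e]
    proj = dot(g, e)
    # Gram-Schmidt step: u is a unit vector orthogonal to e.
    u = normalize([gi - proj * ei for gi, ei in zip(g, e)])
    # Moving along the great circle through e and u keeps the result on the hypersphere.
    return [math.cos(theta) * ei + math.sin(theta) * ui for ei, ui in zip(e, u)]


rng = random.Random(0)
identity = normalize([rng.gauss(0.0, 1.0) for _ in range(8)])  # toy 8-D embedding
theta = math.pi / 6  # assumed fixed strength for illustration
perturbed = steer(identity, theta, rng)
```

Because the steering direction is orthogonal to the embedding, the cosine similarity between the original and perturbed embeddings equals `cos(theta)`, so the angle directly controls how much of the identity direction is retained.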
99. 【2605.30893】Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation
Link: https://arxiv.org/abs/2605.30893
Authors: Qi Chen,Shuhan Ding,Yu Gu,Nan Liu,Jiang Bian,Alan Yuille,Zongwei Zhou,Jingjing Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: compress high resolution, clinically relevant structure, preserving clinically relevant, Variational autoencoders, Foundation VAE
Comments: ICML 2026 Accepted
Abstract:Variational autoencoders (VAEs) compress high resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic tumor and lung tumor. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at this https URL.
100. 【2605.30884】GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
Link: https://arxiv.org/abs/2605.30884
Authors: Junlong Li,Chao Hao,Lap-Pui Chau,Yi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: agentic reinforcement learning, Existing agentic reinforcement, reinforcement learning methods, GUI grounding, agentic reinforcement
Comments:
Abstract:Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.
101. 【2605.30863】DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction
Link: https://arxiv.org/abs/2605.30863
Authors: Youngtae Han,Sung-hwan Han,Youngmin Yi
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: next-generation visual intelligence, visual intelligence applications, virtual reality, digital twins, view synthesis
Comments: 23 pages, 9 figures, 7 tables
Abstract:Dynamic scene reconstruction and novel view synthesis are fundamental to next-generation visual intelligence applications such as virtual reality, robotics, and digital twins. However, high-fidelity reconstruction of complex, time-varying scenes from arbitrary viewpoints remains a significant challenge. Existing dynamic 3DGS methods suffer from computational inefficiency, since they model all Gaussians as dynamic components. While recent decomposition-based approaches address this issue, they still struggle with degraded reconstruction quality and prolonged training time. To mitigate these limitations, we propose a novel dynamic reconstruction framework built upon an efficient static-dynamic decomposition strategy using a Feed-Forward Gaussian Splatting encoder and an optical flow model. By eliminating redundant computations on static regions, our method achieves state-of-the-art performance, outperforming existing baselines across rendering quality, training and rendering speed, and storage efficiency. Notably, on the Neural 3D dataset, our framework requires only 10 minutes for training and achieves a rendering speed of over 700 FPS on a single NVIDIA RTX 5090 GPU at a resolution of 1352x1014. Furthermore, our decomposition strategy eliminates the need for COLMAP preprocessing and enables deterministic initialization, thereby enhancing both efficiency and reproducibility.
102. 【2605.30855】Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation
Link: https://arxiv.org/abs/2605.30855
Authors: Hanlin Chen,Jiaxin Wei,Xibin Song,Yifu Wang,Steve Wang,Hongdong Li,Pan Ji,Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interactive world simulation, Frame-wise action-controlled, world simulation, promising paradigm, paradigm for interactive
Comments:
Abstract:Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent-RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training-inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. To address these challenges, we present Robust Dreamer, a memory-augmented framework built around how to design 3D memory and how to use it robustly. First, we introduce Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them via latent-space Gaussian splatting. This provides dense, geometry-aware, view-aligned conditioning while avoiding accumulated degradation from repeated VAE conversion. Second, we propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches internal correction before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.
103. 【2605.30846】Count Anything
Link: https://arxiv.org/abs/2605.30846
Authors: Mengqi Lei,Shuokun Cheng,Wei Bao,Shaoyi Du,Jun-Hai Yong,Siqi Li,Yue Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: counting remains fragmented, Object counting remains, Object counting, text-guided object counting, remains fragmented
Comments:
Abstract:Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: this https URL.
104. 【2605.30829】LegSegNet: A Public Deep Learning System for Lower Extremity CT Tissue Segmentation and Quantification
Link: https://arxiv.org/abs/2605.30829
Authors: Yuwen Chen,Yaqian Chen,Roy Colglazier,Haoyu Dong,Hanxue Gu,Maciej A. Mazurowski,Kevin W. Southerland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: musculoskeletal disease monitoring, scale requires accurate, extremity computed tomography, Lower extremity computed, Lower extremity
Comments: 9 pages
Abstract:Lower extremity computed tomography (CT) contains clinically relevant information for body composition analysis, sarcopenia assessment, and musculoskeletal disease monitoring, but extracting these measurements at scale requires accurate tissue segmentation and an automated quantification workflow. Existing public segmentation tools are not designed for comprehensive lower extremity CT analysis, particularly for clinically important inter/intramuscular adipose tissue, and most public methods only provide mask prediction rather than an end-to-end quantification system. To address this problem, we present LegSegNet, a deep learning system for lower extremity CT tissue segmentation and body composition quantification. Given an input CT scan, LegSegNet segments bone, skeletal muscle, subcutaneous adipose tissue, and inter/intramuscular adipose tissue. It then computes quantitative tissue measurements for downstream analysis. We developed the segmentation model using 1,302 manually annotated CT slices and evaluated it on 900 held-out test slices, with all annotations reviewed by radiologists. We benchmark LegSegNet against a broad set of 2D segmentation methods, including CNN-based models, transformer-based models, and finetuned foundation models, and further evaluate its generalization on an external public CT dataset. LegSegNet achieves the best overall segmentation performance, with an average Dice score of 89.31 on the held-out test set. To our knowledge, LegSegNet is the first publicly available end-to-end system for lower extremity CT tissue segmentation and quantification, providing a practical evaluation tool for future computer vision research in medical image analysis. The code and model weights are available at: this https URL
105. 【2605.30819】Function2Scene: 3D Indoor Scene Layout from Functional Specifications
Link: https://arxiv.org/abs/2605.30819
Authors: Ruiqi Wang,Qimin Chen,Daniel Ritchie,Angel X. Chang,Manolis Savva,Kai Wang,Hao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: methods generate rooms, synthesis methods generate, object-centric prompts, methods generate, scene synthesis methods
Comments: Project page: [this https URL](https://function2scene.github.io/)
Abstract:Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.
106. 【2605.30794】MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding
Link: https://arxiv.org/abs/2605.30794
Authors: Qian Kou,Xiaofeng Shi,Yulin Li,Xiaosong Qiu,Xinyang Wang,Hua Zhou,Cao Dongxing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, visual question answering, Large Language
Comments: Accepted by ICML 2026
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understanding dataset, MechVQA, created through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3k high-density pictures with 21K question-answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLM understanding on real-world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain-specialized baseline. Extensive experimental results demonstrate that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.
107. 【2605.30784】Text-guided Feature Disentanglement for Cross-modal Gait Recognition
Link: https://arxiv.org/abs/2605.30784
Authors: Zhiyang Lu,Ming Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: identifies individuals based, Cross-modal Gait recognition, Gait Modality Text, Gait recognition, LiDAR-Camera Cross-modal Gait
Comments: Accepted by CVPR 2026
Abstract:Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top-k matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.
108. 【2605.30774】CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping
Link: https://arxiv.org/abs/2605.30774
Authors: Haoyu Zhao,Jiaxi Gu,Haoran Chen,Qingping Zheng,Yeying Jin,Hongyi Yang,Junqi Cheng,Yuang Zhang,Zenghui Lu,Huan Yu,Jie Jiang,Peng Shu,Zuxuan Wu,Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: maintaining geometric consistency, geometric consistency remains, Precise camera pose, Precise camera, remains a challenge
Comments: 28 pages, 16 figures
Abstract:Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flow-to-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness. The project page and code are available at: this https URL.
109. 【2605.30769】DisPlace: Discriminative Place Projections for Multi-Reference Visual Place Recognition
Link: https://arxiv.org/abs/2605.30769
Authors: Dhyey Manish Rajani,Michael Milford,Tobias Fischer
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Visual Place Recognition, matching query images, challenge in Visual, Place Recognition, Visual Place
Comments: Under review
Abstract:A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.
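To make the "generalized eigenvalue problem" in the abstract above concrete, here is a toy sketch in the style of Fisher discriminant analysis: it builds within-place and between-place scatter matrices from several reference descriptors per place and finds the direction that maximizes their ratio. The 2-D descriptors, the regularizer `reg`, and the power-iteration solver are assumptions chosen to keep the example dependency-free, and the paper's exact objective may differ.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def outer_add(matrix, v, scale=1.0):
    """Add scale * v v^T to a 2x2 matrix in place."""
    for i in range(2):
        for j in range(2):
            matrix[i][j] += scale * v[i] * v[j]


def scatter_matrices(places):
    """Return (within-place, between-place) scatter for 2-D descriptors."""
    all_points = [p for refs in places.values() for p in refs]
    global_mean = mean(all_points)
    s_w = [[0.0, 0.0], [0.0, 0.0]]
    s_b = [[0.0, 0.0], [0.0, 0.0]]
    for refs in places.values():
        mu = mean(refs)
        for p in refs:
            outer_add(s_w, [p[0] - mu[0], p[1] - mu[1]])
        outer_add(s_b, [mu[0] - global_mean[0], mu[1] - global_mean[1]], len(refs))
    return s_w, s_b


def top_generalized_eigvec(s_w, s_b, reg=1e-3, iters=50):
    """Power iteration on (S_w + reg*I)^-1 S_b, i.e. S_b w = lambda (S_w + reg*I) w."""
    a, b = s_w[0][0] + reg, s_w[0][1]
    c, d = s_w[1][0], s_w[1][1] + reg
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    m = [[sum(inv[i][k] * s_b[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    w = [1.0, 1.0]
    for _ in range(iters):
        w = [m[0][0] * w[0] + m[0][1] * w[1], m[1][0] * w[0] + m[1][1] * w[1]]
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        w = [w[0] / norm, w[1] / norm]
    return w


# Hypothetical data: axis 0 varies across traversals (condition), axis 1 with place.
PLACES = {
    "place_a": [(-2.0, 0.0), (2.0, 0.0)],
    "place_b": [(-1.0, 1.0), (3.0, 1.0)],
    "place_c": [(-3.0, 2.0), (1.0, 2.0)],
}
S_W, S_B = scatter_matrices(PLACES)
W = top_generalized_eigvec(S_W, S_B)
```

In this toy data the recovered direction lies almost entirely along axis 1, the axis that distinguishes places, whereas a variance-preserving projection such as PCA would favor axis 0, where the condition-induced spread is largest.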
110. 【2605.30750】SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling
Link: https://arxiv.org/abs/2605.30750
Authors: Xiang Fang,Wanlong Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Video-Language Models, critical causal transitions, sparse frame sampling, frame sampling creates, era of Large
Comments: Accepted by ICML 2026
Abstract:In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental "temporal gap", rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the Semantic Least Action Principle (SLAP). Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments show the effectiveness of our proposed SLAP.
111. 【2605.30745】Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness
Link: https://arxiv.org/abs/2605.30745
Authors: Xiang Fang,Wanlong Fang,Wei Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved unprecedented success, aligning visual features, broad semantic concepts, Large Vision-Language Models, Large Language Models
Comments: Accepted by ICML 2026
Abstract: Large Vision-Language Models have achieved unprecedented success in zero-shot recognition by aligning visual features with broad semantic concepts. However, this semantic abstraction creates a critical vulnerability in open-world deployment: the "Hubris of Semantics", where models force-fit unknown anomalies into known categories with high confidence due to the lack of explicit negative knowledge. To address this Open-World Trustworthiness Paradox, we propose Immuno-VLM, a bio-inspired framework that adapts the biological principle of Immunological Negative Selection to high-dimensional latent spaces. Departing from traditional Open-Set Recognition methods that rely on passive density estimation or inefficient pixel-space outlier generation, Immuno-VLM leverages the generative reasoning of Large Language Models to actively hallucinate "Semantic Antibodies", textual descriptions of near-distribution outliers (e.g., look-alikes, contextual anomalies) that effectively bound the decision space of known [...]. Experiments on ImageNet-1K and four challenging OOD benchmarks reveal that Immuno-VLM establishes a new state-of-the-art.
112. 【2605.30742】Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding
Link: https://arxiv.org/abs/2605.30742
Authors: Xiang Fang,Daizong Liu,Wanlong Fang,Pan Zhou,Yu Cheng,Keke Tang,Kai Zou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: temporal sentence grounding, sentence grounding, temporal sentence, unsupervised temporal sentence, challenging TSG setting
Comments: Published in Findings of EMNLP 2023
Abstract: This paper addresses the task of temporal sentence grounding (TSG). Although many prior works have made solid progress on this important topic, they rely heavily on massive, expensive video-query paired annotations, which require a tremendous amount of human effort to collect in real-world applications. To this end, we target a more practical but challenging TSG setting: unsupervised temporal sentence grounding, where both paired video-query and segment boundary annotations are unavailable during network training. Considering that some other cross-modal tasks provide many easily available and cheap labels, we collect and transfer their simple cross-modal alignment knowledge into our complex scenarios: 1) We first explore the entity-aware object-guided appearance knowledge from the paired Image-Noun task, and adapt it to each independent video frame; 2) Then, we extract the event-aware action representation from the paired Video-Verb task, and further refine the action representation for more practical but complicated real-world cases with a newly proposed copy-paste approach; 3) By modulating and transferring both appearance and action knowledge into our challenging unsupervised task, our model can directly utilize this general knowledge to correlate videos and queries, and accurately retrieve the relevant segment without training. Extensive experiments on two challenging datasets (ActivityNet Captions and Charades-STA) demonstrate the effectiveness of our approach, which outperforms existing unsupervised methods and is even competitive with supervised works.
113. 【2605.30734】Beyond Accuracy: Evaluating Efficiency, Robustness and Explainability in Deep Learning for Malaria Diagnosis
Link: https://arxiv.org/abs/2605.30734
Authors: Olivier Kanamugire,Kerol Djoumessi
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: infrastructure makes timely, scarce diagnostic infrastructure, diagnostic infrastructure makes, sub-Saharan Africa, makes timely
Comments: Under review
Abstract: Malaria remains a leading cause of mortality in sub-Saharan Africa, where scarce diagnostic infrastructure makes timely, accurate diagnosis particularly challenging. While deep learning offers a compelling path toward automated malaria screening, clinical adoption is hindered by computational cost and opacity in decision-making. This work benchmarks four deep learning models spanning a wide range of architectural designs and model capacities on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. We find that lightweight, efficient-by-design models match their heavier counterparts in predictive performance, and the Friedman test confirms no statistically significant performance differences. CAM-based XAI methods consistently localize diagnostically relevant regions, while fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness evaluation under three types of image corruption further reveals that model confidence degrades faster than accuracy, providing a practical signal for human review. However, no XAI method is robust to corruption, with explanation reliability degrading at noise levels plausible in clinical practice, even when predictions remain accurate. These findings support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings, while highlighting the vulnerability of post-hoc explanations as an important consideration for responsible clinical deployment.
114. 【2605.30716】Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation
Link: https://arxiv.org/abs/2605.30716
Authors: Zhiyuan Yang,Jiahao Cheng,Vincent Quoc-Huy Trinh,Mahdi S. Hosseini
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Generating clinically, long visual-token sequences, whole-slide images, gigapixel resolution, long visual-token
Comments: Accepted by the DeLTA 2026 conference
Abstract: Generating clinically useful pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
115. 【2605.30714】Vision-Based Localization in Dense Urban Environments: A Case Study of an Urban Village in China
Link: https://arxiv.org/abs/2605.30714
Authors: Menglin Wu,Rui Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: major residential hubs, cities in China, rapid urbanization, result of rapid, major residential
Comments:
Abstract:Urban villages, the widespread informal settlements which have emerged as a result of rapid urbanization, are now major residential hubs for migrant workers in large cities in China. The dense arrangement of buildings in these areas often leads to unreliable GPS signals, while incomplete mapping data further impairs accurate route planning and navigation. These issues not only hinder everyday mobility but also pose significant challenges for emergency response, as confusing road layouts and GPS inaccuracies can complicate evacuation efforts. To address these challenges, we propose a practical vision-based geo-localization solution tailored for dense urban environments. Our approach features a low-cost data collection pipeline utilizing a dual-camera system, comprising a panoramic camera and a smartphone camera, to capture synchronized 360-degree panoramas and query images. Using Shipai Village, a well-known densely populated urban village in Guangzhou, as a case study, we develop a specialized image geo-localization dataset. We then assess and compare the performance of existing models across various scene types to identify their strengths and weaknesses. The findings demonstrate both the potential and limitations of visual-based localization in dense urban-village environments. Our framework aims to enhance pedestrian navigation, last-mile delivery, and emergency management in areas with poor GPS coverage, ultimately supporting the vulnerable populations living within these informal settlements.
116. 【2605.30713】Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models
Link: https://arxiv.org/abs/2605.30713
Authors: Yijie Tong,Yifan Hou,Shaobo Cui,Antoine Bosselut,Mrinmaya Sachan
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Test-time compute, large language models, lightweight approach, approach to boost, boost reasoning
Comments: ICML 2026
Abstract:Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
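To make the contrast between majority voting and entropy-based selection concrete, here is a minimal sketch. It is an assumption-laden illustration rather than the paper's ETTC procedure: each model's confidence is estimated from the empirical distribution of its own sampled answers, the model with the lowest answer entropy is chosen, and its most frequent answer is returned. The function names and the toy answers are hypothetical.

```python
import math
from collections import Counter

def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def majority_vote(samples):
    """Most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]

def entropy_select(samples_per_model):
    """Pick the model whose sampled answers are most concentrated
    (lowest entropy) and return (model_name, that model's majority answer).

    Assumed interface: samples_per_model maps a model name to the list of
    answers that model produced for one question.
    """
    best_model = min(samples_per_model,
                     key=lambda name: answer_entropy(samples_per_model[name]))
    return best_model, majority_vote(samples_per_model[best_model])

# Toy example with made-up answers: two uncertain models and one confident one.
samples = {
    "small_1": ["A", "A", "A", "B"],
    "small_2": ["A", "A", "C", "A"],
    "large": ["B", "B", "B", "B"],
}
pooled = [answer for answers in samples.values() for answer in answers]
print(majority_vote(pooled))    # pooled voting follows the more numerous votes
print(entropy_select(samples))  # entropy-based selection follows the confident model
```

With a single model, `entropy_select` simply returns that model's majority answer, which mirrors the abstract's remark that the method reduces to majority voting in the single-model case.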
117. 【2605.30705】Equivariant Latent Alignment via Flow Matching under Group Symmetries
Link: https://arxiv.org/abs/2605.30705
Authors: Sunghyun Kim,Jaehoon Hahm,Jeongwoo Shin,Joonseok Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Geometry-aware generative models, shown strong potential, Geometry-aware generative, view synthesis, view synthesis approaches
Comments:
Abstract:Geometry-aware generative models and novel view synthesis approaches have shown strong potential in visual fidelity and consistency. In parallel, equivariant representation learning has emerged as a powerful framework for constructing latent spaces where analytically known group transformations could act directly, capturing geometric structure in data and enhancing both interpretability and generalization in novel view synthesis. However, we identify that existing approaches often suffer from latent misalignment, a discrepancy between the intended group action and the actually required transformations in the latent space. Consequently, the learned latents often fail to consistently preserve the equivariant relations imposed by the underlying group symmetry. To address this, we propose Residual Latent Flow, a flow-based framework that corrects the misaligned latents, thereby improving compliance with the underlying equivariance relation. Our comprehensive experiments show that our method significantly reduces latent misalignment and improves novel view synthesis quality, under rotation groups SO(n).
118. 【2605.30700】Mathematical Morphology in Machine Learning
Link: https://arxiv.org/abs/2605.30700
Authors: Erick Oliveira Rodrigues,Aura Conci
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: mathematical morphology-an established, morphology-an established visual, established visual computing, visual computing theory-into, computing theory-into machine
Comments:
Abstract: This work introduces mathematical morphology, an established visual computing theory, into machine learning to exploit shape and density aspects often overlooked by standard techniques. We propose a fast clustering algorithm based on morphological reconstruction that accurately preserves cluster shapes and density. This scheme offers unique features: an intrinsic sense of maximal clusters, cost-free noise removal, and diverse growth patterns controlled by structuring elements. We also propose a novel distance metric combining Minkowski and Chebyshev distances, highly efficient for morphological dilations. In $Z^2$ discrete neighbourhood iterations, it is roughly 1.3 times faster than Manhattan and 329.5 times faster than Euclidean distances. When evaluated using a k-Nearest Neighbours (k-NN) classifier across 33 UCI datasets against 14 other distances, our metric achieved above-average accuracies most frequently (26 of 33 cases) and the best overall accuracy in 9. We further introduce novel morphological classifiers. Unlike current literature, this proposal uniquely models shape, density, and fractal information in datasets.
119. 【2605.30699】A Context-Aware Middleware for Medical Image Based Reports: An approach based on image feature extraction and association rules
Link: https://arxiv.org/abs/2605.30699
Authors: Erick O. Rodrigues,Jose Viterbo,Aura Conci,Trueman Mac Henry
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: medical workflow organization, efficiency improvement, work proposes, proposes a context-aware, workflow organization
Comments:
Abstract: This work proposes a context-aware middleware for medical workflow organization and efficiency improvement. In hospitals, laboratories and teleradiology companies, each physician or technician is specialized in a specific kind of diagnosis or analysis. Therefore, certain types of medical images are often forwarded to a certain physician or a certain group. This forwarding is time-consuming: repeatedly deciding who would be the best physician, and whether that physician is available at a given moment in a given context, is exhausting and may be very inefficient. The proposed middleware therefore collects and processes data from the images analyzed by each staff member. Based on the collected data and the current clinical context, the middleware is able to infer which staff member is the best fit to receive a given incoming medical image.
120. 【2605.30698】Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
Link: https://arxiv.org/abs/2605.30698
Authors: Yuhan Wang,Shuochen Chang,Yalin Feng,Dongsheng Ma,Yuanzi Li,Zhengren Wang,Yinglong Yang,Yufei Chen,Yikang Wang,Shaoxu Sun,Wentao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Keywords: Vision-language models, visual question answering, question answering, achieved strong performance
Comments:
Abstract: Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence (shared support from the image regions agents rely on) is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
121. 【2605.30689】ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization
Link: https://arxiv.org/abs/2605.30689
Authors: Kanchan Keisham,Thenukan Pathmanathan,Thangarajah Akilan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Zero-shot Temporal Action, Temporal Action Localization, Zero-shot Temporal, previously unseen actions, Action Localization
Comments: 4 figures, 8 tables
Abstract:Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer Self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.
122. 【2605.30671】WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation
Link: https://arxiv.org/abs/2605.30671
Authors: Varun Nair,Vidyut Baradwaj,Jiahang He,Anya Singh,Jai Relan,Cabrel Happi
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Recovering ego-camera orientation, Recovering ego-camera, egocentric demonstrations, disentangling hand motion, prerequisite for disentangling
Comments:
Abstract:Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
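The abstract above reports a median geodesic error in degrees. As a rough illustration of what such a number measures, the sketch below computes the standard geodesic distance on SO(3) between predicted and ground-truth rotation matrices and takes the median. The abstract does not spell out the benchmark's exact evaluation protocol, so treating it as this standard metric is an assumption, and the helper names are hypothetical.

```python
import math
import statistics

def rot_z(degrees):
    """Rotation matrix for a rotation about the z-axis (test helper)."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def geodesic_angle_deg(R_a, R_b):
    """Angle of the relative rotation R_a^T R_b, in degrees.

    Uses trace(R_a^T R_b) = sum_ij R_a[i][j] * R_b[i][j] and
    theta = arccos((trace - 1) / 2).
    """
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def median_geodesic_error(predicted, ground_truth):
    """Median per-frame geodesic error over paired rotation matrices."""
    errors = [geodesic_angle_deg(p, g) for p, g in zip(predicted, ground_truth)]
    return statistics.median(errors)
```

The median is less sensitive than the mean to a few frames with very large orientation errors, which is one common reason to report it for orientation estimation.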
123. 【2605.30639】PInVerify: An Offline Embodied Benchmark for Active Instance Verification
Link: https://arxiv.org/abs/2605.30639
Authors: Yuhang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: made strong progress, subtle attribute differences, white floral, white striped, require close-range
Comments: Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify
Abstract: Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify
124. 【2605.30631】Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models
Link: https://arxiv.org/abs/2605.30631
Authors: Arunkumar Kannan,Yanbo Zhang,Han Liu,Michael Baumgartner,Jianing Wang,Alexander Hertel,Bogdan Georgescu,Sasa Grbic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: automated diagnosis systems, achieved remarkable success, development remains limited, based lung cancer, lung cancer screening
Comments:
Abstract:While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.
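The idea of constraining intensity distributions with a histogram term can be sketched as follows. This is a toy illustration in plain Python, not the paper's loss: it assumes Gaussian soft binning over raw HU values and an L1 distance between normalized histograms, whereas the abstract describes a feature-space regularizer whose exact form is not given. The bin centers, bandwidth, and function names are hypothetical.

```python
import math

def soft_histogram(values, bin_centers, bandwidth):
    """Normalized soft histogram: each value spreads its unit mass over the
    bins with Gaussian weights, so the result varies smoothly with the inputs.

    Assumes every value lies within a few bandwidths of at least one bin
    center (otherwise the Gaussian weights underflow to zero).
    """
    mass = [0.0] * len(bin_centers)
    for v in values:
        kernel = [math.exp(-0.5 * ((v - c) / bandwidth) ** 2) for c in bin_centers]
        total = sum(kernel)
        for i, k in enumerate(kernel):
            mass[i] += k / total
    n = len(values)
    return [m / n for m in mass]

def histogram_loss(generated_values, target_hist, bin_centers, bandwidth):
    """L1 distance between the soft histogram of generated voxel intensities
    and a target histogram (an assumed choice of distance)."""
    generated_hist = soft_histogram(generated_values, bin_centers, bandwidth)
    return sum(abs(g - t) for g, t in zip(generated_hist, target_hist))
```

Because every operation here is smooth in the input values, the same construction written in an autodiff framework would provide gradients with respect to the generated intensities, which is what lets a histogram term act as a training-time regularizer alongside spatial losses.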
125. 【2605.30611】Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
Link: https://arxiv.org/abs/2605.30611
Authors: Haozhe Zhao,Shuzheng Si,Zhenhailong Wang,Zheng Wang,Liang Chen,Xiaotong Li,Zhixiang Liang,Maosong Sun,Minjia Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: complex research ideas, communicating complex research, producing publication-quality illustrations, publication-quality illustrations remains, research ideas
Comments: 24 pages, 11 figures
Abstract:Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at this https URL.
126. 【2605.30587】ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models
Link: https://arxiv.org/abs/2605.30587
Authors: Zihu Wang,Karthik Somayaji N.S,Peng Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: verbalizing intermediate reasoning, intermediate reasoning steps, natural language, significantly improved, ability of large
Comments:
Abstract: Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are often insufficient for encoding continuous visual evidence. Recent work addresses this limitation by moving reasoning into continuous latent space. Despite promising progress, existing methods leave latent reasoning insufficiently connected to the compositional and relational structure of visual evidence. To address this gap, we introduce ReGuLaR, a relation-grounded latent reasoning framework that explicitly grounds latent states in this critical yet overlooked visual evidence. ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations, while at inference time the model reasons and generates answers without invoking the ReGFormer. To support training ReGuLaR, we construct RGROUNDING-351K, a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations. Extensive experiments across diverse benchmarks show that ReGuLaR consistently outperforms existing approaches and achieves state-of-the-art performance. We include our code in the submission and will release the code and training data publicly upon acceptance.
127. 【2605.30581】Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes
Link: https://arxiv.org/abs/2605.30581
Authors: Chenxi Tao,Seung-Kyum Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: involves a broader, broader mismatch, evidence and required, Industrial visual, simulated RGB-D observations
Comments: Review article; 103 references; 9 main figures; empirical anchors on T-LESS/BOP, MVTec AD, and VisA
Abstract:Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.
128. 【2605.30578】AdvScene: Rethinking Adversarial Patch Evaluation Through Scene Robustness
Link: https://arxiv.org/abs/2605.30578
Authors: Xiaoyong (Brian) Yuan,Lan (Emily) Zhang
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: objects to mislead, Adversarial patches, scene robustness, scene, physical patterns attached
Comments:
Abstract:Adversarial patches are physical patterns attached to real objects to mislead AI vision systems. Their real-world risk is not determined by a single successful prediction, but by whether they remain effective after deployment under changing viewpoints, distances, and scene conditions. We refer to this property as scene robustness, the effectiveness of a deployed patch across conditions in a real environment. Yet existing evaluations do not measure scene robustness well: real image benchmarks are realistic but fixed, while simulators are controllable but not grounded in a specific real scene. We present AdvScene, a scene-grounded framework for measuring the scene robustness of adversarial patches in reconstructed real environments. AdvScene reframes evaluation as operational measurement: given a fixed deployed patch, it characterizes the patch's operational envelope - where and when the attack succeeds - as a function of viewpoint, distance, and scene context. A key challenge is that the attack is typically defined only in a single anchor view, while evaluation requires a representation that remains faithful under viewpoint changes. We formalize this as a constrained lifting problem and introduce Adversarial Patch-to-Scene Embedding (APSE), which resolves cross-view ambiguity while preserving attack-critical appearance and enforcing locality, target-surface attachment, and cross-view consistency. We validate AdvScene using real-world physical data and conduct a comprehensive evaluation of existing adversarial patches. Our results show that AdvScene reveals substantial scene-dependent variation in attack effectiveness that is not captured by existing image-centric or simulator-based evaluations.
129. 【2605.30561】VLM3: Vision Language Models Are Native 3D Learners
链接:https://arxiv.org/abs/2605.30561
作者:Zhipeng Cai,Zhuang Liu,Yunyang Xiong,Zechun Liu,Vikas Chandra,Yangyang Shi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Vision Language Models, Vision Language, Language Models, Vision, Language
备注:
点击查看摘要
Abstract:Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 - 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.
130. 【2605.30557】Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
链接:https://arxiv.org/abs/2605.30557
作者:Yue Zhang,Zun Wang,Han Lin,Yonatan Bitton,Idan Szpektor,Mohit Bansal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:deployed in real-world, real-world environments, fundamental capability, capability for vision-language, Spatial reasoning
备注: Website: [this https URL](https://zhangyuejoslin.github.io/spatialuncertain/)
点击查看摘要
Abstract:Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
131. 【2605.30544】On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection
链接:https://arxiv.org/abs/2605.30544
作者:Gudrun Schappacher-Tilp,Nicoletta Kaehling,Jan Kornberger,Egon Teiniker
类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
关键词:Data Protection Regulation, General Data Protection, Protection Regulation, creating fundamental tensions, Visual monitoring systems
备注: 6 pages, 4 figures, 3 tables, 1 listing
点击查看摘要
Abstract:Visual monitoring systems that rely on cloud-based AI inference expose raw image data to external services, creating fundamental tensions with the data-minimisation principle of the General Data Protection Regulation (GDPR). This paper presents a proof-of-concept privacy-by-design pipeline that resolves this tension by confining all inference entirely to the edge device. A YOLOv5n-seg model compiled for a Hailo-8L AI accelerator delivers real-time object detection on a Raspberry Pi 5, from which raw pixel buffers are immediately discarded after inference. A stateful trigger engine forwards minimal JSON event payloads to a locally hosted instance of Phi-3 Mini (3.8B parameters, Q4_0 quantisation), which synthesises one-to-two sentence natural-language alerts for a human operator. No image data crosses the network boundary at any point; only the generated text alert is transmitted. We describe the full system architecture and implementation, report measured inference latency and resource utilisation on the target hardware, and present representative generated alerts. The results demonstrate that combining a dedicated neural-network accelerator with an on-device large language model on a single-board computer is not only feasible but produces practically deployable, human-readable monitoring output while aligning with GDPR Art. 5(1)(c) by design.
132. 【2605.30519】OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
链接:https://arxiv.org/abs/2605.30519
作者:Lin Zhao,Yushu Wu,Yifan Gong,Yanzhi Wang,Pu Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:latent chunks sequentially, producing latent chunks, requires repeated access, long videos requires, videos requires repeated
备注: 22 pages, 14 figures; project page: [this https URL](https://wuyushuwys.github.io/OmniMem/)
点击查看摘要
Abstract:Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency over strong baselines, while maintaining comparable memory usage.
133. 【2605.30512】PhyDrawGen: Physically Grounded Diagram Generation from Natural Language
链接:https://arxiv.org/abs/2605.30512
作者:Nafiul Haque,Syed Nazmus Sakib,Shifat E Arman
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Generating physics diagrams, requires strict adherence, Generating physics, text requires strict, physics diagrams
备注: 9 figures, 7 tables. Under review at EMNLP 2026
点击查看摘要
Abstract:Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.
134. 【2605.30510】A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images
链接:https://arxiv.org/abs/2605.30510
作者:Sourjya Mukherjee,Ananya Bhattacharjee,R. Murugan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:brain tumor diagnosis, cancer severity necessitates, severity necessitates precise, Global Context-aware Squeeze, brain tumor
备注: 11 pages, 9 figures, 6 tables. Submitted to arXiv cs.CV
点击查看摘要
Abstract:Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identification, burdened by high costs, labor, and error risks, highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which facilitates a fusion of spatial and channel-wise attention and thus enhances the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices, delivering exceptional performance. Evaluations on benchmark databases exhibit its superiority, achieving a notable 94 percent dice score on the TCGA LGG dataset, surpassing the state-of-the-art dice score of 91.8 percent. In the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yielded dice scores of 95 percent, 92 percent, and 90 percent for the tumor regions - Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E), respectively. The current state-of-the-art dice scores were 94 percent, 93 percent, and 88 percent. These compelling outcomes highlight the efficacy of GCSER-UNet in precise brain tumor segmentation and thus can aid neurologists in effective brain cancer management and treatment planning.
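The Dice scores quoted in this abstract measure overlap between a predicted mask and a ground-truth mask. As a minimal sketch, assuming flat binary masks (lists of 0/1) and the standard definition Dice = 2|P ∩ T| / (|P| + |T|), the metric could be computed as below. The function name `dice_score` and the convention of returning 1.0 for two empty masks are illustrative assumptions, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (sequences of 0/1).

    Assumption: when both masks are empty, the score is defined as 1.0.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Overlap: pixels marked as tumor in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # One overlapping pixel, two predicted, one true: 2*1 / (2+1)
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

Per-region scores such as Whole Tumor, Tumor Core, and Enhancing Tumor would be obtained by applying the same function to each region's binary mask separately.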
135. 【2605.30506】VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments
链接:https://arxiv.org/abs/2605.30506
作者:Shivendra Agrawal,Bradley Hayes
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:grocery stores, geometrically aliased, mobile robots, hospitals poses, poses a significant
备注:
点击查看摘要
Abstract:Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long-tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated across two real-world environments with different characteristics and two different platforms (a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped), VLM-GLoc achieves 70% and 74% global localization success, respectively, substantially outperforming traditional geometry-only and domain-specific baselines.
136. 【2605.30469】3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark
链接:https://arxiv.org/abs/2605.30469
作者:Jialu Xu,Yifan Zhou
类目:Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
关键词:Audio Error Bench, Spatial Audio Error, audio error maps, Audio Error Map
备注:
点击查看摘要
Abstract:3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We frame these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary ground-truth and predicted binaural pairs and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps for the development and optimization of audio novel-view-synthesis models.
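One of the diagnostics named in this abstract is ILD (interaural level difference), the level ratio between the left and right channels expressed in dB. The sketch below is a simplified, time-domain illustration of an ILD error between a ground-truth and a predicted binaural pair, computed per fixed-length frame. The paper describes time-frequency maps; the broadband framing, the function names `ild_db` and `framewise_ild_error`, and the `frame` and `eps` parameters are all assumptions made here for illustration.

```python
import math


def ild_db(left, right, eps=1e-12):
    """Interaural level difference in dB; positive when the left channel is louder.

    eps is a small constant (an assumption) that avoids log(0) on silent frames.
    """
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    return 10.0 * math.log10((energy_left + eps) / (energy_right + eps))


def framewise_ild_error(gt_left, gt_right, pred_left, pred_right, frame=4):
    """Absolute ILD difference (dB) between prediction and ground truth, per frame."""
    errors = []
    for start in range(0, len(gt_left) - frame + 1, frame):
        window = slice(start, start + frame)
        gt = ild_db(gt_left[window], gt_right[window])
        pred = ild_db(pred_left[window], pred_right[window])
        errors.append(abs(pred - gt))
    return errors


if __name__ == "__main__":
    gt_l, gt_r = [1.0] * 8, [1.0] * 8
    pred_l, pred_r = [1.0] * 8, [1.0] * 4 + [0.5] * 4
    # The second frame has a quieter predicted right channel, so its error is non-zero.
    print(framewise_ild_error(gt_l, gt_r, pred_l, pred_r))
```

Keeping one error value per frame, rather than a single global number, is what lets such a diagnostic show where a prediction fails.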
137. 【2605.30467】Clustering Guided Domain-Specific Pretrained Foundation Model Very High-Resolution Arctic Remote Sensing
链接:https://arxiv.org/abs/2605.30467
作者:Amal S. Perera,Chandi Witharana,Elias Manos,Michael Pimenta,Anna K. Liljedahl
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:satellite image analysis, diversity-aware regional-scale image, combining diversity-aware regional-scale, regional-scale image curation, Vision Transformer
备注:
点击查看摘要
Abstract:This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3 million chips from 267 TB of Vantor VHSR imagery. This curation strategy was designed to reduce oversampling of visually repetitive or low-information areas while preserving broad scene diversity across the study domain. We pretrained a ViT-Large encoder on the curated corpus using a domain-adapted MAE reconstruction objective, producing Arctic-specific transformer weights for downstream feature mapping. The pretrained encoder was integrated into an existing location-aware detection and segmentation framework and evaluated across four hand-labeled Arctic datasets. Compared to an ImageNet-initialized ViT-Large baseline, Arctic MAE pretraining produced consistent improvements, with foreground mean F1 scores of 0.87, 0.72, 0.93, and 0.87 for infrastructure, IWP, RTS, and TCNs, respectively, an approximately 5-8 percentage increase. The proposed model also outperformed Prithvi-EO-2.0 in all downstream comparisons, with the smallest gain corresponding to at least a 15 percentage improvement in mean F1, suggesting that domain-specific self-supervised pretraining on curated Arctic VHSR imagery provides more transferable representations for fine-scale Arctic mapping than a general-purpose Earth observation foundation model. These results demonstrate that optimizing the pretraining data distribution at regional scale, while keeping the architecture and MAE objective fixed, can produce a reusable Arctic-domain encoder for multiple VHSR remote sensing applications.
138. 【2605.30444】Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation
链接:https://arxiv.org/abs/2605.30444
作者:Chrysa Pratikaki,Pablo Ruiz-Ponce,Jiankang Deng,Stefanos Zafeiriou,Rolandos Alexandros Potamias
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabled increasingly realistic, Recent advances, increasingly realistic motion, realistic motion synthesis, enabled increasingly
备注:
点击查看摘要
Abstract:Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text. At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence. By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed, omitting redundant test-time optimization and achieving up to a 540x inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation. Code and models will be released upon acceptance.
139. 【2605.30437】Mitigating Content Shift and Hallucination in GenAI Image Editing via Structural Refinement
链接:https://arxiv.org/abs/2605.30437
作者:Luxi Zhao,Michael S. Brown
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Nano Banana, produce visually compelling, visually compelling results, produce visually, enabling non-experts
备注:
点击查看摘要
Abstract:Generative AI (GenAI) image editors, such as Nano Banana, produce visually compelling results for retouching tasks, enabling non-experts to edit images through text prompts alone. However, the generative nature of these models often introduces spatial misalignment, texture distortion, and content hallucination, all of which are detrimental to downstream workflows that require pixel-level fidelity. We identify a problem setting we call "structure-preserving GenAI fusion" for black-box GenAI image retouching: retain the perceptual enhancements of a GenAI output while enforcing structural faithfulness to the original input image. To address this problem, we propose a post-processing framework that fuses an input image with its GenAI-enhanced counterpart by first establishing coarse spatial and photometric correspondences, then performing a fusion stage that transfers desired enhancements while suppressing hallucinated content. In the absence of direct prior work in this setting, we evaluate our framework against representative methods from photorealistic style transfer and image fusion. Our experiments demonstrate that our method better preserves aesthetic quality while maintaining pixel-level structural consistency and the input resolution.
140. 【2605.30431】DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
链接:https://arxiv.org/abs/2605.30431
作者:Hidir Yesiltepe,Koutilya PNVR,Gaurav Pathak,Navaneeth Bodla,Bharat Singh,Pinar Yanardag,Jinrong Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:standard classifier-free guidance, Recent progress, restoration remains limited, Decoupled Time Guidance, enabled remarkable generative
备注:
点击查看摘要
Abstract:Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4,400 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.
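Standard classifier-free guidance combines a conditional and an unconditional noise prediction evaluated at the same timestep: eps = eps_u + w * (eps_c - eps_u). The abstract says DTG instead evaluates the unconditional branch at a cleaner timestep. The sketch below is one possible reading of that idea on toy scalars, not the authors' implementation. The `model(x, t, cond)` callable, the `delta` offset, and the convention that a larger `t` means more noise are all assumptions introduced here.

```python
def decoupled_time_guidance(model, x, t, cond, scale, delta):
    """Guided noise estimate with the unconditional branch at a shifted timestep.

    model(x, t, cond) -> predicted noise; cond=None selects the unconditional branch.
    Assumptions: larger t means noisier, so t - delta is the "cleaner" timestep;
    delta=0 recovers standard classifier-free guidance.
    """
    t_uncond = max(t - delta, 0)
    eps_cond = model(x, t, cond)
    eps_uncond = model(x, t_uncond, None)
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    # Toy stand-in for a denoiser: purely illustrative, not a real diffusion model.
    def toy_model(x, t, cond):
        return x * t + (1 if cond is not None else 0)

    print(decoupled_time_guidance(toy_model, x=2, t=10, cond="prompt", scale=3, delta=4))
    print(decoupled_time_guidance(toy_model, x=2, t=10, cond="prompt", scale=3, delta=0))
```

The annealing mentioned in the abstract could be approximated in such a sketch by shrinking `delta` toward zero as sampling proceeds, though the exact schedule is not given in the abstract.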
141. 【2605.30409】SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
链接:https://arxiv.org/abs/2605.30409
作者:Yuyang Zhao,Yicheng Pan,Qiyuan He,Jincheng Yu,Junsong Chen,Tian Ye,Haozhe Liu,Enze Xie,Song Han
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:formidable challenge due, Hybrid Diffusion Transformer, Diffusion Transformer architecture, broadcasting and gaming, real-time streaming video
备注:
点击查看摘要
Abstract:Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
142. 【2605.30387】Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification
链接:https://arxiv.org/abs/2605.30387
作者:Hwa Hui Tew,Junn Yong Loo,Fang Yu Leong,Julia K. Lau,Ding Fan,Hernando Ombao,Raphaël C.-W. Phan,Chee Pin Tan,Chee-Ming Ting
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:Functional Magnetic Resonance, Magnetic Resonance Imaging, Functional Magnetic, Resonance Imaging, Magnetic Resonance
备注: Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)
点击查看摘要
Abstract:Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often struggle to replicate the inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades a dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and then projects this map into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low-frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class-conditioned cosine-frequency representations. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification. The code is available at this https URL.
143. 【2605.30380】Lightweight SAR Ship Detection via Contrastive Distillation
链接:https://arxiv.org/abs/2605.30380
作者:Surendar Devasundaram,Saber Latibari Banafsheh,Abhijit Mahalanobis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep convolutional, SAR ship detection, SAR ship, detectors achieve strong, achieve strong performance
备注: Accepted in GLSVLSI'26 special session 74: "Efficiency In Computer Vision: From Image Generation to Decision"
点击查看摘要
Abstract:Deep convolutional and transformer-based detectors achieve strong performance for SAR ship detection but are often computationally prohibitive for real-time or onboard deployment. Lightweight models offer improved efficiency yet struggle to capture the complex structural relationships inherent in SAR backscatter. Most existing SAR knowledge-distillation approaches rely on feature or logit matching, which enforces localized activation similarity while neglecting the geometric relationships among object representations. We propose a Structured Unified Relational knowledGE distillation framework for SAR Ship detection (SURGE) that transfers relational geometry from a powerful teacher detector to a compact student detector using a contrastive InfoNCE objective in a shared projection embedding space. To the best of our knowledge, this work presents the first knowledge distillation framework for transformer-based ship detectors in the SAR domain. The framework is architecture-agnostic in the sense that it provides a common region-level distillation interface for two-stage, one-stage and transformer-based detectors without modifying their deployed architectures. Experiments on the SSDD and HRSID benchmarks demonstrate that the proposed method yields substantial improvements for two-stage detectors, achieving up to 6.2 mAP and 8.0 AP75 gains over the baseline student and even surpassing teacher performance.
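The abstract names a contrastive InfoNCE objective between teacher and student embeddings. As a rough sketch of the generic InfoNCE loss (not the SURGE formulation itself), assume that `student[i]` and `teacher[i]` are projected embeddings of the same region, so that pair is the positive and every other teacher embedding is a negative. The function names, the cosine similarity choice, and the `temperature` value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(student, teacher, temperature=0.1):
    """Mean InfoNCE loss: student[i] should match teacher[i], not teacher[j != i]."""
    losses = []
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        # Numerically stable log-sum-exp over all teacher candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the matching teacher embedding.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(regions, regions))                    # aligned pairs: low loss
    print(info_nce(regions, [[0.0, 1.0], [1.0, 0.0]]))   # mismatched pairs: high loss
```

Because the loss depends on how each student embedding ranks against all teacher embeddings, it constrains relationships between regions rather than only matching individual activations.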
144. 【2605.30370】Updating the standard neuron model in artificial neural networks
链接:https://arxiv.org/abs/2605.30370
作者:Raul Mohedano,Thomas Batard,Erik Velasco-Salido,Ramsses De Los Santos Mendoza,Jorge H. Martínez,Stacey Levine,Marcelo Bertalmío
类目:Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:point neuron model, artificial neural networks, so-called point neuron, neuron model, standard neuron model
备注:
点击查看摘要
Abstract:From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuroscience, hoping that this analogy would allow for a better emulation of brain function. Over the years the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes; however, the standard neuron model in ANNs still remains the same. Here we substitute it by a very recent model of cortical cells and demonstrate through theoretical analyses and experimental results how, simply by using a more realistic neural unit element without augmenting the number of parameters, the resulting ANNs offer a number of important advantages that include increases in expressivity, robustness and learning speed, and a reduction in memorization and the amount of training data needed.
145. 【2605.30362】XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning
Link: https://arxiv.org/abs/2605.30362
Authors: Jianfang Wu, Junsong Wang
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Spiking neural networks, Spiking neural, demonstrating superior learning, deep SNNs, neural networks
Comments: 33 pages, 12 figures, 7 tables
Abstract:Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilities in deep models. Given the tremendous success of ResNet in deep learning, it is natural to train deep SNNs with residual learning. However, existing residual structures for constructing deep SNNs still present challenges of spike redundancy or information loss, as well as redundant learning. In the present study, we first aim to address issues of relative spike redundancy in identity mapping and information loss in non-identity mapping. To this end, we propose an OR-ADD (OA) shortcut connection to merge output spikes/currents from two branches in the residual structure. Furthermore, to mitigate redundant learning in the backbone branch of the residual structure, we introduce the concept of XOR meta-residuals, i.e., selecting pre-learning residuals using the Exclusive-OR (XOR) operation for the backbone branch. Finally, by integrating the OA shortcut and XOR meta-residuals, we devise the XOR residual block and further construct XOResNet with varying depths based on this block. Extensive experiments on four datasets, Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet, show that the proposed XOResNet outperforms existing state-of-the-art deep SNNs optimized via gradient descent. These results validate the effectiveness of our OA shortcut and XOR meta-residual components in overcoming fundamental limitations of residual learning in SNNs, providing new architectural insights for building high-performance neuromorphic systems.
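As a small illustration of why the choice of merge operation matters for binary spike trains, the sketch below compares element-wise ADD, OR, and XOR on two toy spike vectors. The abstract does not define the OA shortcut or the XOR meta-residual precisely, so this only shows the basic operations; the variable names and the toy data are assumptions made for illustration.

```python
import numpy as np

# Two hypothetical binary spike vectors from the two branches of a residual block.
shortcut_spikes = np.array([1, 0, 1, 0], dtype=np.uint8)
backbone_spikes = np.array([1, 1, 0, 0], dtype=np.uint8)

# ADD: coincident spikes sum to 2, so the result is no longer a binary spike train.
added = shortcut_spikes + backbone_spikes

# OR: coincident spikes collapse to a single spike, keeping the output binary.
or_merged = shortcut_spikes | backbone_spikes

# XOR: keeps only positions where exactly one branch fired.
xor_only = shortcut_spikes ^ backbone_spikes
```

Here `added` contains a value of 2 at the position where both branches fire, whereas `or_merged` and `xor_only` stay binary.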
146. 【2605.31426】Self-Tuning Regularization for Image Scanning Microscopy
Link: https://arxiv.org/abs/2605.31426
Authors: Sofia Agostoni, Lisa Cuneo, Christian Daniele, Giacomo Garré, Laurent Le, Alessandro Zunino, Giuseppe Vicidomini, Luca Calatroni
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Keywords: Image Scanning Microscopy, infinitesimally small pinhole, ideal confocal microscope, Scanning Microscopy, combines detector-array acquisition
Comments:
Abstract:Image Scanning Microscopy (ISM) is a fluorescence imaging technique that combines detector-array acquisition and computational reconstruction to achieve the theoretical resolution of an ideal confocal microscope, i.e., one operating with an infinitesimally small pinhole, while maintaining high signal-to-noise ratio. Among the reconstruction methods for obtaining the super-resolved image, multi-image deconvolution (MID) and its extension aimed at preserving the optical sectioning capability of confocal microscopy, known as super-resolution sectioning ISM (s$^2$ISM), are among the most widely used approaches. Both methods rely on Richardson--Lucy-type iterative schemes, whose semi-convergent behavior requires early stopping and often leads to noise amplification and reconstruction artifacts. In this work, we introduce a self-tuning explicit regularization framework for both MID and s$^2$ISM reconstruction. Within a Bayesian maximum a posteriori formulation, we combine a multi-frame Poisson data fidelity term with explicit regularization, considering $\ell_1$ and smoothed total variation penalties as representative examples. We further develop an automatic and ground-truth-free strategy for regularization parameter selection by adapting the residual whiteness principle to the multi-frame Poisson setting and introducing a spectral high-pass extension tailored to s$^2$ISM. The resulting framework enables stable reconstructions without empirical stopping rules. To demonstrate the proposed framework, we consider first-order optimization schemes based on proximal gradient and mirror descent methods with adaptive backtracking strategies. Experiments on simulated and real fluorescence ISM datasets demonstrate improved reconstruction stability and image quality with respect to unregularized approaches, while enabling robust super-resolution and optical sectioning in low-photon conditions.
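The Richardson-Lucy-type multi-image scheme that the abstract takes as its unregularized baseline can be sketched as follows. This is a minimal 1D version with circular convolution, written under simplifying assumptions (known, normalized PSFs, a fixed iteration count, no regularization); it is not the authors' code, and the helper names `circ_conv`, `circ_corr`, and `multi_image_rl` are hypothetical.

```python
import numpy as np

def circ_conv(x, psf):
    """Circular convolution of a 1D signal with a PSF via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(psf)))

def circ_corr(x, psf):
    """Circular correlation, i.e. the adjoint of circ_conv."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(psf))))

def multi_image_rl(frames, psfs, n_iter=50, eps=1e-12):
    """Unregularized multi-frame Richardson-Lucy iterations (illustrative only)."""
    x = np.full(frames[0].shape, np.mean(frames))  # flat, positive initial guess
    # Normalization term: sum over frames of the adjoint applied to ones.
    norm = sum(circ_corr(np.ones_like(x), h) for h in psfs)
    for _ in range(n_iter):
        ratio_sum = np.zeros_like(x)
        for y, h in zip(frames, psfs):
            blurred = np.maximum(circ_conv(x, h), eps)  # guard against division by zero
            ratio_sum += circ_corr(y / blurred, h)
        # Multiplicative update; clip tiny negative values caused by FFT round-off.
        x = np.maximum(x * ratio_sum / np.maximum(norm, eps), 0.0)
    return x
```

In this plain form the only regularization is the choice of `n_iter`, which is the early-stopping dependence that an explicit penalty with automatic parameter selection is meant to remove.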
147. 【2605.31302】MoE-dqINR: A Unified Mixture-of-Experts Implicit Neural Representation Framework for Scan-Specific Dynamic and Quantitative MRI Reconstruction
Link: https://arxiv.org/abs/2605.31302
Authors: Yinzhe Wu, Fanwen Wang, Zhenxuan Zhang, Zi Wang, Chengyan Wang, Guang Yang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: Undersampled magnetic resonance, magnetic resonance imaging, Undersampled magnetic, incomplete multicoil k-space, multicoil k-space data
Comments:
Abstract:Undersampled magnetic resonance imaging (MRI) reconstruction seeks to recover temporally or contrast-varying image series from incomplete multicoil k-space data while preserving state-dependent fidelity for dynamic and quantitative MRI (qMRI). Existing scan-specific implicit neural representations (INRs) often use monolithic spatiotemporal coordinate fields, explicit subspaces, motion or deformation models, calibration variables, or sequence-specific quantitative signal models. These design choices can limit flexibility in sharing spatial information while adapting image synthesis across acquisition states. Moreover, many INR-based baselines remain computationally demanding, typically requiring per-scan optimization times on the order of hundreds to thousands of seconds. We propose MoE-dqINR, a scan-specific multicoil MRI reconstruction framework that factorizes the image-domain representation into shared spatial experts and a state-conditioned routing pathway. Spatial experts encode reusable coordinate-dependent image content, whereas routing weights, conditioned on ordered acquisition states, synthesize each dynamic frame or contrast state from a common expert bank. The representation is coupled to a multicoil MRI forward model and uses the normalized state index to drive routing in both dynamic and quantitative MRI. By separating shared spatial representation from state-dependent synthesis, the framework provides an image-first architecture for dynamic and quantitative MRI while reducing scan-specific INR optimization to approximately 30 s per scan in our experiments. The proposed formulation establishes state-conditioned mixture-of-experts INR as a scan-specific multicoil MRI reconstruction prior that unifies shared spatial representation, dynamic- and qMRI-specific synthesis, and practical per-scan efficiency.
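The factorization described in the abstract, a shared bank of spatial experts mixed by state-conditioned routing weights, can be illustrated with a toy NumPy sketch. Everything concrete here is an assumption: the experts are fixed arrays rather than coordinate networks, the router is a single linear map followed by a softmax, and the multicoil forward model is omitted. The names `softmax`, `synthesize_states`, `router_w`, and `router_b` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def synthesize_states(expert_maps, n_states, router_w, router_b):
    """Mix K shared spatial expert maps into one image per acquisition state.

    expert_maps: (K, H, W) shared spatial content (stand-in for expert outputs).
    router_w, router_b: (K,) parameters of a toy linear router.
    Returns images of shape (T, H, W) and routing weights of shape (T, K).
    """
    # Normalized state index in [0, 1], one value per frame or contrast state.
    t = np.linspace(0.0, 1.0, n_states)[:, None]                    # (T, 1)
    weights = softmax(t * router_w[None, :] + router_b[None, :])    # (T, K)
    # Each state image is a convex combination of the shared expert maps.
    images = np.tensordot(weights, expert_maps, axes=(1, 0))        # (T, H, W)
    return images, weights
```

In a full reconstruction, the synthesized images would be passed through the MRI forward model and compared against the measured k-space data; that step is left out of this sketch.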

