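This post lists the latest papers collected daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

897 papers were updated today, including:

  • Natural Language Processing: 191
  • Information Retrieval: 39
  • Computer Vision: 133

Natural Language Processing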

1. 【2605.28819】PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Link: https://arxiv.org/abs/2605.28819

Authors: Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen, Bernhard Schölkopf, Weiyang Liu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, adapting large language, evaluations largely emphasize, largely emphasize downstream, emphasize downstream accuracy

Comments: Technical report v1 (28 pages, 9 figures, project page: [this https URL](https://spherelab.ai/PEFT-Arena/))

Abstract:Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.
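The "most favorable Pareto frontier" in this abstract can be made concrete with a small sketch. The code below is a generic Pareto-front filter over (downstream score, retention score) pairs, where higher is better on both axes. The method names and numbers are hypothetical and invented for illustration; they are not results from PEFT-Arena, and the benchmark's actual scoring procedure is not described in the abstract.

```python
from typing import List, Tuple

# (method name, downstream score, retention score); higher is better on both.
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[str]:
    """Return the names of methods that no other method dominates.

    A method is dominated if another method is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, down, keep in points:
        dominated = any(
            (d2 >= down and k2 >= keep) and (d2 > down or k2 > keep)
            for _, d2, k2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical scores, invented for illustration only (not from the paper).
scores = [
    ("method_a", 0.70, 0.90),
    ("method_b", 0.75, 0.80),
    ("method_c", 0.72, 0.78),  # worse than method_b on both axes
    ("method_d", 0.80, 0.60),
]

print(pareto_front(scores))
```

Under this reading, a method with a favorable stability-plasticity profile is one that stays on the front across parameter budgets rather than one that maximizes downstream accuracy alone.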

2. 【2605.28818】VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

Link: https://arxiv.org/abs/2605.28818

Authors: Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)

Keywords: Large language models, Large language, natural reading, increasingly useful computational, Large

Comments: 17 pages, 10 figures

Abstract:Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.

3. 【2605.28814】Self-Improving Language Models with Bidirectional Evolutionary Search

Link: https://arxiv.org/abs/2605.28814

Authors: Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju, Sham M. Kakade, Yilun Du

Categories: Computation and Language (cs.CL)

Keywords: self-improving language models, agentic systems, self-improving language, BES, Search

Abstract:Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at this https URL.

4. 【2605.28806】Personal Visual Memory from Explicit and Implicit Evidence

Link: https://arxiv.org/abs/2605.28806

Authors: Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: remain largely text-centric, methods remain largely, largely text-centric, methods remain, remain largely

Comments: Project Page: [this https URL](https://viettmab.github.io/visualmem-page/)

Abstract:Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states -- both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual--text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

5. 【2605.28805】OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Link: https://arxiv.org/abs/2605.28805

Authors: Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multimodal large language, large language models, outcomes are increasingly, increasingly central, large language

Comments: ICML 2026. Project: [this https URL](https://github.com/Cominclip/OmniVerifier)

Abstract:Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

6. 【2605.28802】Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

Link: https://arxiv.org/abs/2605.28802

Authors: Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen, Barbara Plank

Categories: Computation and Language (cs.CL)

Keywords: Free-text explanations extend, Free-text explanations, explanations extend human, annotators' decisions, explanations extend

Comments: 43 pages, 20 figures

Abstract:Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators' decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each -- natural language inference and paraphrase judgment -- we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.

7. 【2605.28791】Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Link: https://arxiv.org/abs/2605.28791

Authors: Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: teacher-side privileged information, dense token-level supervision, On-policy self-distillation, sparse verifier outcomes, turn sparse verifier

Abstract:On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at this https URL.

8. 【2605.28782】Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

Link: https://arxiv.org/abs/2605.28782

Authors: Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen

Categories: Computation and Language (cs.CL)

Keywords: Discourse particles, handling discourse particles, crucial components, components that enable

Abstract:Discourse particles, such as _well_ and _kind of_, are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs' capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models' pragmatic competence.

9. 【2605.28779】The Abstraction Gap in Vision-Language Causal Reasoning

Link: https://arxiv.org/abs/2605.28779

Authors: Chinh Hoang, Mohammad Rashedul Hasan

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fluent causal explanations, distinguish linguistic plausibility, Vision-language models, Vision-language, Abstraction Gap

Abstract:Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6--8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
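The abstract describes the Abstraction Gap only as a "normalized performance difference" and does not give the formula. One plausible reading, assumed here purely for illustration, is the text-probe score minus the chain-probe score, divided by the text-probe score. The function name and the normalization are assumptions, not the paper's definition; the sketch only shows that this reading is consistent with the reported ranges (text scores of 6 to 8, chain scores below 2.5, AG above 0.50).

```python
def abstraction_gap(text_score: float, chain_score: float) -> float:
    """Assumed form of AG: (text - chain) / text, floored at 0.

    This normalization is a guess for illustration; the abstract does not
    state the exact definition used by the paper.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return max(0.0, (text_score - chain_score) / text_score)


# Boundary values taken from the ranges quoted in the abstract.
for text, chain in [(6.0, 2.5), (8.0, 2.5)]:
    print(text, chain, round(abstraction_gap(text, chain), 3))
```

Under this assumed form, a model whose chain-probe score matches its text-probe score would have an AG of zero, which matches the abstract's description of one model reaching near-zero AG.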

10. 【2605.28778】Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

Link: https://arxiv.org/abs/2605.28778

Authors: Gabrielle Kaili-May Liu, Arman Cohan

Categories: Computation and Language (cs.CL)

Keywords: LLMs' linguistically expressed, linguistically expressed confidence, LLMs' linguistically, linguistically expressed, intrinsic uncertainty

Comments: Code: [this https URL](https://github.com/yale-nlp/marker_internal_confidence)

Abstract:LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.

11. 【2605.28775】Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Link: https://arxiv.org/abs/2605.28775

Authors: Suji Kim, Kangsan Kim, Sung Ju Hwang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: made substantial progress, recently made substantial, separate large expert, Computer-use agents, domain remains expensive

Abstract:Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

12. 【2605.28774】Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Link: https://arxiv.org/abs/2605.28774

Authors: Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee

Categories: Computation and Language (cs.CL)

Keywords: real-world problems require, problems require external, extended reasoning succeed, Vision-language models, require external tools

Comments: Project page: [this https URL](https://byungkwanlee.github.io/AXPO-page/)

Abstract:Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO at average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B on average) and 8B with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.

13. 【2605.28773】Rethinking Memory as Continuously Evolving Connectivity

Link: https://arxiv.org/abs/2605.28773

Authors: Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)

Keywords: Existing memory-augmented LLM, memory-augmented LLM agents, fixed retrieval pipelines, signals continuously reshape, heterogeneous signals continuously

Comments: Ongoing work

Abstract:Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced in this https URL.

14. 【2605.28751】Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Link: https://arxiv.org/abs/2605.28751

Authors: Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: trace the Pareto, Pareto front, Linear interpolation, remains unclear, competing objectives

Comments: 54 pages

Abstract:Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
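The difference between interpolating and extrapolating two checkpoints can be sketched with the standard linear form θ(α) = θ_low + α(θ_high − θ_low). The abstract does not spell out the paper's exact averaging scheme, so the function below is a minimal sketch under that assumption; the function name, the plain-dictionary weight format, and the toy values are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def combine(theta_low: Weights, theta_high: Weights, alpha: float) -> Weights:
    """Return theta_low + alpha * (theta_high - theta_low), per parameter.

    0 <= alpha <= 1 interpolates between the two checkpoints;
    alpha < 0 or alpha > 1 extrapolates beyond one of the endpoints.
    """
    return {
        name: [
            lo + alpha * (hi - lo)
            for lo, hi in zip(theta_low[name], theta_high[name])
        ]
        for name in theta_low
    }


# Hypothetical low-coverage and high-coverage checkpoints.
low = {"w": [0.0, 1.0]}
high = {"w": [1.0, 3.0]}

print(combine(low, high, 0.5))   # midpoint between the two checkpoints
print(combine(low, high, 1.5))   # past the high-coverage endpoint
print(combine(low, high, -0.5))  # past the low-coverage endpoint
```

Because each value of α yields a distinct set of weights without further training, sweeping α produces a family of checkpoints, which is the sense in which the abstract describes extrapolated checkpoints as complementary policies for inference-time scaling.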

15. 【2605.28745】Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context

Link: https://arxiv.org/abs/2605.28745

Authors: Thomas Mbrice

Categories: Computation and Language (cs.CL)

Keywords: Polymarket aggregate crowd, real-time probability estimates, aggregate crowd beliefs, traders post beneath, rich directional stance

Comments: 14 pages, 9 figures

Abstract:Prediction markets such as Polymarket aggregate crowd beliefs into real-time probability estimates, and the comments traders post beneath each market contain rich directional stance signals that prices alone cannot capture. This work introduces the first stance detection study applied to prediction market commentary, a domain characterized by extreme brevity, trader-specific vernacular, and severe class imbalance (only 8.7% of comments oppose the market outcome). RoBERTa-base is fine-tuned across a 4 x 3 ablation: four input configurations ({2-class, 3-class} x {with/without market context}) and three augmentation conditions (baseline, 50% synthetic, 100% synthetic). Synthetic minority-class samples are generated via LLM-driven Pro-to-Anti counterfactual flips using the Anthropic API. Results show that (1) market context is the single most impactful factor, raising 3-class Anti recall from 0.10 to 0.45; (2) counterfactual augmentation is conditionally effective, improving Anti F1 in weak configurations (0.10 to 0.24) while degrading strong ones (2-class-ctx macro F1: 0.68 to 0.50 at full dose); and (3) 50% augmentation is the optimal dose, with 100% consistently hurting performance. Attention-based interpretability analysis provides mechanistic support for all three findings.

16. 【2605.28740】Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

Link: https://arxiv.org/abs/2605.28740

Authors: Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, Reverse Probing, large language, increasingly deployed, propose Reverse Probing

Abstract:As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

17. 【2605.28734】Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

Link: https://arxiv.org/abs/2605.28734

Authors: Richard J. Young, Gregory D. Moody

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: question returns text, harmful question returns, general-purpose language model, returns text, question returns

Comments: 21 pages, 9 figures, 5 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) under a five-judge panel; 6,675 prompts, 33,375 classification calls, Fleiss' kappa = 0.767

Abstract:A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.
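For readers unfamiliar with the agreement statistic quoted above, the sketch below computes Fleiss' kappa from a table of per-item category counts using the standard textbook formula. The toy ratings assume five judges and two categories and are invented for illustration; they are not the paper's data, and the paper's own labeling scheme may use more categories.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters n (with n >= 2)."""
    num_items = len(counts)
    n = sum(counts[0])
    num_categories = len(counts[0])

    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    observed = sum(per_item) / num_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (num_items * n)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 4 prompts, 5 judges, 2 categories.
toy = [[5, 0], [5, 0], [0, 5], [4, 1]]
print(round(fleiss_kappa(toy), 3))

# The call count quoted in the abstract is prompts times judges.
print(6675 * 5)
```

Kappa corrects raw agreement for the agreement expected by chance given how often each category is used, which is why it is reported alongside the share of unanimous prompts rather than instead of it.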

18. 【2605.28732】MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Link: https://arxiv.org/abs/2605.28732

Authors: Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: support long-horizon reasoning, large language models, systems remain unreliable, enabling large language, long-horizon reasoning

Comments: Ongoing work

Abstract:Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at this https URL.

19. 【2605.28714】IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Link: https://arxiv.org/abs/2605.28714

Authors: Michael Galarnyk,Siddharth Lohani,Vidhyakshaya Kannan,Sagnik Nandi,Aman Patel,Liqin Ye,Arnav Hiray,Rutwik Routu,Prasun Banerjee,Siddhartha Somani,Sudheer Chava

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Initial Public Offering, Public Offering, Initial Public, IPO filings, allowing individual

Comments: 12 pages

Click to view abstract

Abstract:An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

20. 【2605.28710】Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Link: https://arxiv.org/abs/2605.28710

Authors: Irune Zubiaga,Aitor Soroa,Rodrigo Agerri

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, prior work focuses, generated text, Large

Comments

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.

21. 【2605.28700】The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Link: https://arxiv.org/abs/2605.28700

Authors: Dominika Agnieszka Długosz,Arlindo Oliveira,Natalia Díaz Rodríguez

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, reported consistent performance, models lack genuine, genuine reasoning capabilities, consistent performance drops

Comments: 38 pages, 11 figures. Submitted to ACL ARR / EMNLP 2026

Click to view abstract

Abstract:The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large-number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles (including fragility of variable binding, arithmetic limitations, and dual-task interference), underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
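
For readers unfamiliar with the K-S statistic quoted above, the sketch below shows the two-sample version: the largest gap between the empirical CDFs of two samples. It is an illustration only, not the paper's analysis code. The function name `ks_statistic` and the sample integers are made up, and a p-value would normally come from a statistics library such as `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs (hypothetical helper)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)

    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up integers standing in for numbers found in two sets of problem texts.
base_numbers = [2, 3, 5, 8, 12, 15, 20, 24]
variant_numbers = [4, 9, 16, 25, 40, 60, 75, 90]
print(ks_statistic(base_numbers, variant_numbers))
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all.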

22. 【2605.28669】Sense Representations Are Inducible Interfaces

Link: https://arxiv.org/abs/2605.28669

Authors: Jan Christian Blaise Cruz,Alham Fikri Aji

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: per-token meaning decompositions, existing approaches require, approaches require models, sense structure baked, per-token meaning

Comments: [this https URL](https://github.com/jcblaisecruz02/acros)

Click to view abstract

Abstract:Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.

23. 【2605.28664】Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

Link: https://arxiv.org/abs/2605.28664

Authors: Vijeta Deshpande,Tootiya Giyahchi,Veena Padmanabhan,Leman Akoglu,Anna Rumshisky

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: violating outputs, robust generalization, outputs for robust, Helpful, Harmless

Comments

Click to view abstract

Abstract:Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization; however, such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data results in a better classifier than the prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.
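
The harmonic-mean heuristic mentioned above can be sketched as follows. This is an illustration, not the authors' implementation. It assumes the three axis scores are already normalized to [0, 1], and the function name and example values are hypothetical.

```python
from typing import Sequence


def harmonic_mean(scores: Sequence[float]) -> float:
    """Harmonic mean of non-negative scores; returns 0.0 if any score is 0,
    so one collapsed axis pulls the whole summary down."""
    if not scores or any(s < 0 for s in scores):
        raise ValueError("scores must be non-empty and non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical (success, coherence, diversity) scores for two configurations.
balanced = (0.7, 0.7, 0.7)
low_diversity = (0.9, 0.9, 0.1)

print(harmonic_mean(balanced))
print(harmonic_mean(low_diversity))
```

Unlike an arithmetic mean, the harmonic mean penalizes a configuration that scores high on success and coherence but has low diversity, which matches the "narrow regime" the abstract describes.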

24. 【2605.28649】Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

Link: https://arxiv.org/abs/2605.28649

Authors: Li Lei,Madalina Ciobanu,Qingqing Mao,Ritankar Das

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: enhance domain-specific capabilities, LLMs increasingly require, increasingly require surgical, LLMs increasingly, full fine-tuning

Comments

Click to view abstract

Abstract:LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle allowing for feature-level identification of where to intervene. In this work, we rigorously evaluate an SAE-guided editing pipeline for mathematical reasoning on Gemma-3-4B-IT and uncover a fundamental failure mode: the intuitively appealing approach of projecting task vectors onto SAE feature subspaces acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant improvements across seven math subjects. We show that this failure stems from a geometric misalignment between activation-space SAE directions and weight-space task vectors. We then propose a shift in perspective: SAE as a Stethoscope, Not a Scalpel, where SAEs are used for layer-level diagnosis rather than intervention-level filtering. By injecting unfiltered raw task vectors only into layers identified by an SAE-derived specificity score, we improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on the Minerva Math benchmark; 5 of 7 math subjects significantly improved and none significantly degraded. Our method is fully deterministic, requires no additional inference cost, and provides a principled framework for interpretability-guided model editing.
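
To make the "modification energy" idea above concrete, the sketch below computes the fraction of a vector's squared norm that survives projection onto a subspace. It is a toy illustration and not the paper's pipeline. It assumes the subspace is given by an orthonormal basis (real SAE directions are generally not orthonormal and would need orthogonalizing first), and all names and numbers are made up.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retained_energy(vector: Sequence[float],
                    orthonormal_basis: Sequence[Sequence[float]]) -> float:
    """Fraction of the squared norm kept after projecting `vector` onto the
    span of `orthonormal_basis` (assumed orthonormal; hypothetical helper)."""
    total = dot(vector, vector)
    if total == 0.0:
        raise ValueError("vector must be non-zero")
    # For an orthonormal basis, the projected squared norm is the sum of
    # squared coefficients along each basis direction.
    kept = sum(dot(vector, b) ** 2 for b in orthonormal_basis)
    return kept / total


# Toy 3-D "task vector" projected onto a 1-D subspace.
task_vector = [3.0, 4.0, 0.0]
subspace = [[1.0, 0.0, 0.0]]
kept = retained_energy(task_vector, subspace)
print(f"kept {kept:.2f}, discarded {1.0 - kept:.2f}")
```

In these terms, the abstract's finding corresponds to a retained fraction of roughly 0.03.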

25. 【2605.28646】MaskClaw: Edge-Side Personalized Privacy Arbitration for GUI Agents with Behavior-Driven Skill Evolution

Link: https://arxiv.org/abs/2605.28646

Authors: Yanqiu Zhao,Dongying Zheng,Kaibo Huang,Yukun Wei,Zhongliang Yang,Linna Zhou

Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: GUI agents rely, medical records, payment credentials, private messages, workplace-specific workflows

Comments: Preprint. Submitted to EMNLP 2026. 21 pages, including appendices; 5 figures

Click to view abstract

Abstract:GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace-specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud-side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge-side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. In five designed skill-evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots under the same protocol. The artifact is available at this https URL.

26. 【2605.28645】GraphSteal: Structural Knowledge Stealing from Graph RAG via Traversal Reconstruction

Link: https://arxiv.org/abs/2605.28645

Authors: Jinze Gu,Qinghua Mao,Xi Lin,Jun Wu

Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Graph RAG, Graph RAG systems, Retrieval-Augmented Generation, grounding generation, query-relevant external evidence

Comments

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) enhances LLMs by grounding generation in query-relevant external evidence. Beyond unstructured text corpora, Graph RAG integrates knowledge graphs into the retrieval pipeline, enabling LLMs to access entities, relations, and multi-hop dependencies encoded in structured knowledge. However, the same structured knowledge that empowers Graph RAG also creates a new privacy attack surface. We demonstrate that Graph RAG systems can be turned into structural oracles: through adaptive black-box interactions, an adversary can elicit sufficient relational evidence to reconstruct substantial portions of the hidden knowledge graph. We propose a structure-oriented reconstruction framework that recovers targeted graphs from both local and global perspectives. Specifically, Depth-Wise Heuristic Search extracts fine-grained node attributes by recursively expanding entity-centered evidence, while Breadth-Wise Diffusion Search infers graph topology by propagating across relation-induced neighborhoods. Experiments on generic and healthcare scenarios demonstrate that our method can recover over 90% of the original knowledge graph from representative Graph RAG systems, revealing sensitive entities, relations, and structural dependencies with high fidelity. Existing guardrails provide limited defense against our attack, highlighting the inherent difficulty of safeguarding structural privacy in Graph RAG pipelines.

27. 【2605.28643】GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study

Link: https://arxiv.org/abs/2605.28643

Authors: Gaspard Michel,Elena V. Epure,Romain Hennequin,Christophe Cerisara,Mirella Lapata

Subjects: Computation and Language (cs.CL)

Keywords: representing character interactions, Methods to represent, Heterogeneous Character Networks, represent literary texts, crucial aspect

Comments

Click to view abstract

Abstract:Methods to represent literary texts as graphs or sequences of graphs mainly focus on representing character interactions, and often overlook another crucial aspect: the textual context in which characters interact. We introduce Dynamic Heterogeneous Character Networks (DHCNs), which organize long novels into temporally localized heterogeneous graphs that align characters with their textual contexts. We extract around 20,000 DHCNs from Project Gutenberg, and propose GraphLit, a self-supervised learning framework that learns rich literary representations through a masked graph autoencoder objective. Across a wide range of 12 character-related tasks, GraphLit improves over text-only and graph-only baselines, particularly on tasks requiring contextual understanding. Finally, we demonstrate the applicability of DHCNs and GraphLit for literary analysis by studying the link between narrative non-linearity and dynamic social features.

28. 【2605.28639】The Attentional White Bear Effect in Transformer Language Models

Link: https://arxiv.org/abs/2605.28639

Authors: Rebecca Ramnauth,Brian Scassellati

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generating prohibited content, reduces internal representation, prevent language models, suppression reduces internal, Instruction-based suppression

Comments: Currently under review at EMNLP 2026

Click to view abstract

Abstract:Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.

29. 【2605.28629】Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

Link: https://arxiv.org/abs/2605.28629

Authors: Zheng Wu,Pengzhou Cheng,Zongru Wu,Yuan Guo,Tianjie Ju,Aston Zhang,Gongshen Liu,Zhuosheng Zhang

Subjects: Computation and Language (cs.CL)

Keywords: large language models, multimodal large language, shown exceptional potential, Recent advancements, autonomously execute human

Comments: Accepted by TASLP

Click to view abstract

Abstract:Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement of over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at this https URL.

30. 【2605.28616】Measuring Form and Function in Language Models

Link: https://arxiv.org/abs/2605.28616

Authors: Héctor Javier Vázquez Martínez,Charles Yang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: introduce quantitative metrics, child language acquisition, Contextual Alternative Choice, introduce quantitative, quantitative metrics

Comments: Under review at ACL Rolling Review May 2026 cycle

Click to view abstract

Abstract:We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method which provides targeted tests for both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children, and more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meets both formal and functional benchmarks as human children do, but some very large models do. We present our results as methodological and technical contributions, with specific emphasis on the cognitive status of language models.

31. 【2605.28607】Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Link: https://arxiv.org/abs/2605.28607

Authors: Susanna Cifani,Mario Luca Bernardi,Marta Cimitile

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Modern information systems, general environmental perception, information systems require, systems require autonomous, structured metadata parsing

Comments: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026)

Click to view abstract

Abstract:Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

32. 【2605.28602】Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Link: https://arxiv.org/abs/2605.28602

Authors: Leizhen Zhang,Shuhan Chen,Sheng Chen

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: SAT remains unclear, Boolean satisfiability, reduce to Boolean, Large language models, Large language

Comments: Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
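
The difference between plain accuracy and the paired metric described above can be sketched as follows. This follows only the abstract's one-line definition of ADR (both members of a pair must be classified correctly) and is not the authors' code. The data layout and function names are assumptions.

```python
from typing import Sequence, Tuple

# Each pair holds the model's two predictions, True meaning "satisfiable":
# (prediction on the satisfiable member, prediction on the unsatisfiable member).
PairPrediction = Tuple[bool, bool]


def accuracy(pairs: Sequence[PairPrediction]) -> float:
    """Per-formula accuracy over all 2 * len(pairs) instances."""
    hits = sum(int(on_sat) + int(not on_unsat) for on_sat, on_unsat in pairs)
    return hits / (2 * len(pairs))


def accurate_differentiation_rate(pairs: Sequence[PairPrediction]) -> float:
    """Share of pairs where BOTH members are classified correctly."""
    both_right = sum(1 for on_sat, on_unsat in pairs if on_sat and not on_unsat)
    return both_right / len(pairs)


# A degenerate model that answers "satisfiable" for every formula.
always_sat = [(True, True)] * 4
print(accuracy(always_sat), accurate_differentiation_rate(always_sat))
```

The always-satisfiable model still gets half of the formulas right under plain accuracy, but it scores zero on the paired metric because it never distinguishes the two members of a pair.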

33. 【2605.28598】Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

Link: https://arxiv.org/abs/2605.28598

Authors: Alejandro Buitrago López,Alberto Ortega Pastor,Javier Pastor-Galindo,José A. Ruipérez-Valiente

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM-powered social agents, online social behavior, simulate online social, realism remains difficult, LLM-powered social

Comments

Click to view abstract

Abstract:LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

34. 【2605.28591】Models That Know How Evaluations Are Designed Score Safer

Link: https://arxiv.org/abs/2605.28591

Authors: Katharina Deckenbach,Haritz Puerto,Jonas Geiping,Sahar Abdelnabi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: models behaving consistently, deployment settings, behaving consistently, consistently across controlled, controlled and deployment

Comments

Click to view abstract

Abstract:The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at this https URL.

35. 【2605.28565】Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Link: https://arxiv.org/abs/2605.28565

Authors: Yongsik Seo,Wooseok Jeong,Eunyoung Kim,Hyeonseo Jang,Dongha Lee

Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: rarely verify, verify the cited, cited pages, citation, search-augmented LLMs rely

Comments: Work in progress

Click to view abstract

Abstract:Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

36. 【2605.28561】Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

Link: https://arxiv.org/abs/2605.28561

Authors: Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: improved language models, Reinforcement Learning, mathematics and code, checked automatically, improved language

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
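The contrast between a checklist-based soft reward and a holistic pass/fail reward can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `soft_reward` and `holistic_reward` are hypothetical, and the unweighted mean over item verdicts is an assumption (the abstract only says that item-level judgments are averaged).

```python
from typing import Sequence


def soft_reward(item_verdicts: Sequence[bool]) -> float:
    """Partial-credit reward: fraction of checklist items judged satisfied.

    Assumption: an unweighted mean; the paper may weight items differently.
    """
    if not item_verdicts:
        raise ValueError("checklist must contain at least one item")
    return sum(item_verdicts) / len(item_verdicts)


def holistic_reward(item_verdicts: Sequence[bool]) -> float:
    """Sparse pass/fail reward: 1.0 only if every requirement is satisfied."""
    if not item_verdicts:
        raise ValueError("checklist must contain at least one item")
    return 1.0 if all(item_verdicts) else 0.0


# A response that meets three of four requirements.
verdicts = [True, True, False, True]
print(soft_reward(verdicts))      # 0.75
print(holistic_reward(verdicts))  # 0.0
```

The example also shows the trade-off the abstract mentions: the incomplete response still earns 0.75 under partial credit, while the holistic reward gives it nothing.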

37. 【2605.28543】Cultural Binding Heads in Language Models

Link: https://arxiv.org/abs/2605.28543

Authors: Avrile Floro, Luca Benedetto

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: context warrants differentiation, LLMs often default, difference awareness, default to equal, equal treatment

Abstract:LLMs often default to equal treatment across cultural groups, even though context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. An $\alpha$-scaling analysis shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2-3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not knowledge.

38. 【2605.28534】GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Link: https://arxiv.org/abs/2605.28534

Authors: Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang

Categories: Computation and Language (cs.CL)

Keywords: Graphical User Interface, building Graphical User, User Interface, Graphical User, multimodal large language

Abstract:Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success. Code is available at this https URL.

39. 【2605.28526】Entropy-aware Masking for Masked Language Modeling

Link: https://arxiv.org/abs/2605.28526

Authors: Gokul Srinivasagan, Kai Hartung, Munir Georges

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: standard pretraining objective, Masked language modeling, encoder-based language models, standard pretraining, pretraining objective

Comments: Accepted at the starsem 2026 conference

Abstract:Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
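Entropy-based token selection can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the 15% default masking ratio is the conventional BERT value rather than a figure from the abstract, and picking the top-k highest-entropy positions deterministically is one possible reading of "identify which tokens should be masked".

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_mask_positions(
    dists: Sequence[Sequence[float]], mask_ratio: float = 0.15
) -> List[int]:
    """Return the indices of the highest-entropy positions, in ascending order.

    `dists[i]` is the model's predicted distribution over the vocabulary at
    position i. At least one position is selected for a non-empty input.
    """
    n_mask = max(1, round(len(dists) * mask_ratio))
    ranked = sorted(
        range(len(dists)), key=lambda i: token_entropy(dists[i]), reverse=True
    )
    return sorted(ranked[:n_mask])


# Toy two-word vocabulary: positions 1 and 3 are the most uncertain.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
print(select_mask_positions(dists, mask_ratio=0.5))  # [1, 3]
```

In the self-masking variant described in the abstract, the distributions would come from the model being trained rather than from an external reference model.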

40. 【2605.28521】ClinicalEncoder26AM: A Multlilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task

Link: https://arxiv.org/abs/2605.28521

Authors: François Remy

Categories: Computation and Language (cs.CL)

Keywords: latent space inspired, multilingual Diagnosable ColBERT, clinical latent space, Diagnosable ColBERT, biomedical texts

Abstract:ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3, and combines synthetic clinical notes, patient-doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by finetuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, and achieves state-of-the-art multilingual entity recall, while placing in the top 5 overall across all entity types and languages in character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded from this https URL.
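To illustrate what a BIO tagger's output looks like once decoded into entity spans, here is a minimal, generic sketch. It is not the shared-task system: the function name `bio_to_spans` and the label names `SYMPTOM` and `PROCEDURE` are hypothetical, and stray `I-` tags without a matching open span are simply ignored, which is one of several common conventions.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def bio_to_spans(tags: Sequence[str]) -> List[Span]:
    """Decode a BIO tag sequence into labelled token spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel flushes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tags = ["O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-PROCEDURE"]
print(bio_to_spans(tags))  # [('SYMPTOM', 1, 3), ('PROCEDURE', 4, 5)]
```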

41. 【2605.28512】On Compositional Learning Behaviours in Formal Mathematics

Link: https://arxiv.org/abs/2605.28512

Authors: Kevin Yandoka Denamganaï

Categories: Computation and Language (cs.CL)

Keywords: Compositional Learning Behaviours, require Compositional Learning, Self-evolving scientific agents, mathematics require Compositional, Compositional Learning

Comments: Work in progress, under review

Abstract:Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs): the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean 4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F $75\%$) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, our results show that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.
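The logic of an exact permutation test for a "containment" claim like the one above can be sketched as follows. Under the null hypothesis that tier membership is unrelated to CLB rank, the chance that all tier members land in the top k of n models is a ratio of binomial coefficients. The function name is hypothetical, and the number of models in the Olympiad-level tier is not stated in the abstract; the value of five used below is an assumption chosen only because it yields a probability that rounds to the reported 0.004.

```python
from math import comb


def containment_p_value(n_models: int, top_k: int, n_tier: int) -> float:
    """P(all n_tier tier members fall within the top_k scorers by chance).

    Counts the label assignments that place every tier member in the top_k,
    divided by all ways to choose n_tier tier members among n_models.
    """
    if n_tier > top_k:
        return 0.0
    return comb(top_k, n_tier) / comb(n_models, n_tier)


# Ten provers, top five CLB scorers, and an ASSUMED tier size of five.
print(round(containment_p_value(10, 5, 5), 3))  # 0.004
```

With a smaller assumed tier (for example four models) the same calculation gives about 0.024, so the tier size matters for the strength of the evidence.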

42. 【2605.28500】Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Link: https://arxiv.org/abs/2605.28500

Authors: Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: shown impressive capabilities, Large language models, produce functionally incorrect, Large language, shown impressive

Abstract:Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
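The idea of an entropy over functional-equivalence clusters can be sketched as follows. This is a hedged illustration, not the paper's method: it assumes the discrete, frequency-based form of semantic entropy, the names `cluster` and `functional_entropy` are hypothetical, and the equivalence check used in the example (agreement on a few probe inputs) is a stand-in for the LLM-based functional equivalence assessment described in the abstract.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def cluster(samples: Sequence[T], equivalent: Callable[[T, T], bool]) -> List[List[T]]:
    """Greedily group samples, comparing each to a cluster's first member."""
    clusters: List[List[T]] = []
    for s in samples:
        for c in clusters:
            if equivalent(c[0], s):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def functional_entropy(samples: Sequence[T], equivalent: Callable[[T, T], bool]) -> float:
    """Entropy (nats) of the empirical distribution over equivalence clusters."""
    n = len(samples)
    sizes = [len(c) for c in cluster(samples, equivalent)]
    return -sum((k / n) * math.log(k / n) for k in sizes)


# Stand-in equivalence: two candidate programs agree on all probe inputs.
PROBE_INPUTS = [0, 1, 2, 3]


def same_behaviour(f: Callable[[int], int], g: Callable[[int], int]) -> bool:
    return all(f(x) == g(x) for x in PROBE_INPUTS)


# Four sampled "solutions": three double the input, one squares it.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2, lambda x: 2 * x]
print(round(functional_entropy(samples, same_behaviour), 3))  # 0.562
```

If every sample fell into one cluster, as the abstract reports happens with NLI-based clustering of code, the entropy would be zero regardless of whether the code is correct.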

43. 【2605.28494】A new semantically annotated corpus with syntactic-semantic and cross-lingual senses

Link: https://arxiv.org/abs/2605.28494

Authors: Myriam Rakho, Eric Laporte, Matthieu Constant

Categories: Computation and Language (cs.CL)

Keywords: word sense disambiguation, French polysemous verbs, sense disambiguation, sense-tagged corpus, word sense

Abstract:We describe a new sense-tagged corpus for word sense disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables), and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.

44. 【2605.28484】Comonadic Morphophonology: A Compositional Framework for Context-Dependent Morphological Rules in Finnish

Link: https://arxiv.org/abs/2605.28484

Authors: Yongseok Jang

Categories: Computation and Language (cs.CL)

Keywords: Composing finite-state transducers, multiplicative state explosion, neural models sidestep, possessive suffix assimilation, Composing finite-state

Comments: 13 pages. Accepted at the Society for Computation in Linguistics (SCiL) 2026

Abstract:Composing finite-state transducers (FSTs) for context-dependent morphophonological rules (consonant gradation, vowel harmony, possessive suffix assimilation) leads to multiplicative state explosion; neural models sidestep the problem but provide no formal account of the rules themselves. We present the first framework where each morphophonological rule is a function from a focused local context to a single output segment (the type of a local rule familiar from cellular automata) and where length-changing rules compose as coKleisli arrows of a comonad. Our central contribution is the Writer comonad (DeletionSet x Zipper), a new algebraic construction that restores strict coKleisli compositionality for such rules: each rule is a coKleisli arrow, extend lifts it to a global transformation, and deletions accumulate as a monoid action rather than requiring intermediate materialization. As supporting evidence, thirteen coKleisli arrows provide an alternative formulation expressing the same morphophonological behaviors that Omorfi encodes via 874 continuation classes (67:1 reduction at the rule-representation level), and the same abstraction enables bidirectional morphology: a MorphGenerator reuses the analysis arrows for generation. On UD Finnish-TDT, the system achieves 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), validating the framework as a practical morphological engine.
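The notion of a local rule, from a focused context to one output segment, lifted to a whole word by `extend`, can be sketched as follows. This is a simplified illustration, not the paper's system: it covers only a plain list zipper (not the Writer comonad with deletions), returns a flat list instead of a zipper of results, and the `vowel_harmony` rule with its archiphoneme symbol `A` is a hypothetical toy version of Finnish vowel harmony.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Zipper:
    left: Tuple[str, ...]   # segments before the focus, in word order
    focus: str              # the segment the rule is deciding on
    right: Tuple[str, ...]  # segments after the focus


def refocusings(segments: Sequence[str]) -> List[Zipper]:
    """All zippers over the word, one focused on each position."""
    segs = tuple(segments)
    return [Zipper(segs[:i], segs[i], segs[i + 1:]) for i in range(len(segs))]


def extend(rule: Callable[[Zipper], str], segments: Sequence[str]) -> List[str]:
    """Lift a local rule to the whole word by applying it at every focus."""
    return [rule(z) for z in refocusings(segments)]


BACK, FRONT = set("aou"), set("äöy")


def vowel_harmony(z: Zipper) -> str:
    """Resolve the archiphoneme 'A' from the nearest harmonic vowel to the left."""
    if z.focus != "A":
        return z.focus
    for seg in reversed(z.left):
        if seg in BACK:
            return "a"
        if seg in FRONT:
            return "ä"
    return "ä"  # only neutral vowels seen: default to the front variant


print("".join(extend(vowel_harmony, "talossA")))  # talossa
print("".join(extend(vowel_harmony, "kylässA")))  # kylässä
```

Because every rule has the same type, rules of this shape can be chained by applying `extend` repeatedly, which is the compositionality the abstract attributes to coKleisli arrows.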

45. 【2605.28465】Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

Link: https://arxiv.org/abs/2605.28465

Authors: Jihyeong Park, Ingeol Baek, Jeonghyun Park, Hwanhee Lee

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, single-turn text generations, iterative interaction

Comments: 28 pages, 16 figures, 19 tables

Abstract:Divergent thinking is a core dimension of creativity, yet existing evaluations of Large Language Models (LLMs) treat them as single-turn text generations, failing to capture how an agent reasons through iterative interaction. To address this, we introduce MUTATE, an interactive benchmark designed to evaluate agentic divergent thinking at two levels: path-level, where an agent discovers multiple alternative paths to the same goal, and action-level, where individual actions require non-typical, mechanism-shifting object uses. Unlike success-only evaluations, MUTATE scores both completed paths and off-path attempts, capturing divergent reasoning that conventional success rates discard. Our experiments with frontier LLMs reveal a structural blind spot in existing frameworks: when exposed to immediate convergence pressure, they tend to fall into immediate action fixation, failing to improve action-level divergence. To overcome this, we propose ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA significantly outperforms prior methods across both divergence levels and generalizes effectively to an external creativity environment. We also confirm its success stems from a qualitative enhancement of resilient divergent reasoning rather than simple environmental exploration.

46. 【2605.28464】The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

Link: https://arxiv.org/abs/2605.28464

Authors: Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang, Hui Huang, Qianru Wang, Chuan Xiao, Jianbin Qin, Shuyuan Zheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Legal Judgment Prediction, Judgment Prediction, criminal legal domain, Legal Judgment, formally indicted

Comments: 24 pages, 5 figures, 22 tables

Abstract:Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases that have already passed prosecutorial review and been formally indicted. As a result, LJP leaves a substantial blind spot in assessing criminal liability, overlooking cases involving insufficient evidence, no criminal liability, or guilt exempted from punishment. To fill this gap, we propose Prosecution Decision Prediction (PDP), the first Legal AI task built around prosecutorial review, which classifies each case into prosecution or one of three non-prosecution decisions and reflects legal AI's capabilities in evidence evaluation, legal subsumption, and value-based discretion. We further construct PDP-Bench, a benchmark of 4,630 real Chinese prosecutorial decisions spanning 190 charges. Extensive experiments show that state-of-the-art LLMs perform substantially worse on PDP than on LJP and that mainstream enhancement routes fail to close the gap. Moreover, controlled RLVR interventions show that simple outcome rewards fail to produce generalizable PDP discrimination.

47. 【2605.28440】AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

Link: https://arxiv.org/abs/2605.28440

Authors: Shaolong Chen, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: widely adopted alternative, alternative to RLHF, RLHF for aligning, separate reward model, widely adopted

Comments: 5 figures

Abstract:DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, causing the model to learn to avoid bad answers rather than to generate good ones. We propose AdaDPO, a Self-Adaptive variant of the DPO algorithm that introduces per-preference-pair, stop-gradient-based coefficients derived directly from the policy model's generation probabilities, with the reference model's probabilities as an optional component. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities; the practical implementation balances per-token gradients and applies a numerical clipping bound for stability, while retaining DPO's original hyperparameter structure. On Llama-3-8B-Instruct trained on UltraFeedback under a SimPO-like setup, AdaDPO consistently outperforms DPO on AlpacaEval 2: it achieves higher length-controlled win rates (LC) in 81% of hyperparameter combinations, attains the global best LC (48.3%) and raw win rate (46.1%), and enlarges the LC-over-WR margin in 88% of combinations, indicating effective mitigation of length bias. Additional analyses on KL divergence, reward margin, and reward accuracy confirm that AdaDPO rectifies the gradient imbalance and yields more efficient optimization. Because it operates purely at the loss level, AdaDPO can be dropped into existing preference-based alignment pipelines without changing data collection or model architectures. The method requires only a few lines of code, and the same self-adaptive principle generalizes to a broad family of pairwise contrastive preference losses including SimPO, R-DPO, IPO, CPO, and ORPO.
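The gradient asymmetry mentioned above can be seen from a short derivation on the standard DPO loss. The notation ($y_w$ for the preferred response, $y_l$ for the dispreferred one, $z$ for the scaled log-ratio margin) is chosen here for illustration, and the derivation covers plain DPO only; it does not reproduce AdaDPO's coefficients, which the abstract does not specify.

```latex
% Standard DPO loss for one preference pair (x, y_w, y_l)
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma(z), \qquad
z = \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
               - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right]

% Using d(-\log\sigma(z))/dz = -\sigma(-z) and the chain rule:
\frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \pi_\theta(y_w \mid x)}
  = -\frac{\beta\,\sigma(-z)}{\pi_\theta(y_w \mid x)}, \qquad
\frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \pi_\theta(y_l \mid x)}
  = +\frac{\beta\,\sigma(-z)}{\pi_\theta(y_l \mid x)}

% Ratio of gradient magnitudes
\left| \frac{\partial \mathcal{L}_{\mathrm{DPO}} / \partial \pi_\theta(y_l \mid x)}
            {\partial \mathcal{L}_{\mathrm{DPO}} / \partial \pi_\theta(y_w \mid x)} \right|
  = \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
```

Whenever the policy already assigns the dispreferred response a lower probability than the preferred one, this ratio exceeds 1, so the loss pushes the dispreferred probability down faster than it pulls the preferred probability up.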

48. 【2605.28438】Breaking the Script Barrier: Enabling Automatic Alignment for PoS-based ASR Error Analysis in Non-Latin Scripts

Link: https://arxiv.org/abs/2605.28438

Authors: Prasenjit K Mudi, Dahlia Devapriya, Sheetal Kalyani

Categories: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Word Error Rate, Automatic Speech, Speech Recognition, Error Rate

Abstract:Automatic Speech Recognition (ASR) systems are commonly evaluated using aggregate metrics such as Word Error Rate (WER), which do not capture the linguistic structure of errors. Fine-grained analysis, such as Part-of-Speech (PoS)-wise error characterization, requires accurate alignment between ASR hypotheses and reference transcriptions. However, existing alignment tools are often unreliable for languages written in non-Latin scripts. In this work, we address this gap by proposing a robust, automated, language-agnostic alignment mechanism applicable across ASR architectures and across languages written in both Latin and non-Latin scripts. This enables consistent alignment of hypotheses, references, and evaluation sequences, forming the basis for downstream linguistic analysis. Building on this, we employ standard PoS taggers to perform scalable and reproducible PoS-wise error analysis. Notably, we perform alignment and downstream ASR error analysis across three major segmented writing systems, namely, Abugida (Tamil, Hindi, Kannada), Alphabetic (English, Russian, Greek), and Abjad (Arabic). We further demonstrate how such error information can be leveraged during ASR training to improve metrics such as WER.
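The kind of token-level alignment that PoS-wise error analysis depends on can be illustrated with a generic edit-distance alignment. The abstract does not describe the authors' mechanism, so this sketch swaps in standard Levenshtein alignment over whitespace-separated tokens; the function name `align` is hypothetical. Because it compares Unicode strings for equality only, it makes no assumption about the script.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (reference token, hypothesis token)


def align(ref: List[str], hyp: List[str]) -> List[Pair]:
    """Align two token lists with minimum edit distance.

    (r, h) is a match or substitution, (r, None) a deletion,
    and (None, h) an insertion.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, preferring the diagonal on ties.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Russian example: the hypothesis drops the middle word.
print(align("я иду домой".split(), "я домой".split()))
# [('я', 'я'), ('иду', None), ('домой', 'домой')]
```

Once each reference token is paired with a hypothesis token or a gap, a PoS tag attached to the reference token can be used to tally substitutions and deletions per PoS class.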

49. 【2605.28433】Roles with Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning

Link: https://arxiv.org/abs/2605.28433

Authors: Ling-Yue Ge, Lan-Zhe Guo

Categories: Computation and Language (cs.CL)

Keywords: Role-based LLM multi-agent, including capability coverage, carry structural obligations, LLM multi-agent systems, Role-based LLM

Comments: 33 pages, 23 figures, 12 tables

Abstract:Role-based LLM multi-agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, validation, final-answer aggregation, and parser-compatible output protocols. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation to induce role drift, removing structurally necessary roles and breaking answer contracts. We formulate this as contract-preserving role evolution, requiring every committed edit to preserve five structural contracts (capability, communication, validation, aggregation, output protocol). We instantiate this formulation in SERO, a Self-Evolving Role Orchestration framework that evolves a typed role-card pool through credit-guided retrieval, a credit-ranked communication DAG with a protected terminal aggregator and conditional validator repair, and a contextual-bandit controller whose LLM-proposed edits are committed only when they preserve the contracts and improve task score. Experiments on real-world reasoning benchmarks across three LLM backbones confirm the value of contract-preserving role evolution.
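The commit rule described above, where an edit is accepted only if it preserves all five contracts and improves the task score, can be sketched as a simple gate. Everything here is hypothetical scaffolding for illustration: `RolePool`, `commit_edit`, the toy contract checks, and the toy scoring function are not taken from SERO.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

CONTRACTS = ("capability", "communication", "validation", "aggregation", "output_protocol")


@dataclass(frozen=True)
class RolePool:
    roles: FrozenSet[str]


ContractCheck = Callable[[RolePool], bool]


def commit_edit(
    current: RolePool,
    candidate: RolePool,
    checks: Dict[str, ContractCheck],
    score: Callable[[RolePool], float],
) -> RolePool:
    """Return the candidate only if it keeps every contract and scores higher."""
    contracts_ok = all(checks[name](candidate) for name in CONTRACTS)
    if contracts_ok and score(candidate) > score(current):
        return candidate
    return current


# Toy checks: two contracts require specific roles, the rest always pass.
checks: Dict[str, ContractCheck] = {name: (lambda pool: True) for name in CONTRACTS}
checks["aggregation"] = lambda pool: "aggregator" in pool.roles
checks["validation"] = lambda pool: "validator" in pool.roles

base = RolePool(frozenset({"solver", "validator", "aggregator"}))
drifted = RolePool(frozenset({"solver", "critic", "validator"}))  # aggregator removed
extended = RolePool(frozenset({"solver", "critic", "validator", "aggregator"}))


def toy_score(pool: RolePool) -> float:
    return 1.0 if "critic" in pool.roles else 0.5


print(commit_edit(base, drifted, checks, toy_score) == base)       # True: rejected
print(commit_edit(base, extended, checks, toy_score) == extended)  # True: committed
```

The first edit is rejected even though it would raise the toy score, which is the role-drift failure the gate is meant to prevent.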

50. 【2605.28424】Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2605.28424

Authors: Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu, Xunliang Cai, Xiang Li, Weining Qian

Categories: Computation and Language (cs.CL)

Keywords: Equipping large language, large language models, enabling autonomous agents, Equipping large, solve complex tasks

Abstract:Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, while using diagnostic probing on easy tasks to penalize shortcuts and enforce specific skill utilization. Experiments on ALFWorld and WebShop demonstrate that Skill0.5 outperforms both memory-based and skill-based RL baselines, yielding performance improvements across both in-distribution and out-of-distribution scenarios.

51. 【2605.28389】FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

Link: https://arxiv.org/abs/2605.28389

Authors: Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

Categories: Computation and Language (cs.CL)

Keywords: made significant progress, large language models, large language, made significant, significant progress

Abstract:While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%-71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.

52. 【2605.28375】PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature

Link: https://arxiv.org/abs/2605.28375

Authors: An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa

Categories: Computation and Language (cs.CL)

Keywords: fatal neurodegenerative disorders, nonspecific clinical presentations, rapidly progressive, difficult to diagnose, prion disease clinical

Comments: 29 pages, 5 figures, accepted at ACL 25th Workshop on Biomedical Language Processing (BioNLP 2026)

Abstract:Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at this https URL.

53. 【2605.28365】Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Link: https://arxiv.org/abs/2605.28365

Authors: Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: missing library fact, judge natural-language mathematical, natural-language mathematical answers, Lean is increasingly, library fact

Abstract:Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is, the proof-winning answer is correct 96% of the time at high proved coverage but only 20% at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
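The "certify a bound or abstain" behaviour can be illustrated with a generic selective-risk check. This is not COVCAL: it swaps in a plain Hoeffding upper bound with a Bonferroni correction over candidate thresholds, and the function names, the score convention (higher means more trustworthy), and all numbers in the example are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def hoeffding_upper(errors: int, n: int, delta: float) -> float:
    """Upper confidence bound on the true error rate, valid with prob. >= 1 - delta."""
    return errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def select_threshold(
    scores: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
    alpha: float,
    delta: float,
) -> Optional[float]:
    """Lowest threshold whose certified selective risk is <= alpha, else None.

    Selective risk is the error rate among accepted items (score >= threshold),
    estimated on a held-out calibration set. Returning None means "abstain".
    """
    if not thresholds:
        return None
    delta_each = delta / len(thresholds)  # Bonferroni correction
    for t in sorted(thresholds):
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        if not accepted:
            continue
        errors = accepted.count(False)
        if hoeffding_upper(errors, len(accepted), delta_each) <= alpha:
            return t
    return None


# Dense signal: 200 accepted calibration items, 10 of them wrong.
dense_scores = [0.9] * 200
dense_correct = [False] * 10 + [True] * 190
print(select_threshold(dense_scores, dense_correct, [0.5, 0.8], alpha=0.2, delta=0.1))  # 0.5

# Sparse signal: only 10 items, all correct, yet the bound is too loose to certify.
print(select_threshold([0.9] * 10, [True] * 10, [0.5, 0.8], alpha=0.2, delta=0.1))  # None
```

The second call mirrors the abstract's point about coverage: with too few usable calibration points, a conservative bound cannot certify anything, even when every observed answer is correct.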

54. 【2605.28363】PubMedCausal: A Span-Level Annotated Corpus for Causal Relation Extraction in Biomedical Text

Link: https://arxiv.org/abs/2605.28363

Authors: Ifeoluwa Kunle-John, Josiah Paul, Oluwatosin Agbaakin, Peter Aina, Ikenna Odezuligbo, Sydney Anuyah

Categories: Computation and Language (cs.CL)

Keywords: explicit causal cues, biomedical text mining, Causal, biomedical CRE, broader associations

Comments: Submitted to EMNLP 2026, 8 pages, 23-page appendix

Abstract:Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span-level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph-level rows, including 3,945 causal rows and 6,491 adjudicated cause-effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full-span causal extraction. We benchmark discriminative encoders and open-source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F$_1$ score of 0.7391. For span-level extraction, the best generative baseline is DeepSeek-R1-32B with few-shot prompting, reaching a Cosine Pair F$_1$ of 0.6765. We further test transfer learning by evaluating PubMedCausal-trained encoders on external causal relation datasets, showing that the resource supports cross-dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity. Code and data are available at this https URL.

55. 【2605.28346】When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

Link: https://arxiv.org/abs/2605.28346

Authors: Marcell Fekete, Johannes Bjerva, Tamás Káldi

Categories: Computation and Language (cs.CL)

Keywords: Vision-language models, discourse-appropriate form, increasingly evaluated, Vision-language, distinguish discourse-old Topics

Abstract:Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

56. 【2605.28315】HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

Link: https://arxiv.org/abs/2605.28315

Authors: Zheng Li, Mao Zheng, Mingyang Song, Tianxiang Fei

Categories: Computation and Language (cs.CL)

Keywords: General-purpose machine translation, modern large language, General-purpose machine, large language models, language models cluster

Abstract:General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84,566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at this https URL.

57. 【2605.28313】Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

Link: https://arxiv.org/abs/2605.28313

Authors: Nicolás Benjamín Ocampo, Agnes Paullate Nyiranziza, Davide Ceolin

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, demonstrated remarkable

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought prompting to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic), and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our insights show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen's $\kappa$ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.
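
For readers unfamiliar with the Bradley-Terry step mentioned in the abstract, here is a minimal sketch of how pairwise preferences can be turned into latent strength scores. It uses the standard minorization-maximization update, which may differ from the authors' implementation (the abstract does not describe it). The function name `fit_bradley_terry` and the toy comparisons are illustrative assumptions.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every item wins at least once and no item is compared
    with itself; otherwise the maximum-likelihood estimate degenerates.
    Returns a dict mapping item -> strength, normalized to sum to 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)  # unordered pair -> number of comparisons
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Sum n_ij / (p_i + p_j) over every opponent j of item i.
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        updated = {i: s / total for i, s in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy data (hypothetical): each pair of arguments is compared four times.
toy = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
scores = fit_bradley_terry(toy)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Sorting the arguments by the fitted strengths gives the kind of ranking that can then be correlated with one derived from expert annotations.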

58. 【2605.28308】HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

Link: https://arxiv.org/abs/2605.28308

Authors: Yoonjin Jang, Junwoo Kim, Youngjoong Ko

Categories: Computation and Language (cs.CL)

Keywords: Entity Alignment, knowledge graph, relational structure, essential for knowledge, exploit name overlap

Comments: 10 pages, 3 figures, 9 tables. Code and benchmarks available at [this https URL](https://github.com/Wnsdnl/HELEA)

Abstract:Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that simultaneously yields quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train), by mining same-name but distinct entity pairs from KG name-collision groups. We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, while HELEA achieves F1 0.967 on DW-HN29K while maintaining Hit@1 0.993 on standard DW-15K.

59. 【2605.28306】Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Link: https://arxiv.org/abs/2605.28306

Authors: Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: efficient LLM scaling, LLM scaling, efficient LLM, tasks remains challenging, remains challenging

Abstract:Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.

60. 【2605.28305】Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

Link: https://arxiv.org/abs/2605.28305

Authors: Yahan Yu, Noa Nakanishi, Fei Cheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, produce explicit reflective, explicit reflective traces, Language Models

Comments: 15 pages, 12 figures

Abstract:Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and alternatively. Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, which leaves the risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and role in the reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.

61. 【2605.28295】Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Link: https://arxiv.org/abs/2605.28295

Authors: Soeun Kim, Albert No

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reinforcement Learning, Verifiable Rewards, Learning with Verifiable, alternative reasoning paths, labeled trajectories

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
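
The first-token diversification idea can be illustrated with a small sketch: take the policy's top-N first-token candidates and spread a rollout group evenly across them. This is only one possible reading of the abstract, not the REFT implementation. The function name `allocate_first_tokens`, the round-robin allocation, and the toy distribution are assumptions.

```python
def allocate_first_tokens(first_token_probs, top_n, group_size):
    """Assign a forced first token to each rollout in a group.

    first_token_probs: dict mapping candidate token -> policy probability.
    The top_n most probable tokens are treated as equally eligible and
    rollouts are distributed across them round-robin, so each candidate
    receives an (almost) equal share regardless of its probability.
    """
    candidates = sorted(
        first_token_probs, key=first_token_probs.get, reverse=True
    )[:top_n]
    return [candidates[i % len(candidates)] for i in range(group_size)]


# Hypothetical, sharply peaked first-token distribution.
probs = {"The": 0.90, "Let": 0.05, "We": 0.03, "First": 0.02}
assignment = allocate_first_tokens(probs, top_n=2, group_size=8)
print(assignment)
```

Each rollout would then continue sampling from the policy as usual after its assigned first token, with the rest of the pipeline unchanged.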

62. 【2605.28292】CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

Link: https://arxiv.org/abs/2605.28292

Authors: Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park

Categories: Computation and Language (cs.CL)

Keywords: large language models, reduces the inference, inference cost, cost of large

Comments: 17 pages, 7 figures

Abstract:Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.

63. 【2605.28283】PrunePath: Towards Highly Structured Sparse Language Models

Link: https://arxiv.org/abs/2605.28283

Authors: Zhexuan Gu, Zixun Fu, Yancheng Yuan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: inference efficiency gains, hardware-friendly inference efficiency, Feed-forward networks, dominate the parameter, efficiency gains

Abstract:Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
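
A cumulative-mass threshold over a softmax routing distribution can be sketched as below, by analogy with nucleus (top-p) sampling. This illustrates the general idea only. The paper's exact selection rule is not given in the abstract, and the function name `select_experts` and the example logits are assumptions.

```python
import math


def select_experts(router_logits, mass_threshold):
    """Pick the smallest set of experts whose softmax mass reaches a budget.

    router_logits: per-expert routing scores for a single token.
    mass_threshold: probability budget in (0, 1]; lowering it activates
    fewer experts and therefore increases sparsity.
    Returns the indices of the activated experts, most probable first.
    """
    # Numerically stable softmax over the expert logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for idx in order:
        chosen.append(idx)
        mass += probs[idx]
        if mass >= mass_threshold:
            break
    return chosen


# A confident token activates few experts; an uncertain one activates more.
print(select_experts([5.0, 0.0, 0.0, 0.0], mass_threshold=0.9))
print(select_experts([0.0, 0.0, 0.0, 0.0], mass_threshold=0.9))
```

Because the budget is expressed in probability mass rather than as a fixed expert count, the number of active experts adapts per token, and the threshold acts as a single sparsity knob at inference time.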

64. 【2605.28255】AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Link: https://arxiv.org/abs/2605.28255

Authors: Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: systems are fallible, humans, make mistakes, deciding, collaboration

Comments: Findings of the Association for Computational Linguistics, 2026

Abstract:AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice (deciding when to let AI act autonomously without knowing its output) and the adoption choice (evaluating AI suggestions and deciding how to use them). Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human-AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human-AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

65. 【2605.28253】Building Community-Centred NLP Resources for Puno Quechua

Link: https://arxiv.org/abs/2605.28253

Authors: Elwin Huaman, Adrian Gamarra Lafuente, Johanna Cordova, Anna Korhonen

Categories: Computation and Language (cs.CL); Databases (cs.DB); Human-Computer Interaction (cs.HC)

Keywords: under-resourced languages requires, languages requires digital, requires digital tools, Puno Quechua, dedicated ASR resources

Comments: Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026), co-located with ACL 2026

Abstract:The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of recordings of scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.

66. 【2605.28228】When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions

Link: https://arxiv.org/abs/2605.28228

Authors: Jiajie Yang, Yangchun Li, Guanyi Chen, Rui Fan, Xin Bai, Tingting He

Categories: Computation and Language (cs.CL)

Keywords: Emotional Support Dialogue, Support Dialogue Systems, Support Dialogue, Dialogue Systems, Balanced Emotional Support

Abstract:Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.

67. 【2605.28227】Why We Need Speech to Evaluate Speech Translation

Link: https://arxiv.org/abs/2605.28227

Authors: Maike Züfle, Danni Liu, Vilém Zouhar, Jan Niehues

Categories: Computation and Language (cs.CL)

Keywords: evaluation metrics remain, metrics remain blind, preserving speech-specific information, quality estimation, increasingly capable

Abstract:Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.

68. 【2605.28225】Supervised Semantic Differential for Cross-Cultural Concept Analysis: A Case Study of Human Affect

Link: https://arxiv.org/abs/2605.28225

Authors: Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki

Categories: Computation and Language (cs.CL)

Keywords: Supervised Semantic Differential, supervised semantic gradients, Supervised Semantic, estimates supervised semantic, word-level translation

Comments: 9 pages, 2 figures, excluding the appendices. Code to reproduce our results is available at [this https URL](https://github.com/przebor/Cross-Cultural-SSD)

Abstract:Cross-cultural comparison of psychological meaning requires methods that go beyond word-level translation and examine how semantic dimensions are organized across languages. We introduce a cross-lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross-lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Several clusters also reflected corpus-specific artifacts, underscoring the need for cautious interpretation. Cross-lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross-cultural differences in psychological meaning.

69. 【2605.28222】Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

Link: https://arxiv.org/abs/2605.28222

Authors: Evgenii Palnikov, Elizaveta Gavrilova

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: documentation-grounded retrieval-augmented generation, Low-Rank Adaptation, Reciprocal Rank Fusion, retrieval-augmented generation, documentation-grounded retrieval-augmented

Comments: 13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at [this https URL](https://github.com/EugPal/rag-lora-tradeoffs)

Abstract:We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at this https URL.
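
Reciprocal Rank Fusion, the step that merges the dense and sparse result lists in a hybrid pipeline like the one described, can be sketched in a few lines. This is a generic illustration rather than the authors' code. The constant `k=60` is a commonly used default that the abstract does not confirm, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID so the
    result is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results from a dense and a sparse retriever.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only rank positions, it needs no score calibration between the dense and sparse retrievers. The fused list would then typically be passed to the cross-encoder reranker.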

70. 【2605.28218】IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

Link: https://arxiv.org/abs/2605.28218

Authors: Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song

Categories: Computation and Language (cs.CL)

Keywords: Modern translation workflows, translation workflows demand, Modern translation, workflows demand, JSON or HTML

Comments: 11 pages, 6 figures, conference

Abstract:Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction-following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction-following rankings correlate only weakly with translation behavior. Our benchmark is available at this https URL.

71. 【2605.28215】Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

Link: https://arxiv.org/abs/2605.28215

Authors: Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)

Keywords: enables multimodal large, multimodal large language, In-context learning, enables multimodal, multimodal large

Comments: Accepted to the CompLearn Workshop at ICML 2026

Abstract:In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

72. 【2605.28211】When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

Link: https://arxiv.org/abs/2605.28211

Authors: Maike Züfle, Jan Niehues

Categories: Computation and Language (cs.CL)

Keywords: users supply context, SpeechLLMs are increasingly, standard practice, users supply, fine-tune on proprietary

Abstract:SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.

73. 【2605.28207】Pruning and Distilling Mixture-of-Experts into Dense Language Models

Link: https://arxiv.org/abs/2605.28207

Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: frontier language models, loaded in memory, memory-constrained deployment, frontier language, preferable for memory-constrained

Abstract:Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.

74. 【2605.28190】The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

Link: https://arxiv.org/abs/2605.28190

Authors: Manuel Frank, Haithem Afli

Categories: Computation and Language (cs.CL)

Keywords: MTEB report, implicitly treating robustness, scalar property, Harder Text Embedding, Text Embedding Benchmark

Comments: 29 pages, 11 figures

Abstract:Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.

75. 【2605.28188】Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

Link: https://arxiv.org/abs/2605.28188

Authors: Seojin Hwang, Minju Kim, Junhyuk Choi, JeongHyun Park, Hwanhee Lee

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, high-stakes decision-making settings, factually equivalent inputs, Language Models

Comments: 29 pages, 7 figures, 31 tables

Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model's value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.
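The decision flip rate quoted above can be illustrated with a small calculation. This is only a sketch under assumptions: the abstract does not define the metric precisely, so the function below assumes a flip means "the decision under a reframed input differs from the decision under a neutral baseline framing of the same case". The `decision_flip_rate` name, the `"neutral"` label, and the record layout are hypothetical.

```python
from collections import defaultdict


def decision_flip_rate(records, baseline="neutral"):
    """Fraction of reframed inputs whose decision differs from the baseline.

    `records` is an iterable of (case_id, framing, decision) tuples.
    Assumed definition, not taken from the paper.
    """
    by_case = defaultdict(dict)
    for case_id, framing, decision in records:
        by_case[case_id][framing] = decision

    flips = 0
    total = 0
    for decisions in by_case.values():
        reference = decisions.get(baseline)
        if reference is None:
            continue  # no baseline decision to compare against
        for framing, decision in decisions.items():
            if framing == baseline:
                continue
            total += 1
            if decision != reference:
                flips += 1
    return flips / total if total else 0.0


records = [
    ("case-1", "neutral", "liable"),
    ("case-1", "vivid", "not liable"),   # flip
    ("case-1", "value-tinted", "liable"),
    ("case-2", "neutral", "liable"),
    ("case-2", "vivid", "liable"),
]
print(decision_flip_rate(records))
```

In this toy input one of the three reframed decisions differs from its baseline, so the rate is one third.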

76. 【2605.28183】BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

Link: https://arxiv.org/abs/2605.28183

Authors: Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: German Law, Benchmark for German, subsumption-based legal reasoning, evaluating LLM systems, Law

Comments: Pre-Print

Abstract:We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems (closed flagship, efficiency-oriented, and open-weight) across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs. r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human-AI co-creation substantially outperforms unaided human work.

77. 【2605.28181】When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

Link: https://arxiv.org/abs/2605.28181

Authors: Jungwon Park, Jimyeong Kim, Jungmin Ko, Nojun Kwak, Wonjong Rhee

Categories: Computation and Language (cs.CL)

Keywords: Diffusion language models, central inference-time decision, iteratively denoising masked, Diffusion language, language models decode

Comments: Preprint

Abstract:Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.

78. 【2605.28179】SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

Link: https://arxiv.org/abs/2605.28179

Authors: Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang

Categories: Computation and Language (cs.CL)

Keywords: laws guide large, guide large language, Scaling laws guide, large language model, laws guide

Abstract:Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches face generalization limitations from two aspects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.

79. 【2605.28163】DEPART: DEcomposing PARiTy across Multilingual LLMs

Link: https://arxiv.org/abs/2605.28163

Authors: Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: leaderboards report per-language, leaving systemic biases, report per-language accuracy, systemic biases unattributed, Multilingual Large Language

Abstract:Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal-Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

80. 【2605.28142】Self-Consistency via Marginal Sharpening

Link: https://arxiv.org/abs/2605.28142

Authors: Aleksei Arzhantsev, Otmane Sakhi, Nicolas Chopin

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: elicit strong reasoning, strong reasoning abilities, additional training, elicit strong, abilities from language

Abstract:Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
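To make the "sharpened answer marginal" concrete, the sketch below estimates the answer marginal from a batch of sampled completions by counting final answers, raises it to a power, and renormalizes. This is not the paper's sampling algorithm, which the abstract does not spell out. It only illustrates the target distribution, and the function name and the `alpha` exponent are assumptions.

```python
from collections import Counter


def sharpened_answer_marginal(answers, alpha=2.0):
    """Empirical answer marginal raised to the power `alpha`, renormalized.

    `answers` holds the final answer extracted from each sampled completion;
    the reasoning traces are marginalized out simply by ignoring them.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    n = len(answers)
    weights = {a: (c / n) ** alpha for a, c in counts.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Three of four sampled reasoning paths end in "42".
dist = sharpened_answer_marginal(["42", "42", "42", "41"], alpha=2.0)
print(dist)
```

With `alpha = 1` this reduces to plain answer frequencies; larger `alpha` concentrates mass on answers supported by many sampled paths, and majority voting corresponds to the limit of very large `alpha`.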

81. 【2605.28131】Better heads do not guarantee better binarized constituency parsing

Link: https://arxiv.org/abs/2605.28131

Authors: Zeyao Qi, Yige Chen, Eitan Klinger, Vivaan Wadhwa, Jungyeul Park

Categories: Computation and Language (cs.CL)

Keywords: binary parser supervision, revisit punctuation-aware tree, improves binary parser, parser supervision, dependency-induced headedness improves

Abstract:We revisit punctuation-aware tree binarization for constituency parsing and ask whether dependency-induced headedness improves binary parser supervision. Although learned heads substantially outperform rule-based heads in intrinsic head prediction, they do not yield consistent parsing gains after debinarization. In particular, punctuation-conditioned evaluation shows that learned headedness underperforms rule-based binarization in macro-average punctuation-sensitive $F_1$, despite a small overall gain on CTB. Similar instability appears under cross-treebank transfer. These results suggest that linguistically grounded headedness is not necessarily parser-optimal when used as a binarization control signal. The paper presents a negative result: better head prediction does not imply better punctuation-sensitive constituency parsing.

82. 【2605.28128】Chinese Word Boundary Recovery through Character Alignment Projection

Link: https://arxiv.org/abs/2605.28128

Authors: Lusha Wang, Yuchen Li, Su Yuan, Jungyeul Park

Categories: Computation and Language (cs.CL)

Keywords: character-level divergences disrupt, Chinese Penn Treebank, word boundaries assumed, non-standard text, fragile in non-standard

Abstract:Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two-step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.
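The two steps described above (character alignment, then boundary projection) can be sketched with the standard library. This is an illustrative approximation, not the authors' aligner: it uses `difflib.SequenceMatcher` for the character alignment and assumes the target sentence is already segmented. The `project_boundaries` name and the example sentence are hypothetical.

```python
from difflib import SequenceMatcher


def project_boundaries(source, target_words):
    """Segment `source` by projecting word boundaries from a segmented target."""
    target = "".join(target_words)

    # Step 1: character alignment. Map target character indices to source
    # indices for equal spans and for same-length substitutions.
    t2s = {}
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, s0, s1, t0, t1 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and s1 - s0 == t1 - t0):
            for k in range(t1 - t0):
                t2s[t0 + k] = s0 + k

    # Step 2: projection. Each target boundary sits after the last character
    # of a word; carry it over via that character, else via the next one.
    cuts = set()
    offset = 0
    for word in target_words[:-1]:
        offset += len(word)
        if offset - 1 in t2s:
            cuts.add(t2s[offset - 1] + 1)
        elif offset in t2s:
            cuts.add(t2s[offset])

    # Split the source string at the projected cut points.
    words, start = [], 0
    for cut in sorted(cuts):
        if start < cut < len(source):
            words.append(source[start:cut])
            start = cut
    words.append(source[start:])
    return words


# Learner sentence with a character error (平果 for 苹果) and its correction.
print(project_boundaries("我喜欢吃平果", ["我", "喜欢", "吃", "苹果"]))
```

Boundaries that fall inside insertions or deletions with no aligned neighbor are simply dropped here; a real system would need an explicit policy for those spans.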

83. 【2605.28123】Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

Link: https://arxiv.org/abs/2605.28123

Authors: Yuang Huang, Yafeng Zhang, Yu Zilan

Categories: Computation and Language (cs.CL)

Keywords: large vision-language models, remains poorly understood, vision-language models, poorly understood, Prompt-based verification

Comments: 7 pages, 1 figure, submitted to ACL ARR 2026 May (EMNLP)

Abstract:Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit on easier ones and can harm them. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.
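The idea of triggering verification only when the model looks uncertain can be sketched as a simple gate. The abstract does not say which pre-generation signals RSP uses, so the example below assumes one plausible signal, the Shannon entropy of the model's first-token answer distribution, with a hypothetical threshold and verification suffix.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_prompt(question, first_token_probs, threshold=0.5):
    """Append a verification instruction only for uncertain inputs.

    `first_token_probs` stands in for a pre-generation uncertainty signal;
    both the signal and the threshold are assumptions for illustration.
    """
    if entropy(first_token_probs) > threshold:
        return question + " Verify your answer against the image before responding."
    return question


# Confident input: the question is passed through unchanged.
print(choose_prompt("Is there a dog in the image?", [0.99, 0.01]))
# Uncertain input: the verification instruction is appended.
print(choose_prompt("Is there a dog in the image?", [0.5, 0.5]))
```

Because the abstract reports that effective selection signals vary across architectures, the signal and threshold would need to be chosen per model rather than fixed as they are here.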

84. 【2605.28122】SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

Link: https://arxiv.org/abs/2605.28122

Authors: Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: sequence of shell, network actions, quietly exceed, exceed the authorized, Adaptive Reward-guided Elicitation

Abstract:A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
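The budget-steering step can be illustrated with Beta-Bernoulli Thompson sampling, a standard formulation that may differ from the one SNARE uses. In this sketch each scenario is an arm, a "success" means the run triggered overeager behavior, and the real agent run plus oracle is replaced by a simulated coin flip. The `thompson_allocate` name and the trigger probabilities are hypothetical.

```python
import random


def thompson_allocate(trigger_probs, budget, seed=0):
    """Spend `budget` runs across scenarios using Thompson sampling.

    `trigger_probs[i]` is the simulated chance that scenario i elicits
    overeager behavior (a stand-in for running the agent and the oracle).
    Returns the number of runs spent on each scenario.
    """
    rng = random.Random(seed)
    n = len(trigger_probs)
    hits = [0] * n    # runs flagged by the oracle
    misses = [0] * n  # runs that stayed in scope

    for _ in range(budget):
        # Draw one plausible trigger rate per scenario from its Beta posterior.
        samples = [rng.betavariate(1 + hits[i], 1 + misses[i]) for i in range(n)]
        choice = max(range(n), key=samples.__getitem__)
        # Simulated run: did the chosen scenario trigger overeager behavior?
        if rng.random() < trigger_probs[choice]:
            hits[choice] += 1
        else:
            misses[choice] += 1
    return [h + m for h, m in zip(hits, misses)]


runs = thompson_allocate([0.02, 0.70, 0.05], budget=1000)
print(runs)
```

Over time most of the budget flows to the scenario that triggers most often, while the others still receive occasional exploratory runs.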

85. 【2605.28120】LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

Link: https://arxiv.org/abs/2605.28120

Authors: Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: Graph-based Retrieval-Augmented Generation, Graph-based Retrieval-Augmented, Retrieval-Augmented Generation, enabling more coherent, coherent and effective

Comments: 30 pages, 18 figures, ACL 2026 Main Conference. Project page: [this https URL](https://github.com/XMUDeepLIT/LegalGraphRAG)

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

86. 【2605.28116】MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Link: https://arxiv.org/abs/2605.28116

Authors: Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: separate trusted interface, trusted interface elements, Realistic Adversarial GUI, graphical user interface, Mobile graphical user

Abstract:Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

87. 【2605.28112】A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

Link: https://arxiv.org/abs/2605.28112

Authors: Junjie Mu, Qiongxiu Li

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Federated Retrieval-Augmented Generation, Retrieval-Augmented Generation, data remain local, raw data remain, remain local

Comments: Under review. Code available at [this https URL](https://github.com/Junjie-Mu/routing-hijacking-fedrag)

Abstract:Federated Retrieval-Augmented Generation (FedRAG) is attractive for privacy-sensitive applications because raw data remain local. As a result, routing must rely on client-provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing-stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high-stakes MedQA-USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine-robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust-aware post-routing framework that reweights clients using returned-evidence feedback, including retrieval relevance, profile consistency, and cross-client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.

88. 【2605.28108】Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents

Link: https://arxiv.org/abs/2605.28108

Authors: Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi

Categories: Computation and Language (cs.CL)

Keywords: long-lived LLM, long-lived LLM agent, long-lived LLM agents, LLM agents, LLM

Abstract:A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long-lived LLM agents: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents, the impact of this gap grows. We isolate one concrete, controllable slice of this gap as Ask-to-Remember (ATR): the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session with the same user will. ATR is hard even to evaluate: the right question is underdetermined and its payoff deferred to tasks that may never arise. ATRBench, to the best of our knowledge the first ATR benchmark, makes it measurable by fixing each user's preferences as hidden ground truth, so success demands asking, not recall. Across eight frontier LLM agents, defaults fall at least 62 points below an oracle handed the relevant preference, and prompting closes little of it. Diagnostics identify acquisition as the bottleneck. ATRBench surfaces this proactivity gap in current agents and offers a diagnostic testbed for closing it.

89. 【2605.28093】ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering

Link: https://arxiv.org/abs/2605.28093

Authors: Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

Categories: Computation and Language (cs.CL)

Keywords: large language models, enhancing large language, Retrieval-augmented generation, multi-hop question answering, language models

Abstract:Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.

90. 【2605.28084】SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

Link: https://arxiv.org/abs/2605.28084

Authors: Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: complex social signal, conveys communicative intent, intent beyond amusement, Laughter, real-world laughter understanding

Abstract:Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: this https URL.

91. 【2605.28079】ATLAS: All-round Testing of Long-context Abilities across Scales

Link: https://arxiv.org/abs/2605.28079

Authors: Deli Huang, Cunguang Wang, Hongyin Tang, Zhe Tang, Linsen Guo, Dongyu Ru, Ruoshi Yuan, Ziyue Zhu, Xiaoyu Li, Ziwen Wang, Chen Zhang, Anchun Gui, Wen Zan, Jiaqi Zhang, Xuezhi Cao, Jingang Wang, Xunliang Cai, Yixin Cao

Categories: Computation and Language (cs.CL)

Keywords: narrow task family, advertise context windows, evaluations typically report, millions of tokens, task family

Comments: 29 pages, 13 figures. Preprint

Abstract:Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong retrieval need not transfer to downstream use. We present ATLAS, a benchmarking framework that redefines long-context evaluation as length-dependent capability profiling. ATLAS contributes three methodological principles: (i) a layered taxonomy separating foundational operations from application workloads so failures can be attributed, (ii) length-aware AUC scoring that integrates score-length curves over a fixed 8K-1M grid, replacing single-point metrics with full degradation profiles, and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles, with end-to-end uncertainty propagation from subset scores through the nonlinear final aggregate. We instantiate the framework across eight capability dimensions with nine auditable components and 6,438 instances, and evaluate 26 models. Gemini-3.1-Pro-Preview leads at 128K, Claude-Opus-4.6 leads at 1M. Rankings reshuffle substantially between ATLASscore@8K-128K and ATLASscore@8K-1M: 7 models move by at least two ranks, and the two taxonomy layers share only 61% of cross-model variance, with individual rank gaps up to 12 positions. These results support reporting long-context quality by capability and length, not by a single headline score.
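Principles (ii) and (iii) can be illustrated with a short calculation. The abstract does not give the exact formulas, so this sketch assumes a trapezoidal integral over a log2 length axis, normalized so that a flat curve scores its own value, followed by a plain harmonic mean over categories. The function names, grid points, and scores are hypothetical, and uncertainty propagation is not shown.

```python
import math


def length_auc(lengths, scores):
    """Normalized area under the score-vs-log2(length) curve (trapezoid rule)."""
    if len(lengths) < 2 or len(lengths) != len(scores):
        raise ValueError("need at least two matching grid points")
    xs = [math.log2(n) for n in lengths]
    area = sum(
        (xs[i + 1] - xs[i]) * (scores[i] + scores[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


def harmonic_mean(values):
    """Harmonic mean; any non-positive category score collapses it to 0."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


grid = [8_000, 32_000, 128_000, 1_000_000]
retrieval = length_auc(grid, [0.95, 0.93, 0.90, 0.80])  # degrades slowly
reasoning = length_auc(grid, [0.70, 0.55, 0.35, 0.10])  # collapses with length
print(retrieval, reasoning, harmonic_mean([retrieval, reasoning]))
```

For example, category scores of 0.9 and 0.3 have an arithmetic mean of 0.6 but a harmonic mean of 0.45, which is how a harmonic aggregate penalizes an imbalanced profile.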

92. 【2605.28074】SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning

Link: https://arxiv.org/abs/2605.28074

Authors: Jiachen Qian

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: mitigates LLM hallucinations, Coordinated Beam Search, corpus integrity, critical vulnerability, hijacks RAG systems

Comments: 12 pages, 4 figures, KDD '26 camera-ready version

Abstract:Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: corpus integrity. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, while maintaining near-benign perplexity. Cross-model evaluation across four target LLMs shows nontrivial effectiveness under a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combined retrieval-side and generation-side defenses reduce attack success substantially but incur a latency trade-off. Human evaluation shows substantially lower flag rates than disfluent baselines, while remaining numerically more suspicious than benign content at the current sample size.

93. 【2605.28073】StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

Link: https://arxiv.org/abs/2605.28073

Authors: Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: preserving plot consistency, adapt existing narratives, aims to adapt, adapt existing, preserving plot

Comments: 16 pages, 7 figures, 15 tables

Abstract:Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

94. 【2605.28066】PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

Link: https://arxiv.org/abs/2605.28066

Authors: Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable efficacy, current adaptation methods

Abstract:Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

95. 【2605.28062】ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

Link: https://arxiv.org/abs/2605.28062

Authors: Taiheng Pan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: long-term memory retrieval, conversational long-term memory, cross-encoder teacher supervision, reranker for conversational, conversational long-term

Comments: 15 pages. Technical report

Abstract:We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory scores above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency and remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper. Under Stress1000 distractors, the Recall@10 gap widens to 0.081, but ConvMemory still operates at 117x lower latency. These LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory's learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.
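
The abstract reports Recall@10 and MRR without defining them. As a reading aid, here is a minimal sketch of the usual textbook definitions. It assumes each ranking contains unique ids and that relevance is binary; the function names are illustrative, and the paper's exact evaluation protocol may differ.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranking, relevant set) pairs."""
    runs = list(runs)
    return sum(reciprocal_rank(ranking, rel) for ranking, rel in runs) / len(runs)
```

Under these definitions, a reranker can trail a stronger model in MRR while staying close in Recall@10, because Recall@10 ignores the order of items inside the top ten.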

96. 【2605.28060】Challenges in Explaining Pretrained Clinical Text Classifiers

Link: https://arxiv.org/abs/2605.28060

Authors: Kristian Miok, Matej Klemen, Blaz Škrlj, Marko Robnik Šikonja

Categories: Computation and Language (cs.CL)

Keywords: unstructured medical texts, clinical NLP remains, tasks involving long, NLP remains, complex tasks involving

Comments: 9 pages, 7 figures. Accepted at the First Workshop on Responsible Healthcare using Machine Learning (RHCML 2025), co-located with ECML PKDD 2025

Abstract:Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstrations on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in attributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.

97. 【2605.28058】Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis

Link: https://arxiv.org/abs/2605.28058

Authors: Nils Constantin Hellwig, Niklas Donhauser, Jakob Fehle, Udo Kruschwitz, Christian Wolff

Categories: Computation and Language (cs.CL)

Keywords: Aspect-Based Sentiment Analysis, Large Language Models, Recent work explored, Sentiment Analysis, Large Language

Abstract:Recent work explored the capabilities of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA) through few-shot prompting, requiring substantially fewer annotated examples while achieving notable improvements over zero-shot baselines. However, a performance gap remained compared to models fine-tuned on hundreds of examples, and the computational costs of LLM inference present practical barriers to deployment. We introduce LLM-based Multi-View Prompting (LLM-MvP), which adapts the multi-view principle of considering multiple element orderings to LLM prompting. By combining schema-constrained decoding with a context-free grammar and prefix batching, LLM-MvP achieves performance competitive or superior to fine-tuned approaches while substantially reducing computational overhead. Extensive experiments across five benchmark datasets demonstrate that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering a practical and efficient solution for ABSA.

98. 【2605.28047】Knowledge Dependency Estimation for Reliable Question Answering

Link: https://arxiv.org/abs/2605.28047

Authors: Chaodong Tong, Qi Zhang, Nannan Sun, Lei Jiang, Yanbing Liu

Categories: Computation and Language (cs.CL)

Keywords: Reliable question answering, question answering requires, answering requires identifying, Reliable question, answer is correct

Comments: 12 tables, 9 figures

Abstract:Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.

99. 【2605.28046】MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Link: https://arxiv.org/abs/2605.28046

Authors: Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: flat passage lists, single query triggers, query triggers one-shot, triggers one-shot retrieval, systems universally follow

Abstract:Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm, where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as a Navigable Memory Store with associative link graphs, exposes a Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs a Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art results on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

100. 【2605.28042】Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Link: https://arxiv.org/abs/2605.28042

Authors: Liu O. Martin, Lucas Bandarkar, Nanyun Peng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, broad generalists largely, generalists largely trained, Modern large language, language models

Abstract:Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.
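
The abstract does not say how experts irrelevant to translation are identified. Purely as an illustration of training-free expert pruning, the sketch below assumes a simple criterion: count how often the router selects each expert on translation calibration data and keep the most-used fraction. The function name, the usage-count criterion, and the tie-breaking rule are all assumptions, not the paper's method.

```python
from collections import Counter
from typing import Iterable, Set


def select_experts_to_keep(
    routed_expert_ids: Iterable[int], num_experts: int, keep_fraction: float
) -> Set[int]:
    """Keep the experts the router picked most often on calibration data."""
    counts = Counter(routed_expert_ids)
    num_keep = max(1, round(num_experts * keep_fraction))
    # Rank every expert id by usage count (unused experts count as 0);
    # ties are broken by the lower expert id so the result is deterministic.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return set(ranked[:num_keep])


# Hypothetical router decisions collected on translation prompts.
routed = [0, 0, 0, 1, 1, 2, 5, 5, 5, 5]
kept = select_experts_to_keep(routed, num_experts=8, keep_fraction=0.25)
```

With `keep_fraction=0.25` this corresponds to the 75% pruning level mentioned in the abstract; in a real MoE model the selection would be made per layer.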

101. 【2605.28037】Personality, Role, and Expressive Style in Large Language Models: An Interactionist Analysis

Link: https://arxiv.org/abs/2605.28037

Authors: Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi

Categories: Computation and Language (cs.CL)

Keywords: large language model, designing large language, Prompt-based personality control, Prompt-based personality, language model

Comments: 26 pages

Abstract:Prompt-based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context-dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM-generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive-style conditions, we generate 1,080 LLM-agent dialogues in each of English and Japanese. We then evaluate the target agent's utterances using an LLM-as-a-judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality-trait specification, social and expressive conditions induce distinct personality-like impressions. Cross-linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context-dependent process involving personality specification, social role, and expressive style.
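
The size of the factorial design can be checked with a few lines. The level names below are placeholders, since the abstract gives only the number of levels per factor, and the even split of dialogues across cells is an assumption.

```python
from itertools import product

# Placeholder level names: the abstract states only how many levels exist.
personality_conditions = [f"personality_{i}" for i in range(6)]
roles = [f"role_{i}" for i in range(3)]
styles = [f"style_{i}" for i in range(3)]

# Every combination of one personality condition, one role, and one style.
cells = list(product(personality_conditions, roles, styles))
dialogues_per_language = 1080

print(len(cells))                           # 54
print(dialogues_per_language / len(cells))  # 20.0
```

If the 1,080 dialogues per language are spread evenly, each of the 54 cells holds 20 dialogues.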

102. 【2605.28025】MIRA: A Bilingual Benchmark for Medical Information Response Audit

Link: https://arxiv.org/abs/2605.28025

Authors: Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: existing safety evaluations, safety evaluations overlook, comparable medical information, preserve comparable medical, Large language models

Abstract:Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

103. 【2605.28023】VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Link: https://arxiv.org/abs/2605.28023

Authors: Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: capture visual content, visual content faithfully, omission and hallucination, Visual captioning requires, content faithfully

Comments: 28 pages, 8 figures

Abstract:Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage; however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

104. 【2605.28022】Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

Link: https://arxiv.org/abs/2605.28022

Authors: Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: finite sampling budget, multiple candidate programs, sampling budget, commonly evaluated, evaluated in repeated-sampling

Comments: Preprint under review

Abstract:LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
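
For readers unfamiliar with the metric, the sketch below shows the widely used unbiased Pass@k estimator (n samples, c of which pass the unit tests) together with a rough redundancy score. The redundancy score uses Python's `difflib` as a stand-in for JPlag, which is an assumption made only to keep the example self-contained; the paper's actual similarity measure and reward shaping are not reproduced here.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pairwise_similarity(programs: Sequence[str]) -> float:
    """Average text similarity over all program pairs (stand-in for JPlag)."""
    pairs = list(combinations(programs, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

A sample set with high `mean_pairwise_similarity` wastes budget: near-duplicate programs tend to pass or fail together, so extra samples add little to Pass@k.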

105. 【2605.28020】The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

Link: https://arxiv.org/abs/2605.28020

Authors: Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

Categories: Computation and Language (cs.CL)

Keywords: large language models, reliably evaluating, increasingly important, rapid progress, progress of large

Comments: 26 pages, 5 figures, 8 tables

Abstract:With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding. As a result, benchmark performance can conflate model capability with decoding-induced failures to produce task-oriented outputs, while exposing such behavior often relies on costly post-training. Recent decoding-only approaches attempt to reshape output distributions, but such methods can be inefficient and brittle across open-ended tasks. To address these limitations, we propose Energy-Based Decoding (EBD), a training-free, reward-guided framework for activating task-oriented behaviors from frozen pre-trained LLMs across both open-ended and objective tasks. EBD augments decoding with an external lightweight reward model, steering generations toward high-utility responses while anchoring them to the pre-trained model prior through a reward-tilted target distribution. We show that EBD shifts base-model outputs toward more instruction-following behavior, increasing behavioral similarity to post-trained counterparts and enabling a fairer inference-time evaluation of accessible pre-trained-model behavior. Empirically, EBD outperforms baselines across five models and six benchmarks, improving Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5, reducing Mistral-7B Math500 latency by 18.9x relative to prior decoding work, and remaining robust to reward-model size.
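
The abstract mentions a reward-tilted target distribution without giving its form. One common form, assumed here for illustration, weights each candidate response y by p_base(y) * exp(r(y) / beta), where beta is a temperature that controls how far the reward may pull away from the base model. The sketch normalizes these weights over a small, explicitly enumerated candidate set; the function name and the finite-candidate setting are assumptions, and the paper's actual sampler may differ.

```python
import math
from typing import List


def reward_tilted_probs(
    base_logprobs: List[float], rewards: List[float], beta: float
) -> List[float]:
    """Normalize p_base(y) * exp(r(y) / beta) over a finite candidate set."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    scores = [lp + r / beta for lp, r in zip(base_logprobs, rewards)]
    top = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

A large beta keeps the result close to the base model's own distribution, while a small beta concentrates probability on the highest-reward candidate.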

106. 【2605.28014】ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

Link: https://arxiv.org/abs/2605.28014

Authors: Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: providing dense token-level, large language models, dense token-level supervision, Reflective On-policy Self-Distillation, On-policy self-distillation

Comments: Preprint

Abstract:On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at this https URL.
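
The idea of restricting distillation to the located error span can be pictured as a masked loss. The sketch below assumes per-token distillation losses are already computed and that the self-reflector returns a half-open token range; both the function name and that interface are hypothetical.

```python
from typing import Sequence


def error_localized_loss(
    token_losses: Sequence[float], span_start: int, span_end: int
) -> float:
    """Mean per-token loss restricted to the span [span_start, span_end)."""
    if not 0 <= span_start < span_end <= len(token_losses):
        raise ValueError("span must be a non-empty range inside the response")
    # Tokens outside the span contribute nothing, so a valid reasoning
    # prefix receives no distillation signal.
    selected = token_losses[span_start:span_end]
    return sum(selected) / len(selected)
```

Compared with averaging over the full response, this leaves the tokens before `span_start` untouched by the distillation objective.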

107. 【2605.28013】KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

Link: https://arxiv.org/abs/2605.28013

Authors: Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim

Categories: Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, safety

Abstract:Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.

108. 【2605.28009】MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

Link: https://arxiv.org/abs/2605.28009

Authors: Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Memory-augmented large language, large language models, language models extend, fixed context window, Memory-augmented large

Abstract:Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.
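
A toy data structure makes the type-isolation idea concrete. The three memory types follow the categories named in the abstract; the class, its methods, and the keyword-matching retrieval are illustrative assumptions, and the cross-type relations the abstract mentions are left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List


class MemoryType(Enum):
    FACT = "stable_user_fact"
    EPISODE = "episodic_event"
    RULE = "behavioral_rule"


@dataclass
class TypedMemoryStore:
    # One isolated bucket per memory type.
    buckets: Dict[MemoryType, List[str]] = field(
        default_factory=lambda: {t: [] for t in MemoryType}
    )

    def write(self, text: str, memory_type: MemoryType) -> None:
        """Assign the functional role when the memory is written."""
        self.buckets[memory_type].append(text)

    def retrieve(self, keyword: str, allowed_types: Iterable[MemoryType]) -> List[str]:
        """Return keyword matches drawn from the allowed types only."""
        keyword = keyword.lower()
        return [
            text
            for memory_type in allowed_types
            for text in self.buckets[memory_type]
            if keyword in text.lower()
        ]


store = TypedMemoryStore()
store.write("User is vegetarian", MemoryType.FACT)
store.write("User tried sushi once on a trip", MemoryType.EPISODE)
facts_only = store.retrieve("user", [MemoryType.FACT])
```

Querying with `[MemoryType.FACT]` keeps the one-off episode out of the evidence, which is the kind of contamination the abstract describes.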

109. 【2605.28006】Integrated and Cross-Architecture Interpretation of LLM Reasoning

Link: https://arxiv.org/abs/2605.28006

Authors: Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Mutual Information Peak, patterns remain opaque, practical asymmetry, remain opaque, reason is hindered

Abstract:Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first propose to use bandwidth-calibrated MIP coupled with Tukey IQR peak-detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens are reliable indicators of reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
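
Two of the building blocks named in the abstract are standard statistics and can be sketched generically. The code below flags values above Tukey's upper fence (Q3 + 1.5 * IQR) and computes the Jaccard overlap of two index sets. How the paper computes per-token mutual information, calibrates bandwidth, or chooses the quartile method is not stated; the function names and the `inclusive` quartile method are assumptions.

```python
from statistics import quantiles
from typing import List, Sequence, Set


def tukey_upper_outliers(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Indices of values above Q3 + k * IQR (Tukey's upper fence)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > fence]


def jaccard(a: Set[int], b: Set[int]) -> float:
    """|A intersect B| / |A union B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical per-token scores with one clear peak at index 9.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
peaks = tukey_upper_outliers(scores)
```

Applied to two runs on the same problem, `jaccard(set(peaks_a), set(peaks_b))` gives a stability score between 0 and 1 for the selected token positions.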

110. 【2605.28004】Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG

Link: https://arxiv.org/abs/2605.28004

Authors: Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li

Categories: Computation and Language (cs.CL)

Keywords: extends retrieval-augmented generation, enabling graph-based retrieval, explicit knowledge graphs, enabling graph-based, GraphRAG extends retrieval-augmented

Comments: 15 pages, 5 figures, 8 tables

Abstract:GraphRAG extends retrieval-augmented generation by organizing corpora as explicit knowledge graphs, enabling graph-based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross-chunk relations -- those whose evidence spans multiple passages -- systematically absent from the index. Exhaustive LLM-based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN-guided CROSS-Chunk Graph AUGmentation method that enriches GraphRAG indices with cross-chunk relational structure as an offline step before query-time retrieval. CrossAug derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions. Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering. Our code is available at this https URL.
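
Self-supervised graph corruption can be illustrated in its simplest form: hide a random subset of known edges and treat them as the "missing" relations a model should learn to detect. The uniform random edge drop below is an assumption chosen for brevity; the abstract does not specify the corruption scheme, and the names are hypothetical.

```python
import random
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (head entity, relation, tail entity)


def corrupt_graph(
    edges: List[Edge], drop_fraction: float, seed: int = 0
) -> Tuple[List[Edge], List[Edge]]:
    """Hide a random subset of edges; the hidden edges become training targets."""
    rng = random.Random(seed)
    num_drop = int(len(edges) * drop_fraction)
    dropped_idx = set(rng.sample(range(len(edges)), num_drop))
    kept = [e for i, e in enumerate(edges) if i not in dropped_idx]
    dropped = [e for i, e in enumerate(edges) if i in dropped_idx]
    return kept, dropped


# Toy knowledge graph with made-up triples.
graph = [
    ("A", "works_at", "B"),
    ("B", "located_in", "C"),
    ("A", "born_in", "D"),
    ("D", "part_of", "C"),
]
kept, dropped = corrupt_graph(graph, drop_fraction=0.5)
```

A scorer trained to tell which regions of `kept` are missing the `dropped` edges needs no manual labels, which is what makes the supervision self-generated.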

111. 【2605.28003】ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Link: https://arxiv.org/abs/2605.28003

Authors: Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu

Categories: Computation and Language (cs.CL)

Keywords: human intervention, frontier of mathematics, mathematics is defined, remains unclear, unclear whether language

Comments: Work in progress. Dataset available at: [this https URL](https://huggingface.co/datasets/amphora/ResearchMath-14k)

Click to view abstract

Abstract:The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of $14{,}056$ problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, $220$K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce $5.6\times$ more references and $5.0\times$ more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by $9.2$ points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.

112. 【2605.27997】Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Link: https://arxiv.org/abs/2605.27997

Authors: Himanshu Beniwal, Mayank Singh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: existing mitigation methods, mitigation methods rely, frequently generate toxic, toxicity originates internally, Large language models

Comments:

Click to view abstract

Abstract:Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
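
The general idea of localizing neurons by activation differentials and then damping them at inference time can be sketched as below. This is a generic illustration, not the Meow2X or TRNE procedure: the activation values, the choice of `k`, and the `scale` factor are all hypothetical, and the rank-one weight edit mentioned in the abstract is not shown.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_k_neurons(toxic_acts, neutral_acts, k):
    """Rank neurons by |mean toxic activation - mean neutral activation|."""
    toxic_mean = mean_vector(toxic_acts)
    neutral_mean = mean_vector(neutral_acts)
    diff = [t - n for t, n in zip(toxic_mean, neutral_mean)]
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return order[:k]


def suppress(activation, neurons, scale=0.0):
    """Scale the chosen neurons' activations and leave the others unchanged."""
    chosen = set(neurons)
    return [a * scale if i in chosen else a for i, a in enumerate(activation)]


# Hypothetical activations for one layer with four neurons.
toxic_acts = [[0.1, 2.0, 0.3, 0.0], [0.2, 1.8, 0.1, 0.1]]
neutral_acts = [[0.1, 0.2, 0.2, 0.0], [0.3, 0.1, 0.3, 0.1]]

neurons = top_k_neurons(toxic_acts, neutral_acts, k=1)
damped = suppress([0.5, 1.5, 0.2, 0.0], neurons, scale=0.1)
```

In this toy setting the second neuron (index 1) shows the largest differential, so only its activation is scaled down.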

113. 【2605.27993】Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

Link: https://arxiv.org/abs/2605.27993

Authors: Jingwen Wu, Xijun Zhang, Ge Song

Categories: Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, deployment of Multimodal

Comments: 15 pages, 5 figures

Click to view abstract

Abstract:Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume hallucinations stem from visual neglect, steering models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while less may mitigate hallucinations. This result suggests that attributing hallucinations solely to visual insufficiency is underdetermined. We argue that the image, as a context, simultaneously competes with the model's parametric knowledge and the textual context. For this, we propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.

114. 【2605.27988】Auditing Stance Asymmetry in Generative Explanations

Link: https://arxiv.org/abs/2605.27988

Authors: Jiarui Han

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: made substantial progress, stereotype association, overt derogation, made substantial, substantial progress

Comments:

Click to view abstract

Abstract:Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations raise a different problem: they guide interpretation by assigning responsibility, legitimacy, context, and grievance. A model can avoid hostile language while making one side structurally understandable and another personally at fault, overreacting, or less worth taking seriously. We call this stance-bearing asymmetry in generative explanations. We propose Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit support or counter-evidence. In a controlled 32-family prototype suite, this decomposition shows that surface differences are not all alike: some weaken under structural or evidence control, while others remain as stable differences in how the model assigns blame, context, or legitimacy. Targeted case review and judge comparison suggest a broader difficulty for evaluating open-ended framing asymmetries: judge readings shift across operationalizations, and scalar scores can flatten distinctions that readers use to interpret explanatory stance. SDE therefore reframes generative bias evaluation as an audit of explanatory stance -- what stance each side receives, how it changes under decomposition, and where automatic scoring becomes unstable.

115. 【2605.27986】An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

Link: https://arxiv.org/abs/2605.27986

Authors: Dhawa Sang Dong, Mausam Gurung, Suraj Kandel

Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)

Keywords: Messenger RNA, therapeutics require optimized, require optimized design, ensure efficient translation, Large Language Model

Comments:

Click to view abstract

Abstract:Messenger RNA (mRNA) sequences as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (BERT-like Large Language Model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness functions combine translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6% (reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64), kept codon-pair bias high and consistent (0.97), and improved ribosomal accessibility at the 5' end, with an unpaired_30 fraction reaching 0.87. Global Minimum Free Energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, achieving approximately 84% base-paired structural stability, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. Whereas Linear Design produces hyper-stable transcripts (MFE - 2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints, our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
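
Three of the sequence-level quantities named in this abstract (GC content, CpG/UpA motif frequency, and CAI) have simple standard definitions, sketched below. The codon weights in `toy_weights` and the example sequence are hypothetical, not real human codon usage data, and the sketch does not cover tAI, codon-pair bias, or MFE (which the abstract computes via RNAfold). Standard CAI implementations usually exclude some codons (for example stop codons); this toy version scores every codon it has a weight for.

```python
import math


def gc_content(seq):
    """Fraction of nucleotides that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def motif_count(seq, motif):
    """Count overlapping occurrences of a motif, e.g. 'CG' (CpG) or 'UA' (UpA)."""
    seq = seq.upper()
    last_start = len(seq) - len(motif)
    return sum(seq.startswith(motif, i) for i in range(last_start + 1))


def cai(seq, weights):
    """Codon Adaptation Index: geometric mean of per-codon relative adaptiveness."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs))


# Toy relative-adaptiveness weights (hypothetical values for illustration only).
toy_weights = {"AUG": 1.0, "GCC": 1.0, "GCG": 0.25, "UAA": 1.0}
seq = "AUGGCCGCGUAA"

gc = gc_content(seq)
cpg = motif_count(seq, "CG")
upa = motif_count(seq, "UA")
score = cai(seq, toy_weights)
```

A genetic algorithm of the kind described would combine such terms into a single fitness value; how the terms are weighted is not stated in the abstract, so it is left out here.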

116. 【2605.27984】KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Link: https://arxiv.org/abs/2605.27984

Authors: Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: extending large language, achieved substantial progress, large language models, Speech language models, large language

Comments: 16 pages, 4 figures

Click to view abstract

Abstract:Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

117. 【2605.27980】Periodic RoPE for Infinite Context LLMs

Link: https://arxiv.org/abs/2605.27980

Authors: Simin Huo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: perform long-horizon tasks, process ultra-long contexts, large language models, long-horizon tasks, ability to process

Comments: 5 pages

Click to view abstract

Abstract:The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMind with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at this https URL.
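
The sliding-window part of this design can be illustrated with a causal window mask. The sketch below shows only generic sliding window attention masking, not P-RoPE itself; the function names and the window size are hypothetical. It makes one point from the abstract concrete: inside a window, the query-key distance never exceeds `window - 1`, however long the sequence grows.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query q may attend to key k.

    Attention is causal and restricted to the most recent `window` tokens.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


def max_relative_offset(mask):
    """Largest query-key distance used by any allowed (query, key) pair."""
    return max(
        q - k
        for q, row in enumerate(mask)
        for k, allowed in enumerate(row)
        if allowed
    )


mask = sliding_window_mask(seq_len=8, window=3)
short_offset = max_relative_offset(mask)
long_offset = max_relative_offset(sliding_window_mask(seq_len=100, window=3))
```

Both offsets equal 2 here, so a positional encoding applied only inside such windows sees the same bounded range of relative positions at any sequence length.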

118. 【2605.27971】Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

Link: https://arxiv.org/abs/2605.27971

Authors: Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, term Cross-Style Collapse, severely limited, large language, language models

Comments:

Click to view abstract

Abstract:When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.

119. 【2605.27969】Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost

Link: https://arxiv.org/abs/2605.27969

Authors: Jiarui Han

Categories: Computation and Language (cs.CL)

Keywords: Post-trained language-model assistants, Post-trained language-model, encouraging complete, Post-trained, proactive responses

Comments:

Click to view abstract

Abstract:Post-trained language-model assistants are often optimized to avoid under-answering, encouraging complete, helpful, cautious, and proactive responses. We ask whether this optimization creates asymmetric controllability costs: when users explicitly request narrower answers, which assistant behaviors remain suppressible, and which continue to shape the response? We study this problem as boundary-suppression asymmetry. Prompt-side probes across multiple high-level response dimensions suggest a selective cost, concentrated around 'too-much assistant' directions such as over-completion, extra help, and anti-underanswering. Using controlled assistant-policy variants derived from a shared base model, we find that anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, while minimal-boundary variants generally avoid this anti-side upward shift in the direct boundary-control comparisons. Mechanism-oriented probes point beyond longer default outputs, pure EOS failure, uncertainty compensation, and local continuation bias, while robustness checks preserve the main anti-over-baseline ordering under shared-system and larger-scale settings. The evidence supports a mixed planning/stopping account, where content-budget overshoot and continuation persistence jointly make boundary correction harder. Overall, post-training may create direction-specific controllability costs: some helpful assistant tendencies remain easy to invoke, yet harder to locally suppress.

120. 【2605.27958】Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Link: https://arxiv.org/abs/2605.27958

Authors: Sachin Kumar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: report AUROC exceeding, trained on LLM, LLM activations, activations are increasingly, increasingly proposed

Comments: Accepted at the GEM Workshop @ ACL 2026

Click to view abstract

Abstract:Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet they report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
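
For readers unfamiliar with the metric, the sketch below shows how a single-direction linear probe is scored with AUROC, and how the score can drop when the activation distribution shifts. All activations, labels, and the probe direction are hypothetical toy values chosen for illustration; they are not taken from the paper, and no probe training is shown.

```python
def probe_score(activation, direction):
    """Linear probe score: dot product of an activation with a probe direction."""
    return sum(a * d for a, d in zip(activation, direction))


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


direction = [1.0, -1.0]          # hypothetical probe direction
labels = [1, 1, 0, 0]            # 1 = deceptive, 0 = honest (toy labels)

clean = [[2.0, 0.0], [1.0, 0.5], [0.0, 1.0], [0.5, 0.5]]
shifted = [[0.25, 0.5], [1.0, 0.5], [0.75, 0.25], [0.5, 0.5]]

clean_auroc = auroc([probe_score(a, direction) for a in clean], labels)
shifted_auroc = auroc([probe_score(a, direction) for a in shifted], labels)
```

On the clean toy data the probe separates the classes perfectly (AUROC 1.0), while on the shifted toy data the same direction gives 0.375, which is worse than chance.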

121. 【2605.27957】DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

Link: https://arxiv.org/abs/2605.27957

Authors: Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee

Categories: Computation and Language (cs.CL)

Keywords: severe societal impacts, Disasters cause severe, demanding rapid coordination, coherent multi-step workflows

Comments:

Click to view abstract

Abstract:Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: this https URL
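
One way to picture step-level failure attribution is a function that walks a predicted workflow against a reference and reports the earliest divergence. The sketch below is a simplified reading of that idea, not the paper's FPoF definition: the step schema (`tool`, `params`), the error labels, and the tool names are all hypothetical.

```python
def first_point_of_failure(predicted, reference):
    """Return (step_index, error_type) for the earliest divergence.

    Returns None when the predicted plan matches the reference exactly.
    Later mismatches are ignored, since they may only be cascading effects.
    """
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            return i, "missing_step"
        pred = predicted[i]
        if pred["tool"] != ref["tool"]:
            return i, "tool_mismatch"
        if pred["params"] != ref["params"]:
            return i, "parameter_binding"
    if len(predicted) > len(reference):
        return len(reference), "extra_step"
    return None


# Hypothetical workflows; tool and parameter names are made up for illustration.
reference = [
    {"tool": "satellite_segmentation", "params": {"region": "R1"}},
    {"tool": "flood_depth_estimator", "params": {"mask": "step0.output"}},
    {"tool": "damage_report", "params": {"depth": "step1.output"}},
]
predicted = [
    {"tool": "satellite_segmentation", "params": {"region": "R1"}},
    {"tool": "flood_depth_estimator", "params": {"mask": "R1"}},
    {"tool": "damage_summary", "params": {"depth": "step1.output"}},
]

failure = first_point_of_failure(predicted, reference)
```

Here the result points at step 1 as a parameter-binding error; the wrong tool at step 2 is not reported because it comes after the first failure.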

122. 【2605.27955】Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

Link: https://arxiv.org/abs/2605.27955

Authors: Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun

Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)

Keywords: LLM agents ship, concrete invocation syntax, Markdown skill libraries, ship as free-form, invocation syntax

Comments: Preprint. Code: [this https URL](https://github.com/InternLM/Skill-as-Pseudocode)

Click to view abstract

Abstract:Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused - re-retrieve - still confused" loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.

123. 【2605.27934】GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

Link: https://arxiv.org/abs/2605.27934

Authors: Shengmin Piao, Sanghyun Park

Categories: Computation and Language (cs.CL)

Keywords: sparse outcome rewards, verifiable rewards improves, rewards improves language, improves language model, Reinforcement learning

Comments:

Click to view abstract

Abstract:Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.

124. 【2605.27932】When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Link: https://arxiv.org/abs/2605.27932

Authors: Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: remain poorly understood, implications remain poorly, reasoning is emerging, poorly understood, large vision-language models

Comments: 17 pages, 6 figures, 7 tables

Click to view abstract

Abstract:Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

125. 【2605.27921】Show, Don't TELL: Explainable AI-Generated Text Detection

Link: https://arxiv.org/abs/2605.27921

Authors: Aldan Creo, Suraj Ranganath

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: achieving high in-distribution, number of approaches, approaches to discern, high in-distribution performance, AI-generated text detection

Comments:

Click to view abstract

Abstract:Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to critically think and decide for themselves. Our work thus reframes the problem of AI-generated text detection in a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

126. 【2605.27916】OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.27916

Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, shown great potential

Comments:

Click to view abstract

Abstract:The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

127. 【2605.27914】Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

Link: https://arxiv.org/abs/2605.27914

Authors: Yuming (Rapheal) Huang, Yao Liu, Lei Wang, Junchen Wan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: calibrated emotional tone, LLM behavior, Subjective evaluation, evaluation of LLM, Subjective

Comments

Click to view abstract

Abstract: Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target's training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree. We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties -- reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric self-evolve in a data-driven manner across iterations: the dimensions are not pre-stipulated and the procedure stabilizes to a 9-dimension set. Pre-registration applies to 10 falsifiable hypotheses and 11 forward predictions, committed before any test data was collected. Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint -- whether a model refrains from giving unsolicited solutions in empathic contexts -- gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (95% of magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention).

128. 【2605.27908】ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Link: https://arxiv.org/abs/2605.27908

Authors: Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: coarse strategy supervision, offering limited interpretability, Existing emotional support, systematic skill improvement, Existing emotional

Comments

Click to view abstract

Abstract: Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state-action-outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at this https URL.

129. 【2605.27905】AI Research Agents Narrow Scientific Exploration

Link: https://arxiv.org/abs/2605.27905

Authors: Yixuan Tang, Yi Yang

Categories: Computation and Language (cs.CL)

Keywords: run code, research, raising the possibility, possibility of large-scale, ideas

Comments

Click to view abstract

Abstract:AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

130. 【2605.27901】The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Link: https://arxiv.org/abs/2605.27901

Authors: Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: detecting misaligned behavior, promising safety mechanism, mechanism for detecting, behavior in large, large language models

Comments

Click to view abstract

Abstract: Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B-120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain at 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at this https URL.

131. 【2605.27896】FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations

Link: https://arxiv.org/abs/2605.27896

Authors: Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li, Yue Ma, Jie He, Qiancheng Zhang, Yuntao Zou, Dagang Li

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)

Keywords: large language models, dynamic trading tasks, simple dynamic trading, achieved superior performance, large language

Comments: Preprint

Click to view abstract

Abstract:Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that while exhibiting basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not transform into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.

132. 【2605.27882】VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Link: https://arxiv.org/abs/2605.27882

Authors: Xiaohongshu Inc

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: persistent evaluation-experience gap, LLM-based agents score, users consistently find, find results unsatisfying, consistently find results

Comments

Click to view abstract

Abstract:LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

133. 【2605.27881】Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?

Link: https://arxiv.org/abs/2605.27881

Authors: Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li

Categories: Computation and Language (cs.CL)

Keywords: autonomously decompose queries, large language models, retrieve information, decompose queries, multi-step reasoning

Comments: 18 pages, 4 figures, and 15 tables

Click to view abstract

Abstract:Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi-step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under-explored dimensions of search agent training. First, we identify a critical data-coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome-based and process-based reward methods across three base models, finding that the simplest outcome-based approach achieves competitive or superior performance in most settings, and that process-level credit assignment can over-correct agent behavior. Third, we analyze training data diversity, off-policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at this https URL.

134. 【2605.27878】Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

Link: https://arxiv.org/abs/2605.27878

Authors: Zehan Li, Yutong Zhu, Siyang Wu, Honglin Bao, James A. Evans

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, creative output, output is widely, produce fluent fiction

Comments

Click to view abstract

Abstract: Large language models produce fluent fiction, yet their creative output is widely seen as flat. We ask where this quality originates in training and whether it affects different domains of human fiction equally. We construct a matched story-continuation paradigm across StoryStar (public-platform), TMAS (prompt-guided), and The New Yorker (professional literary), and compare continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text. Because these checkpoints share architecture, scale, tokenizer, and pretraining, the design isolates the post-training effect. We measure each continuation along three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Across all three, post-training compresses dynamic variation: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity across stories shrinks. We term this progressive loss narrative flattening. The effect is directionally stable across story domains, but the gap size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps, consistent with their human baselines sitting closer to the model's default rhythm. Post-trained endpoints converge across domains, suggesting alignment produces a continuation regime largely insensitive to the source domain's narrative texture.

135. 【2605.27874】Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

Link: https://arxiv.org/abs/2605.27874

Authors: Nghia Hieu Nguyen, Quan Ngoc Hoang, Long Hoang Huu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Categories: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, systems formulate transcription, Speech Recognition, Automatic Speech, systems formulate

Comments

Click to view abstract

Abstract: Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. Motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design more closely aligns with the phonetic realization of speech while significantly reducing vocabulary size. Experimental results on two benchmarks (LSVSC, representing standard speech, and UIT-ViMD, a multi-dialect corpus containing diverse regional pronunciations) show that our method consistently outperforms strong previous baselines, especially pretrained baselines such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for ASR in this language. Code for experimental reproducibility will be made publicly available upon acceptance of this paper.

136. 【2605.27866】GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

Link: https://arxiv.org/abs/2605.27866

Authors: Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat

Categories: Computation and Language (cs.CL)

Keywords: tutor responses requires, locate errors, provide guidance, Evaluating AI tutor, factual correctness

Comments: 16 pages, 7 figures

Click to view abstract

Abstract:Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions, with code and data available at this https URL.

137. 【2605.27865】MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

Link: https://arxiv.org/abs/2605.27865

Authors: Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

Categories: Computation and Language (cs.CL)

Keywords: coarse proxy signals, conflate general relatedness, expensive human annotations, require expensive human, major venues

Comments: 22 pages, 8 figures, 12 tables

Click to view abstract

Abstract:Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at this https URL.

138. 【2605.27858】DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

Link: https://arxiv.org/abs/2605.27858

Authors: Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: produce inspectable traces, decomposition-based methods produce, methods produce inspectable, inspectable traces, Claim verification splits

Comments

Click to view abstract

Abstract: Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at this https URL.
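
The abstract above reports results as balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch that assumes the standard definition (the mean of per-class recall). The function name `balanced_accuracy` and the toy labels are illustrative only and are not taken from the paper's code.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally regardless of its size."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = defaultdict(int)  # correctly predicted items per true class
    total = defaultdict(int)    # number of items per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: a verifier that always answers "supported".
toy_true = ["supported"] * 8 + ["refuted"] * 2
toy_pred = ["supported"] * 10
toy_score = balanced_accuracy(toy_true, toy_pred)
```

On this toy data plain accuracy would be 0.8, while balanced accuracy is 0.5, which is why the metric is a common choice for verification benchmarks with skewed label distributions.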

139. 【2605.27849】FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Link: https://arxiv.org/abs/2605.27849

Authors: Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Scala chronically underexplored, performing substantially worse, functional programming languages, imperative languages, frontier models performing

Comments

Click to view abstract

Abstract:Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
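
The layout described above, one always-active shared expert plus one routed expert per language, can be sketched in a few lines. This is a toy illustration under strong assumptions, not the FPMoE implementation: experts are plain scaling functions standing in for feed-forward blocks, routing is a hard lookup on a language tag rather than a learned router, and all names (`SharedPlusRoutedLayer`, `make_scaling_expert`) are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Toy expert: multiplies every feature by a constant (stands in for an FFN block)."""
    return lambda x: [scale * v for v in x]


class SharedPlusRoutedLayer:
    """Combines one always-active shared expert with one routed expert per language."""

    def __init__(self, shared: Expert, routed: Dict[str, Expert]) -> None:
        self.shared = shared
        self.routed = routed

    def forward(self, x: Vector, language: str) -> Vector:
        if language not in self.routed:
            raise KeyError(f"no routed expert for language: {language}")
        shared_out = self.shared(x)            # cross-language patterns
        routed_out = self.routed[language](x)  # language-specific patterns
        # Only one routed expert runs per input, so few parameters are active at once.
        return [s + r for s, r in zip(shared_out, routed_out)]


layer = SharedPlusRoutedLayer(
    shared=make_scaling_expert(1.0),
    routed={
        "haskell": make_scaling_expert(2.0),
        "ocaml": make_scaling_expert(3.0),
        "scala": make_scaling_expert(4.0),
    },
)
ocaml_out = layer.forward([1.0, 2.0], "ocaml")
```

Because the routed experts never see each other's inputs, updates for one language cannot overwrite another's weights, while the shared expert still receives every input.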

140. 【2605.27835】CAREF: Calibration-Aware Regularization for Explanation Faithfulness Without Rationale Supervision

Link: https://arxiv.org/abs/2605.27835

Authors: Naphat Nithisopa, Teerapong Panboonyuen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: jointly optimizes predictive, optimizes predictive accuracy, parameter-efficient fine-tuning framework, explanation faithfulness, framework that jointly

Comments: 10 pages

Click to view abstract

Abstract:We introduce CAREF, a parameter-efficient fine-tuning framework that jointly optimizes predictive accuracy and explanation faithfulness via calibration-aware regularization. At its core, CAREF couples entropy-based calibration with token-level sparsity control through a single unified loss, the Calibration-Aware Regularization for Explanation Faithfulness (LSCED), without requiring rationale supervision. Evaluated on four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI) with Flan-T5, our lightweight CAREF-AQ variant attains the best average accuracy (89.04) and explanation alignment (81.00 nBERT) using only 6.43% of trainable parameters, outperforming LoRA and AdaLoRA. To our knowledge, CAREF is the first method to unify entropy and sparsity regularization in a single training objective for interpretable LLM fine-tuning.

141. 【2605.27832】Playing with Words, Improving with Rewards: Training Language Models for Creative Association

Link: https://arxiv.org/abs/2605.27832

Authors: Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, increasingly difficult problems, Language Models, applied to increasingly

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.

142. 【2605.27824】Revealing Algorithmic Deductive Circuits for Logical Reasoning

Link: https://arxiv.org/abs/2605.27824

Authors: Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, few-shot learning settings, incorporating functional symbolic

Comments

Click to view abstract

Abstract:Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% total heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

143. 【2605.27808】TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

Link: https://arxiv.org/abs/2605.27808

Authors: Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen, Xiao-Wen Chang, Yixuan HE

Categories: Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Data-aware post-training quantization, implicitly weighting positions, per-token reconstruction loss, Data-aware post-training, small calibration corpus

Comments

Click to view abstract

Abstract: Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via rareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-Word Error Rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.

144. 【2605.27805】ChildEval: When large language models meet children's personalities

Link: https://arxiv.org/abs/2605.27805

Authors: Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: personalization remains unclear, child-centered personalization remains, enable personalized chatbots, remains unclear, personalization remains

Comments: 8 pages of main text (ACL Findings format), with references and appendix

Click to view abstract

Abstract: While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona), expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at this https URL.

145. 【2605.27789】A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Link: https://arxiv.org/abs/2605.27789

Authors: Camilo Chacón Sartori, José H. García

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-augmented generation, large language model, RAG, multi-hop RAG, LLM

Comments

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
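
The exact cluster sign-flip check named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes paired score differences (system A minus system B) that are already grouped by cluster, uses the sum of differences as the test statistic, and the function name `cluster_sign_flip_pvalue` is hypothetical.

```python
from itertools import product


def cluster_sign_flip_pvalue(cluster_diffs):
    """Exact two-sided sign-flip p-value computed over clusters.

    cluster_diffs: list of lists; each inner list holds the paired score
    differences (A minus B) for the questions in one cluster.
    """
    # Collapse each cluster to the sum of its differences, so that all
    # questions in a cluster always flip sign together.
    sums = [sum(cluster) for cluster in cluster_diffs]
    observed = abs(sum(sums))
    at_least_as_extreme = 0
    total = 0
    # Enumerate every assignment of +1/-1 to the clusters (2^k in total).
    for signs in product((1, -1), repeat=len(sums)):
        stat = abs(sum(s * v for s, v in zip(signs, sums)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Three clusters whose differences all favor system A.
print(cluster_sign_flip_pvalue([[1, 1], [1], [1, 1, 1]]))  # 0.25
```

Six questions that all favor system A would look decisive under a per-question binomial test, yet with only three clusters the smallest achievable two-sided p-value is 2/8 = 0.25. Enumeration costs 2^k sign assignments for k clusters, so the exact version is only practical for a modest number of clusters.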

146. 【2605.27788】Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

Link: https://arxiv.org/abs/2605.27788

Authors: Abhijit Kumar, Zoey Wu, Mohit Suley

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: warrants a calculator, Humans, CARL, calls

Comments

Click to view abstract

Abstract:Humans know when to reach for help: for example, $347 \times 28$ warrants a calculator while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining about 10 EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.

147. 【2605.27787】Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems

Link: https://arxiv.org/abs/2605.27787

Authors: Seunghyuk Cho, Sunghyun Choi, Jaeseung Heo, Youngbin Choi, Saemi Moon, MoonJeong Park, Dongwoo Kim

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: autonomous software engineering, raise sustainability concerns, substantially advanced autonomous, advanced autonomous software, demands raise sustainability

Comments: 19 pages, 4 figures, 12 tables

Click to view abstract

Abstract:Multi-agent systems (MAS) have substantially advanced autonomous software engineering (SWE), but their growing inference energy demands raise sustainability concerns. In this paper, we demonstrate that this cost is concentrated in an overlooked source: redundant output tokens generated across agents. Two empirical findings ground this claim. First, our per-token energy attribution for MAS reveals a sharp asymmetry: an output token consumes 30 to 1,000 times more energy than an input or cached token. Second, MAS inflate per-episode output because agents repeatedly re-explore overlapping repository regions. To address this inefficiency, we propose Librarian, a persistent search sub-agent that tracks repository-search history and suppresses redundant exploration actions across agents. By returning short references to file regions instead of full file excerpts, Librarian further reduces output-token volume. On SWE-Bench Verified, Librarian reduces per-episode GPU energy consumption of existing multi-agent SWE systems by up to 25% while preserving task performance.

148. 【2605.27773】Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

Link: https://arxiv.org/abs/2605.27773

Authors: Pruthvinath Jeripity Venkata

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: document contradicting, contradicting its training, follow the document, document or trust, document

Comments: 12 pages, 8 tables, 3 appendices

Click to view abstract

Abstract:When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.

149. 【2605.27767】UniMaia: Steering Chess Policies with Language for Human-like Play

Link: https://arxiv.org/abs/2605.27767

Authors: Sherman Siu (1), Lesley Istead (1, 2) ((1) University of Waterloo, (2) Carleton University)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: controlling complex systems, enabled natural language, Recent advances, large language models, domain-specific inductive biases

Comments

Click to view abstract

Abstract:Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose UniMaia, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

150. 【2605.27750】Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Link: https://arxiv.org/abs/2605.27750

Authors: Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Recent work, optical character recognition, Ancient Greek critical, low-resource Ancient Greek, producing plausible Greek

Comments

Click to view abstract

Abstract:Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.

151. 【2605.27741】Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Link: https://arxiv.org/abs/2605.27741

Authors: Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo

Categories: Computation and Language (cs.CL)

Keywords: omni-modal large language, large language models, language models exhibit, models exhibit impressive, exhibit impressive cross-modal

Comments

Click to view abstract

Abstract:Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.

152. 【2605.27740】UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

Link: https://arxiv.org/abs/2605.27740

Authors: Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li

Categories: Computation and Language (cs.CL)

Keywords: large language models, self-attention key-value, Top-k sparse attention, inference in large, large language

Comments

Click to view abstract

Abstract:Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page's keys as a representative vector with their standard deviation as an offset term. To further close the train-inference gap, this paper introduces a soft-mask sparsity-aware training scheme that uses the top-k score boundary as a per-query threshold and a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance on long-context benchmarks such as LongBench Pro and on long-form speech recognition, while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.

153. 【2605.27721】UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

Link: https://arxiv.org/abs/2605.27721

Authors: Cheng Qian, Jiayu Liu, Heng Ji

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: building effective agent, central to building, building effective, user, user mental state

Comments: 19 Pages, 4 Figures, 2 Tables

Click to view abstract

Abstract:Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user's mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user's mind, positioning user harnessing as a promising foundation for more adaptive future assistants.

154. 【2605.27715】Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

Link: https://arxiv.org/abs/2605.27715

Authors: Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu

Categories: Computation and Language (cs.CL)

Keywords: Large reasoning models, achieve strong mathematical, Large reasoning, achieve strong, Large

Comments: preprint

Click to view abstract

Abstract:Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model's reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.

155. 【2605.27709】ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Link: https://arxiv.org/abs/2605.27709

Authors: Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

Categories: Computation and Language (cs.CL)

Keywords: evaluating large language, separate genuine reasoning, large language models, vital for evaluating, evaluating large

Comments

Click to view abstract

Abstract:Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
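
The answer-inversion idea can be illustrated with a toy example. This is only a hand-written sketch: the abstract does not say how the rewriting is performed, so the fixed text templates and the function names `forward_problem` and `reversed_problem` below are hypothetical stand-ins.

```python
def forward_problem(price, quantity):
    """Original problem: the total cost is the answer."""
    text = (
        f"A notebook costs {price} dollars. "
        f"How much do {quantity} notebooks cost?"
    )
    return text, price * quantity


def reversed_problem(price, quantity):
    """Inverted problem: mask the quantity and reveal the original answer."""
    _, total = forward_problem(price, quantity)
    text = (
        f"A notebook costs {price} dollars. "
        f"Some notebooks cost {total} dollars in total. "
        "How many notebooks were bought?"
    )
    # The masked value is the new answer, so it is known by construction.
    return text, quantity


question, answer = reversed_problem(3, 7)
print(question)
print(answer)  # 7
```

A model that has memorized the original pair would tend to answer 21 here, which is the kind of behavioral shift the paired evaluation is designed to expose.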

156. 【2605.27706】Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

Link: https://arxiv.org/abs/2605.27706

Authors: Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Chain-based Adaptive Reconfiguration, Chain-based Adaptive, Adaptive Reconfiguration, large language models, test-time hallucination reduction

Comments

Click to view abstract

Abstract:We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

157. 【2605.27690】TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

Link: https://arxiv.org/abs/2605.27690

Authors: Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: intermediate steps long, agents increasingly operate, LLM agents increasingly, environment interaction, final outcome

Comments

Click to view abstract

Abstract:LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

158. 【2605.27668】Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

Link: https://arxiv.org/abs/2605.27668

Authors: Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Probabilistic forecasting estimates, uncertain future events, Probabilistic forecasting, uncertain future, Probabilistic

Comments

Click to view abstract

Abstract:Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
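
The two quantities the abstract reads off the Beta distribution follow from its standard closed forms, sketched below. How BBC maps an initial forecast to $\alpha$ and $\beta$ is not described in the abstract, so the sketch simply takes them as given; the function name `beta_summary` is hypothetical.

```python
def beta_summary(alpha, beta):
    """Return (mean, variance) of a Beta(alpha, beta) distribution.

    Under the abstract's framing, the mean serves as the calibrated point
    forecast and the variance as the epistemic uncertainty.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


# Same point forecast, different concentration.
print(beta_summary(2, 2))    # (0.5, 0.05)
print(beta_summary(20, 20))  # mean 0.5 again, with a smaller variance
```

Both parameter settings give the same point forecast of 0.5, but the more concentrated distribution has lower variance. That separation is what lets a single model output carry both a probability and a measure of how much to trust it.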

159. 【2605.27654】Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

Link: https://arxiv.org/abs/2605.27654

Authors: Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Generative translation systems, socially meaningful cues, specific grammatical systems, Generative translation, culturally specific grammatical

Comments: 10 pages, 2 figures, 9 tables

Click to view abstract

Abstract:Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

160. 【2605.27649】Disentangling Language Roles in Multilingual LLM Task Execution

Link: https://arxiv.org/abs/2605.27649

Authors: Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: required response languages, required response, source content, multilingual instruction-following evaluation, text

Comments

Click to view abstract

Abstract:Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet $(L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})$. Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
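
The fully crossed design amounts to a Cartesian product over the three language roles, which the sketch below enumerates. The language codes and the reading of "response-only" mismatch (instruction and content share a language, the response differs) are assumptions made for illustration, not definitions taken from the paper.

```python
from itertools import product

# Language codes are labels chosen for this sketch, not taken from the paper.
LANGUAGES = ("en", "es", "zh")


def enumerate_triplets(languages=LANGUAGES):
    """Return every (instruction, content, response) language triplet."""
    return list(product(languages, repeat=3))


def response_mismatch_only(triplet):
    """True when only the response language differs from the other two."""
    instr, content, resp = triplet
    return instr == content and resp != instr


triplets = enumerate_triplets()
print(len(triplets))                                      # 27
print(sum(response_mismatch_only(t) for t in triplets))   # 6
```

If the 2,430 instances were spread evenly over the 27 triplets, each triplet would hold 90 instances; the abstract does not state the actual split.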

161. 【2605.27642】Learning to Translate from Soft to Hard LLM Prompts

Link: https://arxiv.org/abs/2605.27642

Authors: Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Soft prompt tuning, specific tasks, Soft prompt, adapting LLMs, LLMs to specific

Comments: 8 Pages, 11 tables, 4 Figures

Click to view abstract

Abstract:Soft prompt tuning is a parameter-efficient method for adapting LLMs to specific tasks, but suffers from a lack of interpretability. Building on recent work on interpreting soft prompts (Ramati et al., 2024), we explore how training a dedicated soft-prompt-to-natural-language translation model can yield higher translation quality. In particular, in both quantitative and qualitative comparisons on multiple Datasets of Datasets (DoDs), we demonstrate that our translator produces fluent, accurate verbalizations that outperform existing training-free methods like InSPEcT. In addition to advancing interpretability, our work suggests a promising downstream application: soft prompts optimized on small, open-source models can be translated into portable text prompts that, when deployed on larger closed-API models, exceed the performance of the original soft prompt and, in some cases, even few-shot learning.

162. 【2605.27636】Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Link: https://arxiv.org/abs/2605.27636

Authors: Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

Categories: Computation and Language (cs.CL)

Keywords: general reasoning tasks, Large Language Models, demonstrate excellent capabilities, general public domain, culturally grounded knowledge

Comments: 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval2026

Click to view abstract

Abstract:Although Large Language Models (LLMs) demonstrate excellent capabilities and performance for general reasoning tasks within the general public domain, they may face challenges with culturally grounded knowledge within languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, which consists of a multilingual corpus of 30 languages and covers various socio-cultural domains, such as cuisine, sports, family, etc. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show improvements to cross-lingual stability with the hybrid retrieval approach over pure parametric inference for culturally grounded question answering. However, there are still notable performance gaps between languages with more and less training data. This shows that the limitations of the retrieval augmentation approach are not entirely overcome by the training data imbalance problem.

163. 【2605.27621】Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

Link: https://arxiv.org/abs/2605.27621

Authors: Mingyu Lu, Yushan Huang, Chris Lin, Su-In Lee

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: system optimization, increasingly complex, multi-agent systems, critical for system, individual agents

Comments: none

Abstract:As multi-agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignment. In this work, we formalize agent attribution as a cooperative game, parameterized by the coalition distribution, removal protocol, and target metric. Using this framework, we show that Leave-One-Out (LOO) identifies bottleneck agents as effectively as combinatorial methods, but at a fraction of the computational cost. We also demonstrate that removal protocols induce distinct games: Agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to faithfully approximate this behavior. Furthermore, to evaluate the utility of specific agent backbones, we introduce attribution via model replacement. By substituting underlying models of low-contribution agents, we improve task performance by up to 17% while reducing cost by up to 35% across three benchmarks. Finally, we apply our framework to audit a medical MAS, revealing that agent contributions to diagnostic accuracy and ethical behavior are often decoupled. By intervening on counterproductive roles, we observe an increase in ethics alignment while maintaining diagnostic accuracy. Overall, this work provides a principled approach for cost-effective MAS attribution and intervention.
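
Leave-One-Out attribution, as named in this abstract, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the removal protocol here is plain ablation (the agent is dropped from the team), and `toy_metric` with its agent names and point values is a hypothetical stand-in for a real benchmark score.

```python
def leave_one_out(agents, evaluate):
    """Return each agent's LOO score: the full-team metric minus the
    metric obtained when that single agent is removed."""
    team = frozenset(agents)
    full = evaluate(team)
    return {a: full - evaluate(team - {a}) for a in agents}


def toy_metric(team):
    """Hypothetical target metric in integer points, keyed by who is present."""
    score = 20
    if "planner" in team:
        score += 40
    if "coder" in team:
        score += 30
    if "critic" in team and "coder" in team:
        score += 5  # the critic only helps when there is code to review
    return score


agents = ["planner", "coder", "critic"]
scores = leave_one_out(agents, toy_metric)
bottleneck = max(scores, key=scores.get)
```

For n agents this needs n + 1 evaluations of the metric, whereas an exact Shapley computation enumerates all 2^n coalitions, which is the cost gap the abstract alludes to.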

164. 【2605.27596】Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

Link: https://arxiv.org/abs/2605.27596

Authors: Saptarshi Sengupta, Suhang Wang

Categories: Computation and Language (cs.CL)

Keywords: Small Language Models, Small Language, large language models, lower hardware demands, interest in Small

Comments: none

Abstract:Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works think first and then retrieve iteratively to reduce hallucination. We argue that the think-first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer, and (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first, reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.

165. 【2605.27586】You Only Align Once: Propagating Cooperative Behaviors in Multi-Agent Systems through Seed Agents

Link: https://arxiv.org/abs/2605.27586

Authors: Nicole Hsing, Asuka Yuxi Zheng, Yi Zhao, Haoqin Tu, Jen-Tse Huang

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: systems remains challenging, Ensuring agent behaviors, distributed open multi-agent, open multi-agent systems, multi-agent systems remains

Comments: none

Abstract:Ensuring agent behaviors in distributed open multi-agent systems remains challenging, especially as populations grow and unaligned agents may exist. We show that a single aligned agent can propagate cooperative behaviors to untrained agents purely through natural language interaction, a phenomenon we term Alignment Propagation. We study this in the Red-Black Game, a team-based iterated Prisoner's Dilemma in which teammates deliberate and vote to determine their team's collective action. By distilling the cooperative reasoning and persuasive dialogues of a teacher model into a Qwen-3-14B model, we obtain a seed agent that, when placed among four untrained teammates, doubles the cooperation rate from 24.8% to 62.2%, outperforming the teacher model and a vanilla Gemini-3.1-Pro. Remarkably, a seed trained exclusively on the Red-Black Game transfers zero-shot to Sugarscape, a spatially grounded survival simulation with pairwise trading, achieving a 91.5% trade success rate versus a 21.6% baseline. Our results reframe multi-agent alignment from an exhaustive per-agent training problem to a scalable social capability that can be engineered through strategic seed placement.

166. 【2605.27571】Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Link: https://arxiv.org/abs/2605.27571

Authors: Gaetano Rossiello, Dharmashankar Subramanian

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)

Keywords: continuously evolving data, Modern analytics systems, fundamentally reactive, requiring users, Modern analytics

Comments: Accepted at Supporting Our AI Overlords (SAO) at the ACM Conference on AI and Agentic Systems (CAIS), May 26 2026, San Jose, CA, USA

Abstract:Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

167. 【2605.27567】Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Link: https://arxiv.org/abs/2605.27567

Authors: Amartya Roy, Sonali Parbhoo

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Causal Bayesian Optimization, scientific reasoning, open question, Agentic Causal Bayesian, cornerstone of scientific

Comments: 9 pages, 3 figures

Abstract:Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

168. 【2605.27564】The Future of Facts: Tracing the Factual Generation-Verification Gap

Link: https://arxiv.org/abs/2605.27564

Authors: Tim R. Davidson, Anja Surina, Caglar Gulcehre

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Language models, default interface, verify outputs, outputs more reliably, factual knowledge

Comments: Code for this project is available at [this https URL](https://github.com/anjasurina/factgap), blog post at [this https URL](https://www.trdavidson.com/fact-gap)

Abstract:Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

169. 【2605.27546】Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies

Link: https://arxiv.org/abs/2605.27546

Authors: Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: rapidly assess thousands, identify mental health, mental health concerns, youth SMS conversations, youth SMS

Comments: none

Abstract:Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.

170. 【2605.27545】PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

Link: https://arxiv.org/abs/2605.27545

Authors: Snehasis Mukhopadhyay

Categories: Computation and Language (cs.CL)

Keywords: systems remain underexplored, unsafe image generation, remain underexplored, systems remain, severe consequences

Comments: none

Abstract:Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state of the art multimodal text to image models. Building on prior findings that past tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models Gemini Nano Banana Pro, GPT Image 2, and SD XL achieving attack success rates of 83 percent, 67 percent, and 100 percent in a black box, gradient free setting. Adversarial prompts also transfer across models, with cross model success rates above 50 percent. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training.

171. 【2605.27531】Agentic Separation Logic Specification Synthesis

Link: https://arxiv.org/abs/2605.27531

Authors: Tarun Suresh, David Korczynski, Julien Vanegue

Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: automatically inferring formal, inferring formal specifications, important for refactoring, task of automatically, automatically inferring

Comments: 9 pages, 3 appendices

Abstract:Specification synthesis, the task of automatically inferring formal specifications from program implementations and natural language, is important for refactoring, transpilation, optimization, and verification, yet remains an open challenge for large C++ repositories. Existing LLM-based approaches fail to simultaneously scale to such repositories, produce specifications expressive enough to capture systems-code features such as dynamic memory and heap-allocated data structures, and systematically validate those specifications to rule out incorrect candidates. We present Spec-Agent, an agentic system for synthesizing expressive, well-validated specifications across large C++ codebases. Spec-Agent targets a ladder of specification languages: propositional logic, first-order logic, propositional separation logic, and first-order separation logic. For each function, Spec-Agent uses static analysis and runtime heap tracing to select the appropriate target specification language, generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidates via counterexample-guided feedback. We evaluate Spec-Agent on open source C++ codebases comprising millions of lines of code. Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, outperforming Claude Code Opus 4.6 at 10x lower token cost.

172. 【2605.27494】Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Link: https://arxiv.org/abs/2605.27494

Authors: Syed Huma Shah (Duke University)

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Modern retrieval-augmented generation, deployments increasingly rely, Modern retrieval-augmented, reduce token cost, deployments increasingly

Comments: 19 pages, 9 figures, 10 tables. Code: [this https URL](https://github.com/syedhumarahim/grounded-cache-router)

Abstract:Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), the fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
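
The four-gate admission rule described in this abstract can be sketched roughly as follows. This is an illustrative reading, not the released implementation: the `CacheEntry` fields, the Jaccard overlap measure, the token-containment support score, and all three thresholds are assumptions chosen for the example, and the query similarity is passed in as a precomputed number.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    answer: str
    evidence_ids: frozenset  # ids of passages retrieved when the answer was cached
    source_versions: dict    # doc id -> version seen at caching time


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as no overlap."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_support(answer, passages):
    """Fraction of answer tokens that appear somewhere in the fresh passages."""
    tokens = answer.lower().split()
    pool = set(" ".join(passages).lower().split())
    return sum(t in pool for t in tokens) / len(tokens) if tokens else 0.0


def can_reuse(entry, query_sim, fresh_ids, fresh_passages, current_versions,
              sim_min=0.9, overlap_min=0.5, support_min=0.8):
    """Admit the cached answer only if all four gates hold at once."""
    gates = (
        query_sim >= sim_min,                                   # query similarity
        jaccard(entry.evidence_ids, fresh_ids) >= overlap_min,  # evidence overlap
        all(current_versions.get(doc) == ver                    # source versions
            for doc, ver in entry.source_versions.items()),
        lexical_support(entry.answer, fresh_passages) >= support_min,
    )
    return all(gates)


entry = CacheEntry("paris is the capital", frozenset({"d1", "d2"}), {"d1": 1, "d2": 3})
passages = ["Paris is the capital of France"]
fresh_ok = can_reuse(entry, 0.95, frozenset({"d1", "d2", "d3"}), passages,
                     {"d1": 1, "d2": 3, "d3": 1})
drift_ok = can_reuse(entry, 0.95, frozenset({"d1", "d2", "d3"}), passages,
                     {"d1": 1, "d2": 4, "d3": 1})
```

In the second call only the version of `d2` has changed, so the source-version gate alone blocks reuse. This mirrors the abstract's point that a cache hit should be judged on safety rather than on similarity alone.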

173. 【2605.27483】Debate Helps Weak Judges Reward Stronger Models

Link: https://arxiv.org/abs/2605.27483

Authors: Ethan Elasky, Frank Nakasako, Naman Goyal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: mixed empirical results, produced mixed empirical, judge, theoretical promise, empirical results

Comments: none

Abstract:Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.

174. 【2605.27458】Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Link: https://arxiv.org/abs/2605.27458

Authors: Yongjin Cui, Xiaohui Fan, Huajun Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: heterogenous attention structures, attention structures, heterogenous attention, propelled the development, development of artificial

Comments: none

Abstract:Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We categorize attention structures of Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. Heterogenous attention structure is the foundation for Transformer models to achieve more complex functions and integrate more modal information. Whether for research purposes or policy requirements, the interpretation of Transformer models with heterogenous attention structures is an important task. The fusion of information from different sources brings new challenges. Our work mainly includes two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models, conduct semantic interpretation and logical interpretation.

175. 【2605.27402】REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

Link: https://arxiv.org/abs/2605.27402

Authors: Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: manual grading remains, grading remains time-consuming, time-consuming and costly, central to equitable, equitable and personalized

Comments: none

Abstract:Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than both state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.

176. 【2605.27393】StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

Link: https://arxiv.org/abs/2605.27393

Authors: Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, prior works lack, works lack situational, Large language, dynamic strategy control

Comments: ACL2026

Abstract:Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by the interaction agent, while an interaction agent dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

177. 【2605.27390】EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

Link: https://arxiv.org/abs/2605.27390

Authors: Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: accelerates Large Language, Large Language Model, Speculative decoding accelerates, decoding accelerates Large, Large Language

Comments: none

Abstract:Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.

178. 【2605.27389】Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

Link: https://arxiv.org/abs/2605.27389

Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: educational recommender system, conditioning context shapes, teacher-facing educational recommender, context shapes personalization, recommender system

Comments: Accepted to ITS 2026

Abstract:We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual conditioning based on the current student question with memory-based conditioning using persistent learner information. Using deviation correlation and paired statistical tests, we find that contextual recommendations exhibit stronger question-level responsiveness, while memory-based recommendations exhibit history-dependent behaviors, including learner-specific differentiation under identical input. Teacher-facing evaluation signals suggest these recommendations are interpretable and actionable. These results indicate that embedding-based similarity metrics capture responsiveness to the current question but do not characterize personalization grounded in learner history, motivating behavior-level diagnostics for studying conditioning effects.

179. 【2605.27388】Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Link: https://arxiv.org/abs/2605.27388

Authors: Nuan Wen, Xuezhe Ma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Keywords: Large language models, Large language, thick descriptions, critical challenge, increasingly utilized

Comments: Preprint

Abstract:Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

180. 【2605.27387】From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

Link: https://arxiv.org/abs/2605.27387

Authors: Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: pre-trained Autoregressive, Diffusion models promise, models promise efficient, bidirectional attention, creating a structural

Comments: Accepted by ACL 2026

Abstract:Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at this https URL.
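
The abstract describes Elastic Horizons only at a high level: denoising strides are modulated by local information density instead of a fixed schedule. The sketch below illustrates one way such an entropy-driven rule could look. The function names, the stride bounds, and the linear mapping (low entropy gives a longer stride) are assumptions made for illustration and are not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def stride_from_entropy(probs, min_stride=1, max_stride=8):
    """Hypothetical rule: confident predictions allow a longer denoising stride.

    The entropy is normalized by its maximum, log(vocabulary size), and then
    mapped linearly onto [min_stride, max_stride]. This mapping is an assumption.
    """
    vocab = len(probs)
    if vocab < 2:
        return max_stride
    normalized = shannon_entropy(probs) / math.log(vocab)  # roughly in [0, 1]
    stride = max_stride - normalized * (max_stride - min_stride)
    return max(min_stride, min(max_stride, round(stride)))


# A peaked distribution (low information density) gets the longest stride,
# a uniform one (high uncertainty) gets the shortest.
confident = stride_from_entropy([1.0, 0.0, 0.0, 0.0])
uncertain = stride_from_entropy([0.25, 0.25, 0.25, 0.25])
```

In a real decoder the distribution would come from the model's next-token logits at each position; here it is passed in directly to keep the example self-contained.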

181. 【2605.27384】From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

Link: https://arxiv.org/abs/2605.27384

Authors: Danai Korre

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: position paper reflects, large-scale within-subjects study, paper reflects empirical, position paper, paper reflects

Comments: 4 pages, 5 figures, ACM CHI 2026 workshop paper

Abstract:This position paper reflects on empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like, text-based agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles: an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. Data were analyzed using paired t-tests, repeated-measures ANOVA, and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version with highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed for human-agent collaboration, especially for how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.

182. 【2605.27383】Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Link: https://arxiv.org/abs/2605.27383

Authors: Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Spoken Language Models, Spoken Language, Language Models, bypassing explicit, promising paradigm

Comments: none

Abstract:Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

183. 【2605.27382】The Alignment Floor: When Persona Customization Is Safe

Link: https://arxiv.org/abs/2605.27382

Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: systems respect diverse, respect diverse user, behavioral adaptation, communication styles, key promise

Comments: none

Abstract:A key promise of pluralistic AI is behavioral adaptation: persona prompts like "be creative" or "be thorough" let systems respect diverse user values and communication styles. But how much customization can a model absorb before its alignment breaks? We present the first controlled study of the alignment-customization tradeoff, testing seven persona conditions across five tasks on two models with different alignment strengths (1,800 runs). We discover the alignment floor: on a strongly-aligned model (Claude Sonnet), persona prompts have zero effect on sycophancy -- all conditions produce ~15%, a stable platform on which rich personalization is safe. On a weakly-aligned model (Nova Lite), the same personas shift sycophancy from 5% to 50% -- the floor is absent and customization becomes a safety liability. Surprisingly, Agreeableness is not the worst offender; Extraversion (+20pp) and Openness (+15pp) cause greater degradation. The constructive finding is the Skeptic defense: a critical-thinking persona reduces sycophancy to 5% even on the weak model -- the single largest effect in the study. Cross-model transfer of persona effects is near-zero ($\rho = 0.006$), meaning alignment testing must be per-model. We propose the alignment floor as a design principle: measure it before deploying persona customization, and layer safety-oriented personas underneath user-facing ones to enable personalization without compromising alignment.

184. 【2605.27380】BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Link: https://arxiv.org/abs/2605.27380

Authors: Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: biomedical NLP applications, biomedical entity linking, biomedical knowledge base, biomedical NLP, NLP applications

Comments: 12 pages, 3 figures

Abstract:Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.

185. 【2605.27379】Soro: A Lightweight Foundation Model and Chatbot for Tajik

Link: https://arxiv.org/abs/2605.27379

Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shaymardonov, Shuhratjon Khalitbekov, Bonu Boboeva

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Tajik-specialized conversational large, large language models, conversational large language, family of Tajik-specialized, Tajik-specialized conversational

Comments: none

Abstract:We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

186. 【2605.27378】OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

Link: https://arxiv.org/abs/2605.27378

Authors: Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: supporting accurate diagnosis, image analysis plays, plays a pivotal, pivotal role, role in supporting

Comments: 14 pages, 7 figures, 6 tables

Abstract:Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at this https URL.

187. 【2605.27377】RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

Link: https://arxiv.org/abs/2605.27377

Authors: Yidong Gan, David D. Nguyen, Yang Lin, Peter Zhong, Thanh Vu, Long Duong, Yuan-Fang Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: coding, present RAG-Coding, agentic method, external knowledge sources, RAG-Coding

Comments: Additional experiments and analyses are in progress

Abstract:We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and grounds their coding decisions in external knowledge sources (e.g., the official coding tabular list and guidelines). By retrieving and cross-referencing relevant knowledge in these sources, the agents enhance coding accuracy and ensure clinical compliance. On the MDACE dataset, RAG-Coding outperforms the best LLM-based baseline by 8-13% in micro-F1 and 2-8% in macro-F1 across multiple LLM backbones. Compared to the state-of-the-art pretrained language model method, PLM-ICD, RAG-Coding exhibits higher micro recall (+11%), while PLM-ICD exhibits higher micro precision (+6%), yielding comparable micro- and macro-F1. Ablations show stepwise gains, highlighting the importance of incorporating external knowledge. We also release MDACE-2025, updating the original dataset with expert re-annotations with the latest 2025 ICD-10-CM guidelines. This update features more fine-grained code labels and enables evaluation against current clinical standards.
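
The abstract reports both micro-F1 and macro-F1, which can diverge sharply on long-tailed code distributions. The following minimal sketch shows how the two averages are computed from per-code counts. The helper names and the counts are made up for illustration and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each code to a (tp, fp, fn) tuple.

    Micro-F1 pools the counts over all codes before computing F1, so frequent
    codes dominate. Macro-F1 averages the per-code F1 scores, so every code
    has equal weight regardless of frequency.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Illustrative counts only: one frequent code predicted well, one rare code
# that is mostly missed.
example_counts = {"E11.9": (90, 10, 10), "I10": (1, 0, 9)}
micro, macro = micro_macro_f1(example_counts)
```

With these counts the micro average stays high because the frequent code dominates, while the macro average drops because the rare code counts equally.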

188. 【2605.27376】Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

Link: https://arxiv.org/abs/2605.27376

Authors: Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: limited fine-grained control, enable natural language-driven, natural language-driven speaking, provide limited fine-grained, models enable natural

Comments: none

Abstract:While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
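
The inter-utterance technique in the abstract, a direction vector between contrastive style prompts followed by simple interpolation, can be sketched as below. The three-dimensional embeddings are toy values, and the function names are hypothetical; in practice the vectors would come from the TTS model's style-prompt encoder, which is not shown here.

```python
def direction(emb_from, emb_to):
    """Direction vector pointing from one style-prompt embedding to another."""
    return [b - a for a, b in zip(emb_from, emb_to)]


def interpolate(base, direction_vec, alpha):
    """Move `base` along the style direction; alpha=0 keeps base, alpha=1 reaches the target."""
    return [x + alpha * d for x, d in zip(base, direction_vec)]


# Toy embeddings standing in for contrastive prompts such as
# "speaks slowly" and "speaks quickly" (values are illustrative only).
slow = [0.0, 1.0, 2.0]
fast = [1.0, 1.0, 0.0]

speed_direction = direction(slow, fast)
halfway = interpolate(slow, speed_direction, 0.5)
```

Sweeping `alpha` between 0 and 1 gives a continuous range of style embeddings between the two prompts, which is the kind of smooth attribute control the abstract describes.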

189. 【2605.27375】LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

Link: https://arxiv.org/abs/2605.27375

Authors: Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, in-context reward hacking, maximize proxy objectives, harmful side effects

Comments: none

Abstract:Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.

190. 【2605.27374】ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

Link: https://arxiv.org/abs/2605.27374

Authors: Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong

Categories: Computation and Language (cs.CL)

Keywords: multimodal large language, Recent advances, large language models, AI-generated content, advances in multimodal

Comments: Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12268-12278, EMNLP 2025. Official version: [this https URL](https://doi.org/10.18653/v1/2025.emnlp-main.617)

Abstract:Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.

191. 【2605.27373】Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Link: https://arxiv.org/abs/2605.27373

Authors: Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: unlike traditional utility-maximisation, scientific community focuses, traditional utility-maximisation models, creating decision-making mechanisms, Large Language Models

Comments: 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

Abstract:As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values in text, whether they are expressed explicitly or implicitly. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or to complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the task of conceptualising human values from the task of detecting them, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

Information Retrieval

1. 【2605.28810】Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

Link: https://arxiv.org/abs/2605.28810

Authors: Audrey Chan, Aaron Labbé, Jacob Lavoie, Jordan Bannister, Arsène Fansi Tchango, Guillaume Lajoie, Laurent Charlin

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Sound (cs.SD)

Keywords: Functional music applications, distinctive recommendation problem, Music Recommendation System, listener affective state, Functional music

Comments: none

Abstract:Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener's affective state, but online experimentation on emotion is ethically constrained, particularly for clinical populations who cannot reliably skip a song or report distress. We describe AMRS, the Affective Music Recommendation System deployed on LUCID's health-and-wellness platforms, which serve clinical users (primarily older adults with neurocognitive conditions) and consumer-wellness users across energize, focus, calm, and sleep modes. AMRS is built around a rollout-based world model: a causal transformer trained on logged listening data to jointly predict engagement, binary rating, and self-reported valence and arousal. The world model serves both as an in-silico simulator for offline policy training and as a stress-testing tool before deployment. A recommender policy initialized by behaviour cloning is fine-tuned offline with Direct Preference Optimization (DPO) against a configurable multi-objective utility function. Under a strict cold-start protocol, the world model predicts both behavioural and affective signals with usable fidelity; DPO improves predicted valence and arousal over the cloned baseline while maintaining a similar diversity profile and avoiding the distributional collapse produced by greedy optimization. We position the work as an early deployed validation of a methodology for affective recommendation when online experimentation is ethically untenable.
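
The abstract states that the recommender policy is fine-tuned offline with Direct Preference Optimization (DPO). The sketch below computes the standard per-pair DPO loss from sequence log-probabilities. How AMRS builds preference pairs from its multi-objective utility is not described in the abstract, so the inputs here are plain numbers, and the `beta` value and argument names are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    The `logp_*` values come from the policy being trained, the `ref_logp_*`
    values from the frozen reference policy (here, the behaviour-cloned one).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen item's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The reference terms keep the fine-tuned policy anchored to the cloned baseline, which fits the abstract's observation that DPO avoids the distributional collapse seen with greedy optimization.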

2. 【2605.28806】Personal Visual Memory from Explicit and Implicit Evidence

Link: https://arxiv.org/abs/2605.28806

Authors: Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: remain largely text-centric, methods remain largely, largely text-centric, methods remain, remain largely

Comments: Project Page: [this https URL](https://viettmab.github.io/visualmem-page/)

Abstract:Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states -- both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual--text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

3. 【2605.28787】Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Link: https://arxiv.org/abs/2605.28787

Authors: Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Google Dataset Search, Semantic Agent, Large Language Models, Baseline Agent, semantic

Comments: none

Abstract:In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like this http URL has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using this http URL. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.

4. 【2605.28641】Subtraction Gets You More: Gap-Aware Retrieval for Multimodal Multi-Hop QA

Link: https://arxiv.org/abs/2605.28641

Authors: Sunah O, Jay-Yoon Lee

Categories: Information Retrieval (cs.IR)

Keywords: multi-hop question answering, multimodal multi-hop question, evidence set completion, retrieving missing evidence, initial retrieval stage

Comments: none

Abstract:In multimodal multi-hop question answering, we focus on the initial retrieval stage via two distinct tasks: (1) evidence set completion, retrieving missing evidence given context, and (2) sequential pool construction, iteratively building the top-$K$ pool from scratch. Under these settings, we point out that conventional iterative retrieval frameworks often suffer from Semantic Anchoring, where previously fetched evidence traps the retriever and yields entity-centric redundancy. To break this trap, we propose GRAIL (Gap-aware Retrieval via Adaptive Implicit Localization), a paradigm that performs implicit query rewriting directly at the embedding level. Through context-subtractive query steering, GRAIL excels at compositional cross-modal reasoning, while additive embedding updates show strength on localized information aggregation. By dynamically routing queries based on task type, our Hybrid Framework achieves a 40.3% macro-averaged performance gain on MultimodalQA. Extensive evaluations demonstrate that sequential GRAIL retrieves in a superior, noise-resilient manner, significantly expanding the search horizon through iterative gap-aware optimization.
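
The abstract describes context-subtractive query steering at the embedding level without giving the update rule. The sketch below shows one plausible form of the idea: subtract a scaled mean of the already-retrieved evidence from the query embedding so that redundant candidates score lower. The formula, the `alpha` weight, and all function names are assumptions for illustration, not GRAIL's actual method.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtractive_query(query, retrieved, alpha=0.5):
    """Hypothetical steering rule: q' = normalize(q - alpha * mean(retrieved))."""
    if not retrieved:
        return normalize(query)
    context = mean_vector(retrieved)
    return normalize([q - alpha * c for q, c in zip(query, context)])


# Toy 2-D setup: the query needs two facts, one along each axis.
query = [1.0, 1.0]
already_found = [[1.0, 0.0]]   # evidence covering the first fact
redundant = [1.0, 0.0]         # another passage about the first fact
missing = [0.0, 1.0]           # passage covering the second fact

steered = subtractive_query(query, already_found)
```

Before steering, both candidates are equally similar to the query; after steering, the passage covering the missing fact scores higher, which is the anti-anchoring behaviour the abstract attributes to the subtractive update.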

5. 【2605.28565】Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Link: https://arxiv.org/abs/2605.28565

Authors: Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee

Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: rarely verify, verify the cited, cited pages, citation, search-augmented LLMs rely

Comments: Work in progress

Abstract:Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled. Yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

6. 【2605.28522】Search for Coverage: Learning Coverage-Aware Retrieval with Augmented Sub-Question Answerability

Link: https://arxiv.org/abs/2605.28522

Authors: Jia-Huei Ju, Eugene Yang, Trevor Adriaanse, Suzan Verberne, Andrew Yates

Categories: Information Retrieval (cs.IR)

Keywords: Long-form Retrieval-Augmented Generation, Retrieval-Augmented Generation, comprehensive relevant nuggets, comprehensive output, brings the challenge

Comments: none

Abstract:Long-form Retrieval-Augmented Generation (RAG) brings the challenge of coverage-based ranking, because ranking methods must ensure the inclusion of comprehensive relevant nuggets (i.e., facts), which can thereby be synthesized into a comprehensive output. In this work, we propose CoveR, a dense retrieval method optimized for coverage-aware retrieval scenarios (our code is available at this https URL). CoveR is a bi-encoder trained with coverage-based contrastive and distillation objectives, which enable it to capture diverse aspects of information needs. To train CoveR, we create the SCOPE dataset (our training data is available at this https URL), which comprises 90K training pairs from Researchy Questions with synthetic coverage signals augmented from sub-question answerability judgments generated by LLMs. Our empirical experiments show that CoveR enhances nugget coverage by 10% over strong dense retrieval baselines without sacrificing its relevance-based retrieval capability. Further ablation studies validate the importance of our proposed learning method, showing that CoveR achieves a superior trade-off between relevance- and coverage-based ranking, which is essential for long-form RAG.

7. 【2605.28510】Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Link: https://arxiv.org/abs/2605.28510

Authors: Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, language models, software development, authorship attribution

Abstract: Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, yet inspection requires comparing code fragments against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows down a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
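
The re-ranking stage described in this abstract (scoring shortlisted candidates by fingerprint overlap) can be sketched roughly as follows. This is a minimal illustration of Winnowing-style fingerprinting, not the authors' implementation: the k-gram size `k`, the window size `w`, CRC32 hashing, whitespace tokenization, and Jaccard scoring are all assumptions chosen for brevity, and the `candidates` list merely stands in for whatever a vector search would return.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (CRC32 keeps results reproducible)."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]


def winnow(tokens, k=5, w=4):
    """Return the set of (hash, position) fingerprints selected by Winnowing."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    fingerprints = set()
    # Slide a window of w hashes; a short input is treated as a single window.
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, keep the rightmost minimum in the window.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def fingerprint_similarity(a, b, k=5, w=4):
    """Jaccard overlap of the selected hash values (positions are ignored)."""
    fa = {h for h, _ in winnow(a, k, w)}
    fb = {h for h, _ in winnow(b, k, w)}
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def rerank(query, candidates, k=5, w=4):
    """Order already-shortlisted candidates by fingerprint overlap with the query."""
    return sorted(
        candidates,
        key=lambda cand: fingerprint_similarity(query, cand, k, w),
        reverse=True,
    )


query = "def add ( a , b ) : return a + b".split()
other = "for x in range ( 10 ) : print ( x )".split()
best = rerank(query, [other, query])[0]
```

Because only a small shortlist reaches `rerank`, the exact fingerprint comparison no longer has to scan the full corpus, which is the motivation the abstract gives for the two-stage design.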

8. 【2605.28493】Looking Farther with Confidence: Uncertainty-Guided Future Learning for Sequential Recommendation

Link: https://arxiv.org/abs/2605.28493

Authors: Ziqiang Cui, Xing Tang, Peiyang Liu, Xiaokun Zhang, Shiwei Li, Xiuqiang He, Chen Ma

Categories: Information Retrieval (cs.IR)

Keywords: Sequential recommendation effectively, dynamic user interests, face challenges related, Sequential recommendation, models dynamic user

Abstract:Sequential recommendation effectively models dynamic user interests but continues to face challenges related to data sparsity. While self-supervised learning has alleviated this issue to some extent, most existing methods focus exclusively on immediate next-item prediction during training, thereby neglecting the rich information embedded in longer-term future interactions. Although a few studies have explored the utilization of future data, existing attempts typically apply future supervision signals with uniform intensity across all samples, which may lead to suboptimal solutions. In this paper, we propose an adaptive future learning framework, UFRec, which encourages the model to look further ahead when it is confident in the current state, while focusing on the immediate task when it is uncertain. Specifically, UFRec incorporates an Uncertainty-Guided Future Supervision module that dynamically modulates the weight of multi-step future supervision based on the model's confidence in the primary next-item prediction task. Furthermore, we complement step-wise future supervision with a Future-Aware Contrastive Learning module that treats the future trajectory as a holistic entity. Notably, both auxiliary modules are utilized exclusively during training and incur no inference overhead. Extensive experiments on four benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches by effectively leveraging future data.
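
One way to picture the confidence-dependent weighting this abstract describes is the sketch below. The abstract does not say how confidence is measured or how the losses are combined, so everything here is an assumption made for illustration: confidence is taken as one minus the normalised entropy of the next-item distribution, and `lam` is a hypothetical scaling factor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(probs):
    """One minus normalised entropy: 1.0 for one-hot, 0.0 for uniform."""
    if len(probs) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def combined_loss(next_item_loss, future_losses, probs, lam=0.5):
    """Next-item loss plus the mean future-step loss, scaled by confidence."""
    if not future_losses:
        return next_item_loss
    weight = lam * confidence(probs)
    return next_item_loss + weight * sum(future_losses) / len(future_losses)


# A confident prediction lets the future-step losses contribute more.
probs = softmax([4.0, 0.5, 0.1, 0.1])
loss = combined_loss(2.0, [1.0, 3.0], probs)
```

With a uniform next-item distribution the future term vanishes and training falls back to plain next-item prediction, which matches the behaviour the abstract attributes to the uncertain case.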

9. 【2605.28483】From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

Link: https://arxiv.org/abs/2605.28483

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Learning Management Systems, Linking learning resources, Management Systems, Linking learning, enabling competency-based search

Abstract: Linking learning resources to a structured competency framework is key to enabling competency-based search and curriculum analytics in Learning Management Systems (LMS). However, manual tagging is labor-intensive, and fully automatic methods often lack transparency. In this paper, we present an end-to-end alignment pipeline that uses a large language model (LLM) as a constrained, evidence-producing tagger. LMS resources (both instructional content and assessments) are first segmented into meaningful pedagogical fragments. For each fragment, a small set of candidate competencies is retrieved from structured competency profiles enriched with graph-based context. The LLM then selects the most relevant competencies from this set and provides supporting evidence spans from the fragment text. These predictions are refined using the structure of the competency graph and aggregated at the resource level. We evaluate our approach on a dataset built from the Computer Science department's competency referential at the Université de Technologie de Compiègne (UTC), covering 22 competencies across multiple course materials. Our LLM+BM25+Graph (LBG) pipeline achieves strong results, with a micro-F1 of 0.57 and macro-F1 of 0.50 at the fragment level, 0.51 macro-F1 at the resource level, and an MRR of 0.82, outperforming zero-shot and few-shot LLM variants, retrieval/similarity baselines, and supervised classifiers, while also producing more mechanically traceable evidence spans to support human auditing and educational analysis.
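
For readers comparing the micro-F1 and macro-F1 figures reported here, the short sketch below shows how the two averages are computed from per-label counts. The counts are invented for illustration and are not taken from the paper; the point is only that a rare, poorly predicted label pulls macro-F1 down more than micro-F1.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each label to a (tp, fp, fn) tuple; returns (micro, macro)."""
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent labels dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts: one frequent label predicted well, one rare label predicted badly.
counts = {"frequent_competency": (8, 2, 2), "rare_competency": (1, 3, 3)}
micro, macro = micro_macro_f1(counts)
```

In this toy case micro-F1 is about 0.64 and macro-F1 is 0.525, the same direction of gap as between the reported fragment-level scores.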

10. 【2605.28222】Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

Link: https://arxiv.org/abs/2605.28222

Authors: Evgenii Palnikov, Elizaveta Gavrilova

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: documentation-grounded retrieval-augmented generation, Low-Rank Adaptation, Reciprocal Rank Fusion, retrieval-augmented generation, documentation-grounded retrieval-augmented

Comments: 13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at [this https URL](https://github.com/EugPal/rag-lora-tradeoffs)

Abstract:We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at this https URL.
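
The hybrid-retrieval pipeline mentioned in this abstract merges dense and sparse rankings with Reciprocal Rank Fusion before reranking. A minimal sketch of that fusion step is given below. The constant `k=60` is the value commonly used in the literature and is an assumption here, as are the document IDs; the abstract does not state the paper's own settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["a", "b", "c"]   # hypothetical dense-retriever output
sparse_ranking = ["b", "c", "a"]  # hypothetical sparse-retriever output
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Since the fusion uses ranks rather than raw scores, the dense and sparse retrievers do not need comparable score scales.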

11. 【2605.28187】Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Link: https://arxiv.org/abs/2605.28187

Authors: Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)

Keywords: Large language models, Large language, expert in academia, Large, audits remain English-centric

Comments: 25 pages (10 main, 2 references, 13 appendix), 6 figures in main, 13 figures in appendix (under review)

Abstract: Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single-discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

12. 【2605.28175】Mixture-of-Experts Knowledge Graph Retrieval-Augmented Generation for Multi-Agent LLM-based Recommendation

Link: https://arxiv.org/abs/2605.28175

Authors: Shijie Wang, Chengyi Liu, Yujuan Ding, Shanru Lin, See-Kiong Ng, Xu Xin, Wenqi Fan

Categories: Information Retrieval (cs.IR)

Keywords: Large language models, understand user intent, item semantics, Large language, recently been adopted

Comments: Accepted by KDD 2026 Research Track

Abstract:Large language models (LLMs) have recently been adopted for recommendations due to their ability to understand user intent and item semantics. However, LLM-based recommender systems often rely on parametric knowledge and suffer from outdated knowledge, motivating knowledge graph retrieval-augmented generation (KG-RAG) to ground recommendations on structured, up-to-date KGs. Despite this promise, effective KG-RAG in recommendations faces great challenges. First, users' queries vary in complexity and require KG knowledge at different granularities, whereas existing methods adopt a one-size-fits-all retrieval strategy, leading to over-retrieval for simple queries and under-retrieval for complex ones. In addition, augmenting LLMs with KG knowledge requires translating graph-structured data into linear text, which may introduce noise and cause structural information loss. Moreover, the selection of retrieval granularity lacks direct supervision and must be inferred from the final recommendation after alignment and downstream utilization, making query-aware retrieval hard to learn end-to-end. To address these issues, we propose MixRAGRec, a cooperative multi-agent framework for KG-RAG recommendations. MixRAGRec integrates a Mixture-of-Experts Retrieval Agent that routes each query to a KG retrieval expert with different granularities, a Knowledge Preference Alignment Agent that converts structured knowledge into LLM-friendly natural language, and a Contrastive Learning-reinforced Recommendation Agent trained with contrastive preference feedback. Notably, we introduce Mixture-of-Experts Multi-Agent Policy Optimization (MMAPO) to train three agents under a unified objective. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework.

13. 【2605.28112】A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

Link: https://arxiv.org/abs/2605.28112

Authors: Junjie Mu, Qiongxiu Li

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Federated Retrieval-Augmented Generation, Retrieval-Augmented Generation, data remain local, raw data remain, remain local

Comments: Under review. Code available at [this https URL](https://github.com/Junjie-Mu/routing-hijacking-fedrag)

Abstract:Federated Retrieval-Augmented Generation (FedRAG) is attractive for privacy-sensitive applications because raw data remain local. As a result, routing must rely on client-provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing-stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high-stakes MedQA-USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine-robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust-aware post-routing framework that reweights clients using returned-evidence feedback, including retrieval relevance, profile consistency, and cross-client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.

14. 【2605.28074】SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning

Link: https://arxiv.org/abs/2605.28074

Authors: Jiachen Qian

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: mitigates LLM hallucinations, Coordinated Beam Search, corpus integrity, critical vulnerability, hijacks RAG systems

Comments: 12 pages, 4 figures, KDD '26 camera-ready version

Abstract:Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: corpus integrity. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, while maintaining near-benign perplexity. Cross-model evaluation across four target LLMs shows nontrivial effectiveness under a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combined retrieval-side and generation-side defenses reduce attack success substantially but incur a latency trade-off. Human evaluation shows substantially lower flag rates than disfluent baselines, while remaining numerically more suspicious than benign content at the current sample size.

15. 【2605.28062】ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

Link: https://arxiv.org/abs/2605.28062

Authors: Taiheng Pan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: long-term memory retrieval, conversational long-term memory, cross-encoder teacher supervision, reranker for conversational, conversational long-term

Comments: 15 pages. Technical report

Abstract:We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory's learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.

16. 【2605.28017】Can It Reach the Generator? Investigating the Survival of Prompt-Injection Attacks in Realistic RAG Settings

Link: https://arxiv.org/abs/2605.28017

Authors: Yu Yin, Shuai Wang, Bevan Koopman, Guido Zuccon

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: Recent generative engine, LLM recommendation list, generative engine optimisation, Recent generative, strongest attacks reporting

Comments: 18 pages, 6 figures

Abstract: Recent generative engine optimisation (GEO) research has shown that prompt-injection attacks can push a target product to the top of an LLM's recommendation list, with the strongest attacks reporting around 80% success and raising serious security concerns about RAG-based recommendation. However, these results assume the attacked document is always fed directly to the generator, bypassing the retriever and reranker. This is unrealistic: in deployed RAG systems, the attack modifies the document content, which can in turn change whether the document is retrieved and reranked highly enough to reach the generator at all. In this paper, we re-evaluate seven GEO attacks under a realistic three-stage pipeline (retriever → LLM reranker → LLM generator). We find that prior protocols substantially overstate attack effectiveness: gradient-based and instruction override attacks largely collapse before reaching the generator, and only LLM-driven prompt injections remain effective end-to-end. Our analysis further reveals that current GEO attacks are easily detectable: a lightweight prompt-injection guard finetuned on a small attack dataset already detects every attack. Our code and data are available at this https URL.

17. 【2605.27951】Beyond Similarity: Task-Aligned Retrieval for Language Models

Link: https://arxiv.org/abs/2605.27951

Authors: Zhixing Sun, Shenghe Xu, Tao Li

Categories: Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation, implicitly assuming, reliable indication, ranks passages, Retrieval-augmented

Abstract: Retrieval-augmented generation (RAG) ranks passages by semantic similarity to the input, implicitly assuming that semantic similarity is a reliable indication of applicability in downstream tasks. This assumption breaks down when task success depends not on topical relevance but on applying the correct rules, constraints, or procedural guidance. In such settings, the most useful context may be the rule triggered by the input rather than the most semantically similar passage. We propose Task-Aligned Retrieval (TAG), a retrieval framework that replaces similarity-based retrieval with applicability-based rule selection. TAG transforms source documents into traceable condition-action rules, identifies which rules apply to a given input through pairwise LLM judgments, and generates the output conditioned only on the selected actions. We empirically observe that across Wikipedia NPOV rewriting, HumanEval with PEP 8 compliance, and NBA transaction reasoning on RuleArena, TAG consistently outperforms standard RAG, with the largest gains in high-mismatch settings (up to 12.2%) while reducing retrieved context by up to 93%. These results suggest that, in rule- and instruction-governed tasks, retrieval should optimize for applicability rather than for semantic similarity alone.

18. 【2605.27856】Fine-Tuned LLM as a Complementary Predictor Improving Ads System

Link: https://arxiv.org/abs/2605.27856

Authors: Hui Yang, Daiwei He, Kevin Jiang, Taejin Park, Kungang Li, Jiajun Luo, Yuying Chen, Xinyi Zhang, Sihan Wang, Haoyu He, Yu Liu, Lakshmi Manoharan, David Xue, Shubham Barhate, Runze Su, Duna Zhan, Ling Leng, Siping Ji, Jinfeng Zhuang, Alice Wu, Leo Lu, Han Sun, Zhifang Liu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, real-world industry setups, production-scale real-world industry

Abstract:Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances in Large Language Models into Recommendation Systems (RecSys) gains remains rare, particularly in advertising and production-scale real-world industry setups. Prior real-world LLM successes typically fall into three buckets: (a) generative retrieval that directly predicts the next items for candidate generation, (b) late-stage re-ranking that uses LLMs, and (c) auxiliary signal enrichment with LLMs. We introduce a complementary paradigm for ads: a fine-tuned open-source LLM used not as a ranker, but as an ads-specific ancillary predictor, forecasting likely advertisers from user profiles and histories. This LLM-driven advertiser prediction augments conventional candidate generation and provides informative priors to downstream ranking. Developed in a large-scale production advertising system, our approach produces substantial offline improvements and measurable online business impact, demonstrating that LLM world knowledge and predictive capacity can be efficiently harnessed. Beyond validating LLMs for ads applications, our results show that targeted ancillary predictions can unlock end-to-end gains across both retrieval and late-stage ranking, offering a practical path to LLM-enhanced recommendation at scale.

19. 【2605.27810】LRanker: LLM Ranker for Massive Candidates

Link: https://arxiv.org/abs/2605.27810

Authors: Tao Feng, Zijie Lei, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

Categories: Information Retrieval (cs.IR)

Keywords: Large language models, high computational costs, recently shown strong, shown strong potential, capturing semantic relevance

Abstract:Large language models (LLMs) have recently shown strong potential for ranking by capturing semantic relevance and adapting across diverse domains, yet existing methods remain constrained by limited context length and high computational costs, restricting their applicability to real-world scenarios where candidate pools often scale to millions. To address this challenge, we propose LRanker, a framework tailored for large-candidate ranking. LRanker incorporates a candidate aggregation encoder that leverages K-means clustering to explicitly model global candidate information, and a graph-based test-time scaling mechanism that partitions candidates into subsets, generates multiple query embeddings, and integrates them through an ensemble procedure. By aggregating diverse embeddings instead of relying on a single representation, this mechanism enhances robustness and expressiveness, leading to more accurate ranking over massive candidate pools. We evaluate LRanker on seven tasks across three scenarios in RBench with different candidate scales. Experimental results show that LRanker achieves over 30% gains in the RBench-Small scenario, improves by 3-9% in MRR in the RBench-Large scenario, and sustains scalability with 20-30% improvements in the RBench-Ultra scenario with more than 6.8M candidates. Ablation studies further verify the effectiveness of its key components. Together, these findings demonstrate the robustness, scalability, and effectiveness of LRanker for massive-candidate ranking.

20. 【2605.27706】Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

Link: https://arxiv.org/abs/2605.27706

Authors: Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Chain-based Adaptive Reconfiguration, Chain-based Adaptive, Adaptive Reconfiguration, large language models, test-time hallucination reduction

Abstract:We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

21. 【2605.27704】Joint Optimization of Relevance and Engagement in Multi-Task Ranking for E-Commerce with Efficient LLM Supervision

Link: https://arxiv.org/abs/2605.27704

Authors: Luming Chen, Jiaqi Xi, Raghav Saboo, Kenny Chi, Martin Wang, Sudeep Das, Danny Nightingale, Aditya Dodda, Elyse Winer, Akshad Viswanathan

Categories: Information Retrieval (cs.IR)

Keywords: Optimizing industrial search, Optimizing industrial, industrial search ranking, introduces systematic biases, satisfy semantic intent

Abstract:Optimizing industrial search ranking models solely for user engagement signals often introduces systematic biases, prioritizing popular or price-anchored items that may not satisfy semantic intent. We present a production-scale multi-task ranking system that integrates semantic relevance as a primary optimization objective, enabling explicit and controllable relevance-engagement trade-offs. Our architecture employs an ordinal relevance head that predicts cumulative probabilities over relevance thresholds, preserving the inherent ordering of labels. These outputs are integrated with engagement heads through a unified value model scoring function, enabling systematic balancing of semantic quality and short-term behavioral signals. To provide high-quality supervision for this multi-task framework, we utilize fine-tuned lightweight Large Language Models (LLMs) to generate three-level ordinal relevance labels: irrelevant, moderately relevant, and highly relevant. We address challenges regarding label distribution sensitivity and ensure high alignment with human annotations to enable efficient labeling for over 100 million query-item pairs. Evaluation across offline metrics, including NDCG@10, and online A/B experiments demonstrates that our approach significantly improves semantic alignment while preserving core engagement objectives.
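
The ordinal relevance head in this abstract outputs cumulative probabilities over relevance thresholds, which are then combined with engagement predictions in a single value score. The sketch below illustrates that idea under stated assumptions: the three-level labels come from the abstract, but the monotonicity clamp, the expected-relevance reduction, the linear `value_score`, and all weights and names are hypothetical and not the paper's actual scoring function.

```python
def ordinal_to_class_probs(p_gt):
    """Convert p_gt[j] = P(label > j) into per-class probabilities P(label = 0..K)."""
    # Clamp so the cumulative probabilities never increase with the threshold.
    monotone, previous = [], 1.0
    for p in p_gt:
        previous = min(previous, p)
        monotone.append(previous)
    bounds = [1.0] + monotone + [0.0]
    return [bounds[i] - bounds[i + 1] for i in range(len(bounds) - 1)]


def expected_relevance(p_gt):
    """Expected label value under the implied class distribution."""
    return sum(level * p for level, p in enumerate(ordinal_to_class_probs(p_gt)))


def value_score(p_gt, engagement_probs, w_relevance, w_engagement):
    """Hypothetical linear value model over relevance and engagement heads."""
    relevance = expected_relevance(p_gt) / len(p_gt)  # rescale to [0, 1]
    engagement = sum(w_engagement[name] * p for name, p in engagement_probs.items())
    return w_relevance * relevance + engagement


# Labels: 0 = irrelevant, 1 = moderately relevant, 2 = highly relevant.
p_gt = [0.9, 0.4]  # P(label > 0), P(label > 1)
class_probs = ordinal_to_class_probs(p_gt)
score = value_score(p_gt, {"click": 0.2}, w_relevance=1.0, w_engagement={"click": 0.5})
```

Raising `w_relevance` relative to the engagement weights is one simple way to express the controllable relevance-engagement trade-off the abstract refers to.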

22. 【2605.27656】Developing an Intelligent Job Recommendation System Using Semantic Retrieval and Explainable AI Techniques

Link: https://arxiv.org/abs/2605.27656

Authors: Hussein Al Awad, Khaled Fathi Omar

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Online recruitment platforms, recruitment platforms require, Online recruitment, platforms require recommendation, require recommendation methods

Comments: 11 pages, 5 figures, IEEE-style paper on semantic retrieval and explainable AI for intelligent job recommendation

Abstract: Online recruitment platforms require recommendation methods capable of retrieving relevant job opportunities from large and heterogeneous collections of job postings. Keyword-based search is efficient and interpretable, but it may fail to retrieve relevant postings when equivalent roles are expressed using different terminology. This study presents a metadata-driven job recommendation system that combines TF-IDF lexical matching, Sentence-BERT semantic retrieval, query-aware filtering, optional Cross-Encoder re-ranking, and explanation generation. The proposed system utilizes structured metadata fields including job title, company name, location, seniority level, job function, employment type, and industry without relying on full job descriptions or user interaction histories. Experiments conducted on a cleaned LinkedIn job posting dataset containing 31,262 records demonstrate that the best hybrid configuration achieved a Precision at 10 score of 0.8032 and an nDCG at 10 score of 0.9496. Under the internal evaluation protocol, Cross-Encoder re-ranking improved Precision at 10 from 0.7896 to 0.7948 and nDCG at 10 from 0.9666 to 0.9739. These findings indicate that lexical and semantic retrieval techniques can be effectively combined to provide explainable job recommendations when only structured metadata is available.
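
The two metrics quoted in this abstract, Precision at 10 and nDCG at 10, can be computed as in the sketch below. It is a generic illustration and not the paper's evaluation code: it assumes linear gains, and it builds the ideal ranking from the retrieved list only, a common simplification that ignores relevant items that were never retrieved. The relevance grades are made up.

```python
import math


def precision_at_k(relevances, k=10):
    """Fraction of the top-k results with a relevance grade above zero."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


# Hypothetical relevance grades for one query's ranked results.
grades = [1, 0, 1, 0]
p_at_4 = precision_at_k(grades, k=4)
ndcg = ndcg_at_k(grades, k=4)
```

Because nDCG rewards placing relevant items early while Precision only counts them, the two scores can differ substantially for the same ranked list.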

23. 【2605.27610】Eliot: Interactively $\underline{E}$xploring Fast-Changing Scientific $\underline{Li}$terature Trends with $\underline{O}$nline Da$\underline{t}$a and Learning

Link: https://arxiv.org/abs/2605.27610

Authors: Bernardo A. Denkvitts, Nitin Gupta, Biplav Srivastava

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: fast-moving areas evolve, rapid growth, publishing has made, made it increasingly, increasingly difficult

Comments: Under review at CIKM Applied Research 2026

Abstract:The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-based assistants retrieve or summarize papers, but often hide how the corpus was selected, organized, or connected to temporal patterns. We present $\texttt{Eliot}$, a publicly deployed interactive system for traceable exploration of evolving scientific literature. Motivated by two studies on Large Language Models (LLMs) and Automated Planning and Scheduling (APS), $\texttt{Eliot}$ generalizes literature-evolution analysis beyond hand-built taxonomies and domain-specific scripts. Given explicit query terms and filters, it retrieves arXiv papers at query time, represents each paper by title and abstract, clusters the corpus into themes, assigns representative keywords, and visualizes each cluster's publication-year distribution. We evaluate $\texttt{Eliot}$ as both an applied system and an interactive research aid. An offline configuration study across eight arXiv domains compares document representations, dimensionality reduction methods, and clustering algorithms using intrinsic clustering and topic-coherence metrics; the results support MiniLM embeddings with 10-dimensional UMAP and Agglomerative Clustering as a practical default. A scenario-based survey and expert focus group assess interpretability and use contexts: participants rated cluster labels as meaningful in 85% of scenario responses, and feedback indicated that $\texttt{Eliot}$ is most valuable for auditable overviews of rapidly changing technical areas. These results suggest that query-time clustering and temporal inspection can complement search and generation tools by helping researchers inspect and refine the evidence behind literature trends.

24. 【2605.27551】On the Origin of Synthetic Information by Means of Steganographic Inheritance

Link: https://arxiv.org/abs/2605.27551

Authors: Ching-Chun Chang, Isao Echizen

Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: mystery of mysteries, origin of species, mysteries in natural, natural science, synthetic information

Abstract:The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring's life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.

25. 【2605.27494】Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Link: https://arxiv.org/abs/2605.27494

Authors: Syed Huma Shah (Duke University)

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Modern retrieval-augmented generation, deployments increasingly rely, Modern retrieval-augmented, reduce token cost, deployments increasingly

Comments: 19 pages, 9 figures, 10 tables. Code: [this https URL](https://github.com/syedhumarahim/grounded-cache-router)

Abstract:Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), the fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
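To make the four-gate idea concrete, here is a minimal sketch of a cache-admission check in Python. It is not the authors' implementation: the `CacheEntry` fields, the function names, the whitespace-token support measure, and all thresholds are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    # Hypothetical cache record; the field names are illustrative assumptions.
    answer: str
    evidence_ids: frozenset
    source_version: int


def jaccard(a, b):
    """Set overlap |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lexical_support(answer, evidence_texts):
    """Fraction of answer tokens that occur in the fresh evidence.

    A crude whitespace-token proxy (no punctuation handling), assumed here
    in place of whatever support measure the paper actually uses.
    """
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    evidence_tokens = set(" ".join(evidence_texts).lower().split())
    return len(answer_tokens & evidence_tokens) / len(answer_tokens)


def is_reuse_safe(entry, query_similarity, fresh_evidence_ids,
                  fresh_evidence_texts, current_version,
                  sim_min=0.9, overlap_min=0.5, support_min=0.8):
    """Admit the cached answer only if all four gates hold at once.

    The thresholds are made-up defaults, not values from the paper.
    """
    gates = (
        query_similarity >= sim_min,                                      # query similarity
        jaccard(entry.evidence_ids, fresh_evidence_ids) >= overlap_min,   # evidence overlap
        entry.source_version == current_version,                          # source-version validity
        lexical_support(entry.answer, fresh_evidence_texts) >= support_min,  # lexical support
    )
    return all(gates)


if __name__ == "__main__":
    entry = CacheEntry("Paris", frozenset({"d1", "d2"}), source_version=3)
    print(is_reuse_safe(entry, 0.95, {"d1", "d2"},
                        ["The capital of France is Paris"], current_version=3))
```

Requiring every gate to pass means a single stale signal, such as a bumped source version, is enough to fall back to fresh generation.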

26. 【2605.27450】Context Features Are Cheap: Rank-Aware Decomposition for Efficient Feature Interaction in Recommender Systems

Link: https://arxiv.org/abs/2605.27450

Authors: Yevgeny Tkach

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: industrial recommender systems, Modern industrial recommender, deep ranking model, context features, score N candidates

Abstract:Modern industrial recommender systems use a deep ranking model to score N candidates against the same user and context features. Standard implementations broadcast context features early in the forward pass, redundantly computing context-only operations N times per request. We present a rank-aware decomposition applicable to the dominant interaction mechanisms in modern recommender architectures, namely Factorization Machine (FM) pairwise products, Deep Cross Network (DCNv2) cross layers, self-attention, and fully connected (FC) projection layers. It is built on a single algebraic principle: any linear or bilinear operation over a rank-partitioned input admits an exact block decomposition that moves context-only computation from once-per-candidate to once-per-request, identity-equivalent to the original model. Closed-form analysis and controlled ablation verify that savings scale quadratically with the number of context features. Applied to a production DLRM-style ranker without any architectural change, the decomposition increases per-pod throughput by 87.5% (a 47% reduction in peak pod count) at identical model predictions. The identity-equivalent decomposition applies only at the first layer of cross networks and self-attention, since each layer mixes ranks in its output. To extend savings across depth, we further introduce rDCN, an architectural variant of DCNv2 that maintains rank discipline across depth and matches DCNv2 accuracy within training noise at 67% fewer total FLOPs, and sketch an analogous architectural variant for self-attention.
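The block decomposition is easiest to see for a plain linear (FC) layer, where W·[context; candidate] = W_ctx·context + W_cand·candidate. The sketch below is a pure-Python illustration under that assumption; the function names and the list-of-lists weight layout are hypothetical and not taken from the paper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fc_naive(W, context, candidates):
    """Baseline: concatenate the context onto every candidate, so the
    context-only work is repeated once per candidate."""
    return [matvec(W, context + cand) for cand in candidates]


def fc_decomposed(W, context, candidates):
    """Split W column-wise into context and candidate blocks and compute
    the context block once per request."""
    k = len(context)
    W_ctx = [row[:k] for row in W]
    W_cand = [row[k:] for row in W]
    ctx_part = matvec(W_ctx, context)  # computed once, shared by all candidates
    return [
        [c + d for c, d in zip(ctx_part, matvec(W_cand, cand))]
        for cand in candidates
    ]


if __name__ == "__main__":
    W = [[1, 2, 3], [4, 5, 6]]   # 2 outputs, 2 context features + 1 candidate feature
    context = [1, 2]
    candidates = [[3], [4]]
    print(fc_naive(W, context, candidates) == fc_decomposed(W, context, candidates))
```

With integer inputs the two paths agree exactly. With floating-point inputs the reordered summation can differ in the last bits, so a comparison would need a tolerance.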

27. 【2605.27449】Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

Link: https://arxiv.org/abs/2605.27449

Authors: Zhongtian Hua, Yi Luo, Meijia Yu, Yingjie Han

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: multimodal fact checking, claim verification process, downstream claim verification, Multimodal Large Language, DACLR

Abstract:In the field of multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the downstream claim verification process. Existing general multimodal retrieval methods are often constructed based on semantics, resulting in the retrieved evidence being similar but not relevant to the claim. This paper proposes a Dynamic Adaptive Contrastive Learning method for evidence Retrieval called DACLR to address these issues. DACLR first uses a Multimodal Large Language Model (MLLM) to uniformly convert multimodal evidence and claims into the text modality, and extracts the features of this information at the event level. Then, it conducts evidence retrieval through a two-stage recall-rerank method. DACLR enhances the model's event perception ability in the retrieval stage by optimizing the contrastive loss and mining hard negative samples. Specifically, DACLR designs three loss functions at two levels (semantic and event) based on the InfoNCE loss. Corresponding to these, three sets of hard negative sample candidates are set up. The model dynamically adjusts the ratio based on the accuracy supervision signal of intra-batch samples, allowing the model to learn the correlation between claims and positive samples at the event level without forgetting the semantic retrieval ability. Extensive comparison and ablation experiments demonstrate the effectiveness of DACLR and its internal optimization methods. Further research also proves the advantages of DACLR in the field of multimodal evidence retrieval.

28. 【2605.27445】RAGe: A Retrieval-Augmented Generation Evaluation Framework

Link: https://arxiv.org/abs/2605.27445

Authors: Larissa Guder, João Pedro de Moura, Arthur Accorsi, Gustavo Losch do Amaral, Maurício Cecílio Magnaguagno, Felipe Meneguzzi, Marcio Sorraglia Pinho, Dalvan Griebler

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Deploying Large Language, Large Language Model, outdated knowledge bases, remains challenging due, high computational demands

Abstract:Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose a modular framework for benchmarking and guiding the efficient development of RAG applications by focusing on resource telemetry and component recommendation, suggesting the best components for a domain-specific dataset. Our approach leverages core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, to evaluate trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe helps researchers identify the most effective domain-specific RAG setups for their operational needs, facilitating rapid prototyping even on consumer-grade hardware.

29. 【2605.27444】A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations

Link: https://arxiv.org/abs/2605.27444

Authors: Ruben Belo, Marta Guimarães, Cláudia Soares

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: operational guidelines, technical documentation, scientific literature, creating challenges, space operations

Abstract:The rapid expansion of space activities has led to an unprecedented accumulation of technical documentation, operational guidelines, and scientific literature, creating challenges for timely decision-making in space operations. Effective management in space operations requires tools capable of efficiently processing vast and heterogeneous information sources. This paper systematically evaluates the performance of Retrieval Augmented Generation (RAG) pipelines, combining Large Language Models (LLMs) with information retrieval techniques for extracting and synthesizing actionable knowledge from domain-specific documents. We compare various retrieval strategies, embedding models, and LLM answers to assess their impact on information accuracy, relevance, and reliability. Our results demonstrate that RAG pipelines can significantly enhance knowledge access, reduce uncertainty, and support decision-making in complex space operations.

30. 【2605.27441】A Unified Structured Query Understanding Framework for Industrial Semantic Search

Link: https://arxiv.org/abs/2605.27441

Authors: Ping Liu, Qianqi Shen, Jianqiang Shen, Chunnan Yao, Kevin Kao, Rajat Arora, Dan Xu, Baofen Zheng, Yunxiang Ren, Benjamin Le, Ali Hooshmand, Igor Lapchuk, Juan Bottaro, Raghavan Muthuregunathan, Caleb Johnson, Liangjie Hong, Jingwei Wu, Wenjing Zhang

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: large-scale industrial search, task-specific components, cascade of disparate, industrial search systems, large-scale industrial

Comments: Accepted by KDD-ADS 2026

Abstract:Query understanding in large-scale industrial search systems is typically implemented as a cascade of disparate, task-specific components. While individually optimizable, this fragmented architecture incurs high maintenance overhead and results in inconsistent behaviors, particularly for long-tail queries. In this work, we propose and deploy a unified structured query understanding system that consolidates these heterogeneous functions into a single Small Language Model (SLM) that performs schema-constrained generation. To address the data bottlenecks inherent in unified modeling, we introduce Query Illuminator, a dual-purpose framework serving as: (i) a teacher model for high-quality auto-annotation and distillation, and (ii) a surrogate judge for scalable evaluation where human labels are scarce. We validate this approach through extensive offline and online tests within LinkedIn's Job Search system. Furthermore, we demonstrate the framework's horizontal extensibility through a cross-domain case study on People Search. The results show improved user engagement and reduced operational costs, achieved while satisfying strict low-latency serving constraints on limited GPU resources.

31. 【2605.27440】Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

Link: https://arxiv.org/abs/2605.27440

Authors: Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: top CRM, CRM, phrases a question, SaaS startup, buyer phrases

Abstract:Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
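For readers unfamiliar with the metric, the recommendation-set similarity used above is the Jaccard index: the size of the intersection of two sets divided by the size of their union. The following is a small illustrative sketch; the brand names are placeholders, not data from the paper.

```python
def jaccard(set_a, set_b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Placeholder brand lists for two paraphrases of the same buying intent.
    paraphrase_1 = {"brand_a", "brand_b", "brand_c"}
    paraphrase_2 = {"brand_b", "brand_c", "brand_d", "brand_e"}
    print(jaccard(paraphrase_1, paraphrase_2))  # 2 shared / 5 total = 0.4
```

On this scale, the 0.135 to 0.288 figures reported in the abstract mean that two paraphrases share well under a third of the combined set of recommended brands.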

32. 【2605.27439】Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit

Link: https://arxiv.org/abs/2605.27439

Authors: Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: answer commercial queries, ChatGPT and Claude, directly nominating brands, search engines, assistants like ChatGPT

Abstract:AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating brands rather than returning a list of links. Marketing to AI is therefore a broader problem than "show up in search" -- positioning, content, and product fit matter as much as discoverability. We audit ~37,000 production runs across four model configurations and 215 commercially-framed prompts spanning 19 sectors, evaluated against a 533-brand reference catalog stratified into five prominence tiers (L1 category leaders to L5 regional players) sourced from external authority lists. The ladder proxies a brand's awareness footprint within its sector, not revenue or market share. The failure mode differs sharply by tier. L1 brands appear in nearly every relevant retrieval but win only 25-41% of the recommendation slots they reach -- the leverage is differentiation, not visibility. L2 challengers carry the highest conversion rates of any tier (37-52%) but lose to persona-mediated substitution on the Anthropic models. L3 mid-market brands are the inflection level: aggregate coverage drops to 88%, conversion to 34-40%, and persona effects peak. L4 specialists and L5 regional players face catastrophic invisibility -- 48-52% never surface in any of the 37,000 runs. No uniform optimization recipe wins; the right marketing investment depends on where the brand sits on the prominence ladder.

33. 【2605.27437】MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

Link: https://arxiv.org/abs/2605.27437

Authors: Tan Wang, Yunwei Dong

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, long-term dialogue agents, made significant progress

Abstract:Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness in long-term dialogue agents. External memory systems have been proposed to improve memory maintenance. However, these systems mainly rely on one-shot retrieval, which limits their ability to retrieve sufficient and relevant evidence. Although recent methods introduce reflection into retrieval, their retrieval paths are generated by the LLM from limited evidence, leading to unstable retrieval and additional latency overhead. To address these limitations, we propose MGRetrieval, a retrieval strategy that grounds reflective retrieval in the semantic structure of historical memories. Specifically, MGRetrieval consists of two steps: (1) It references the structure of historical memories to construct a more precise retrieval path. (2) The LLM retains critical memories and determines whether accumulated memories are sufficient to stop further iterative retrieval. This allows the retrieval process to follow semantically meaningful paths. Through memory-guided retrieval and critical memory propagation, MGRetrieval gradually constructs concise and sufficient memory contexts. Extensive experiments on LoCoMo show that MGRetrieval outperforms the strongest baseline by 8.91% in F1 and 11.11% in BLEU-1 on average across Qwen2.5-14B and Qwen3-14B, while maintaining practical token and latency costs. The code can be found in this https URL.

34. 【2605.27436】RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

Link: https://arxiv.org/abs/2605.27436

Authors: Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal alignment, bridging the semantic, semantic gap, gap in information, Multimodal

Abstract:Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at this https URL.
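As a rough illustration of the geometric objective, the area of the triangle spanned by three embedding vectors can be computed in any dimension from two edge vectors via the Gram determinant. The sketch below uses that standard formula; whether it matches TRIANGLE's exact formulation is an assumption, and the function name is hypothetical. Inputs are assumed to be unit-normalized embeddings of the three modalities.

```python
import math


def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c in R^d.

    Uses area = 0.5 * sqrt(|u|^2 |v|^2 - (u.v)^2) with edge vectors
    u = b - a and v = c - a.
    """
    u = [bi - ai for ai, bi in zip(a, b)]
    v = [ci - ai for ai, ci in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Clamp at zero to guard against tiny negative values from rounding.
    return 0.5 * math.sqrt(max(0.0, uu * vv - uv * uv))


if __name__ == "__main__":
    text, video, audio = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    print(triangle_area(text, video, audio))  # mutually orthogonal embeddings
    print(triangle_area(text, text, text))    # perfectly aligned embeddings
```

The area shrinks to zero only when all three points coincide or become collinear, which is why a single area term constrains the video-audio pair as well as the two text-anchored pairs.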

35. 【2605.27432】FD-RAG: Federated Dual-System Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.27432

Authors: Tianhao Gao, Kai Yang, Yiyang Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: grounding large language, large language models, systems assume centralized, Retrieval-augmented generation, existing RAG systems

Abstract:Retrieval-augmented generation (RAG) has emerged as a paradigm for grounding large language models in external knowledge, yet most existing RAG systems assume centralized knowledge access and ample computation. These assumptions break down in edge environments, where knowledge is fragmented across devices, raw data cannot be shared, and repeated LLM calls are prohibitively expensive. We propose FD-RAG, a federated dual-system RAG framework that decouples lightweight memory access from on-demand LLM reasoning for decentralized deployment. Specifically, FD-RAG learns semantic-aware adaptive hypergraphs over local corpora and distills them into compact QA memories. At inference time, it answers well-covered queries via direct memory matching and invokes LLM-based reasoning only when necessary, while tracing retrieved memories to hypergraph-grounded evidence. To mitigate cross-device knowledge fragmentation, FD-RAG aggregates anonymized memories across devices without exposing raw documents. Experiments on QA benchmarks show that FD-RAG improves accuracy by up to 7.8% while reducing latency by 8.4$\times$ compared with strong local and federated baselines. We also provide theoretical analysis establishing an $\mathcal{O}(1/\epsilon^{2})$ convergence rate for the proposed hypergraph learning, supporting its tractable deployment in edge settings.

36. 【2605.27429】Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Link: https://arxiv.org/abs/2605.27429

Authors: Wonkyun Kim, Sehyun Bae, Kwanki Ahn, Mungyu Bae, Saeun Choi, Soyeon You, Chandra Prabhakar, Sehyun Kim

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: repeat prompt construction, designs repeat prompt, richer content understanding, token generation, model invocation

Abstract:Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, token generation, model invocation, output parsing, and fallback handling for each request. In high-volume latency-sensitive services, these request-time operations complicate throughput planning, tail-latency control, capacity isolation, and predictable operation. This paper presents Ocean4Rec, a reranking layer that uses an LLM only offline to materialize item OCEAN profiles from content metadata. Items are mapped into Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores, while user profiles are built by time-decayed aggregation of recently clicked and deep-linked items in the same five-dimensional space. At request time, Ocean4Rec joins precomputed item profiles, user profiles, base recommender scores, and catalog recency, then performs numeric reranking without an LLM call. On anonymized Samsung Smart TV VOD logs, same-candidate Top1000 temporal-holdout offline evaluations show that Ocean4Rec improves NDCG@20 over a stronger non-OCEAN Base+Recency ordering by 7.6% for an NCF generator and 61.5% for a LightGCN generator. HR@20 is inconclusive for NCF and improves by 67.3% for LightGCN, reflecting sparse exact-item replay labels and the strength of recency as an industrial baseline. The result should be read as offline replay evidence for a bounded auxiliary content-taste feature that preserves the deployability advantage of a request-time-LLM-free serving path.
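The request-time path described above involves only numeric operations, which a short sketch can illustrate. Everything specific here is an assumption made for illustration: the exponential half-life decay, the cosine match between user and item profiles, the additive scoring rule, and the `alpha` and `beta` weights are not taken from the paper.

```python
import math


def user_profile(events, now, half_life_days=7.0):
    """Time-decayed mean of the five-dimensional OCEAN vectors of consumed items.

    events: list of (timestamp_in_days, ocean_vector) pairs.
    The exponential half-life decay is an illustrative assumption.
    """
    total = [0.0] * 5
    weight_sum = 0.0
    for ts, vec in events:
        w = 0.5 ** ((now - ts) / half_life_days)
        weight_sum += w
        for i in range(5):
            total[i] += w * vec[i]
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def rerank(candidates, profile, item_profiles, alpha=0.5, beta=0.1):
    """Order candidates by base score plus OCEAN match plus recency.

    candidates: list of (item_id, base_score, recency) tuples.
    item_profiles: dict of precomputed item OCEAN vectors (built offline).
    The additive rule and the weights are hypothetical.
    """
    scored = [
        (base + alpha * cosine(profile, item_profiles[item]) + beta * recency, item)
        for item, base, recency in candidates
    ]
    return [item for _, item in sorted(scored, reverse=True)]


if __name__ == "__main__":
    events = [(10.0, [1, 0, 0, 0, 0]), (3.0, [0, 1, 0, 0, 0])]
    profile = user_profile(events, now=10.0)
    items = {"a": [1, 0, 0, 0, 0], "b": [0, 1, 0, 0, 0]}
    print(rerank([("a", 0.5, 0.0), ("b", 0.5, 0.0)], profile, items))
```

Because the item profiles are looked up rather than generated, the serving path involves no LLM call, which is the deployability point the abstract emphasizes.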

37. 【2605.27392】Will AI be overconfident about academic research findings when reliant on abstracts? (v1)

Link: https://arxiv.org/abs/2605.27392

Authors: Mike Thelwall

Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)

Keywords: Large Language Models, Large Language, Language Models, DeepSeek and Gemini, including for academic

Abstract:Large Language Models (LLMs) like ChatGPT, DeepSeek and Gemini seem to be increasingly used for knowledge discovery, information retrieval, and knowledge summaries, including for academic topics. This can result in users being misled, for example by hallucinations. These problems may be exacerbated for academic knowledge if LLMs base their answers on journal article abstracts when they lack full text access. To test whether the information content of abstracts can be misleading, full text articles were submitted to GPT-OSS 120B, an LLM from OpenAI, asking it to assess separately the strength of the claims for the main result in the abstract, discussion, and conclusion. Outside the social sciences and humanities, claims tended to be stronger in the abstract and conclusions than the discussion, suggesting that relying on the strength of claims in abstracts would be misleading. Thus, if LLMs ingest abstracts but not full texts, there is a risk that they will be overconfident about the findings and pass this overconfidence on to users in response to relevant prompts. This is another reason to be cautious about using LLMs for academic-related knowledge discovery and summaries.

38. 【2605.27389】Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

Link: https://arxiv.org/abs/2605.27389

Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: educational recommender system, conditioning context shapes, teacher-facing educational recommender, context shapes personalization, recommender system

Comments: Accepted to ITS 2026

Abstract:We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual conditioning based on the current student question with memory-based conditioning using persistent learner information. Using deviation correlation and paired statistical tests, we find that contextual recommendations exhibit stronger question-level responsiveness, while memory-based recommendations exhibit history-dependent behaviors, including learner-specific differentiation under identical input. Teacher-facing evaluation signals suggest these recommendations are interpretable and actionable. These results indicate that embedding-based similarity metrics capture responsiveness to the current question but do not characterize personalization grounded in learner history, motivating behavior-level diagnostics for studying conditioning effects.

39. 【2605.27377】RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

Link: https://arxiv.org/abs/2605.27377

Authors: Yidong Gan, David D. Nguyen, Yang Lin, Peter Zhong, Thanh Vu, Long Duong, Yuan-Fang Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: coding, present RAG-Coding, agentic method, external knowledge sources, RAG-Coding

Comments: Additional experiments and analyses are in progress

Abstract:We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and grounds their coding decisions in external knowledge sources (e.g. the official coding tabular list and guidelines). By retrieving and cross-referencing relevant knowledge in these sources, the agents enhance coding accuracy and ensure clinical compliance. On the MDACE dataset, RAG-Coding outperforms the best LLM-based baseline by 8-13% in micro-F1 and 2-8% in macro-F1 across multiple LLM backbones. Compared to the state-of-the-art pretrained language model method, PLM-ICD, RAG-Coding exhibits higher micro recall (+11%), while PLM-ICD exhibits higher micro precision (+6%), yielding comparable micro- and macro-F1. Ablations show stepwise gains, highlighting the importance of incorporating external knowledge. We also release MDACE-2025, updating the original dataset with expert re-annotations based on the latest 2025 ICD-10-CM guidelines. This update features more fine-grained code labels and enables evaluation against current clinical standards.

Computer Vision

1. 【2605.28820】From Pixels to Words -- Towards Native One-Vision Models at Scale

Link: https://arxiv.org/abs/2605.28820

Authors: Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu, Yue Zhu, Zhongang Cai, Weichen Fan, Linjun Dai, Silei Wu, Xuanyu Zheng, Mingxuan Li, Yuanhan Zhang, Bo Li, Hanming Deng, Huchuan Lu, Quan Wang, Lei Yang, Lewei Lu, Dahua Lin, Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current vision-language models, inevitably fragments pixel-level, fragments pixel-level signals, early pixel-word interactions, Current vision-language

Comments: 13 pages, 6 figures

Abstract:Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: this https URL.

2. 【2605.28816】Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Link: https://arxiv.org/abs/2605.28816

Authors: Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan, Jun Gao, Igor Gilitschenski, Zian Wang, Xuanchi Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single control signal, control signal, largely focused, focused on single-agent, future observations

Comments: Project Page: [this https URL](https://research.nvidia.com/labs/sil/projects/gamma-world)

Abstract:World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable, permutation-symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
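To make the simplex idea concrete, here is a minimal sketch and not the paper's implementation. It builds the vertices of a regular simplex by centering the standard basis vectors, then checks that every pair of vertices is equally far apart, which is the property that leaves no agent ordering privileged. The function names and the choice of embedding are illustrative assumptions; the abstract does not specify how such vertices are mapped onto rotary phases in 3D RoPE.

```python
import itertools
import math


def simplex_vertices(num_agents):
    """Vertices of a regular simplex: standard basis vectors minus their centroid."""
    shift = 1.0 / num_agents
    return [
        [(1.0 if i == j else 0.0) - shift for j in range(num_agents)]
        for i in range(num_agents)
    ]


def pairwise_distances(vertices):
    """Euclidean distance for every unordered pair of vertices."""
    return [math.dist(a, b) for a, b in itertools.combinations(vertices, 2)]


# Hypothetical setting: four agents, one simplex vertex per agent.
vertices = simplex_vertices(4)
distances = pairwise_distances(vertices)
# Every pair is equally far apart, so relabeling agents leaves the geometry unchanged.
print(all(math.isclose(d, distances[0]) for d in distances))
```

Because the vertex set is symmetric under any relabeling, adding a player only means using a simplex with one more vertex, which fits the abstract's claim that no fixed agent ordering is needed.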

3. 【2605.28811】HarmoVid: Relightful Video Portrait Harmonization

Link: https://arxiv.org/abs/2605.28811

Authors: Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: target background scene, adjusting shadows, color tone, background scene, illumination intensity

Notes: CVPR 2026

Abstract:We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.

4. 【2605.28809】AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

Link: https://arxiv.org/abs/2605.28809

Authors: Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: real-world learning systems, building real-world learning, CLIP-based CIL, CIL, important in building

Notes: Accepted to ICML 2026. Code is available at [this https URL](https://github.com/LAMDA-CL/ICML2026-AREA)

Abstract:Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similarity between visual and textual embeddings obtained from template prompts, e.g., "a photo of a [CLASS]". This seemingly monolithic matching process can be decomposed into two conceptually distinct stages: attribute extraction and attribute aggregation. For example, a model may recognize cat using attributes such as fur texture and whiskers. When learning a new class like car, the model must extract additional attributes like wheels and adjust how they are aggregated in the shared representation space. However, since only data from the current task is available, incremental updates can bias both attribute extraction and aggregation toward new classes, leading to catastrophic forgetting. Therefore, we propose AREA for attribute extraction and aggregation in CLIP-based CIL. To stabilize extraction, we anchor class-level visual and textual attributes on the hyperspherical embedding space via principal geodesic analysis. To stabilize aggregation, we learn lightweight task-specific experts with scoring and residual refinement, regularized by a variational information bottleneck objective. During inference, we perform routing over task attribute manifolds via optimal transport for more concise prediction. Experiments show that AREA consistently outperforms SOTA methods. Code is available at https://github.com/LAMDA-CL/ICML2026-AREA.

5. 【2605.28806】Personal Visual Memory from Explicit and Implicit Evidence

Link: https://arxiv.org/abs/2605.28806

Authors: Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: remain largely text-centric, methods remain largely, largely text-centric, methods remain, remain largely

Notes: Project Page: [this https URL](https://viettmab.github.io/visualmem-page/)

Abstract:Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states -- both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual--text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

6. 【2605.28805】OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Link: https://arxiv.org/abs/2605.28805

Authors: Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multimodal large language, large language models, outcomes are increasingly, increasingly central, large language

Notes: ICML 2026. Project: [this https URL](https://github.com/Cominclip/OmniVerifier)

Abstract:Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

7. 【2605.28803】Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

Link: https://arxiv.org/abs/2605.28803

Authors: Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang, Ziyu Zhao, Yufei Cui, Xiao-Wen Chang, Peng Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: deployment prohibitively expensive, make on-device deployment, on-device deployment prohibitively, heads make on-device, models unify perception

Abstract:Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions, compressing the LLM backbone while leaving the DiT action head at full precision, or resorting to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at this https URL.
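The intuition behind rotating activations before low-bit quantization can be shown with a plain Hadamard transform. The sketch below is a generic illustration and not the paper's composite SVD-Hadamard rotation: it applies an orthonormal Hadamard matrix to a toy vector with a single outlier channel (hypothetical values) and shows the outlier's energy being spread across all channels while the vector norm is preserved. A flatter vector is easier to cover with a 4-bit grid.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rotate(vector):
    """Apply the orthonormal rotation H / sqrt(n) to a vector."""
    n = len(vector)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h_ij * v for h_ij, v in zip(row, vector)) for row in hadamard(n)]


def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))


activations = [8.0, 0.0, 0.0, 0.0]  # toy input with one outlier channel
rotated = rotate(activations)
# The peak magnitude drops while the norm stays the same.
print(max(abs(v) for v in activations), max(abs(v) for v in rotated))
print(l2_norm(activations), l2_norm(rotated))
```

Since the rotation is orthogonal, it can in principle be folded into adjacent weights so that the network output is unchanged. Whether and how Omega-QVLA does this is not stated in the abstract.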

8. 【2605.28780】Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

Link: https://arxiv.org/abs/2605.28780

Authors: Thomas Vitry, Kieran Edgeworth, Stefan Wermter, Jae Hee Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: achieving high in-distribution, exploit spurious correlations, high in-distribution accuracy, achieving high, distribution shift

Notes: Accepted to the 49th German Conference on Artificial Intelligence (KI2026)

Abstract:Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at this https URL.

9. 【2605.28779】The Abstraction Gap in Vision-Language Causal Reasoning

Link: https://arxiv.org/abs/2605.28779

Authors: Chinh Hoang, Mohammad Rashedul Hasan

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fluent causal explanations, distinguish linguistic plausibility, Vision-language models, Vision-language, Abstraction Gap

Abstract:Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6--8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.

10. 【2605.28741】Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Link: https://arxiv.org/abs/2605.28741

Authors: Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, true multimodal reasoning, visual search representing, visual search

Notes: Accepted at ICML 2026

Abstract:Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.

11. 【2605.28735】SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Link: https://arxiv.org/abs/2605.28735

Authors: Hongyu Wen, Jia Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transparent objects, daily life, including the transparent, common in daily, important to understand

Abstract:Transparent objects are common in daily life, and it is important to understand their multilayer depth, including the transparent surface and the objects behind it. Existing methods for multilayer depth typically extend single-layer prediction. They define layers by the front-to-back ordering of 3D points and predict the layers sequentially. However, as layered geometry can admit multiple valid groupings of 3D points into layers, a predefined grouping strategy is inherently restrictive. In this work, we propose SeeGroup, a multi-layer depth estimation method that avoids imposing a predefined grouping and allows the model itself to adaptively assign surfaces to depth maps. We formulate per-pixel multi-layer depth as a point process, treating depth layers as unordered events along each camera ray. This induces a permutation-invariant likelihood over the observed depth layers, yielding a loss that naturally supports arbitrary layer groupings. Experiments demonstrate that our method significantly advances the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on LayeredDepth benchmark from 61.34% to 70.09%. Code is available at this https URL.
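A point-process likelihood ignores layer order for a simple reason, which the textbook case makes visible. The abstract does not say which point process SeeGroup uses, so the following is only an assumed example: an inhomogeneous Poisson process with intensity $\lambda(t)$ along a camera ray of depth range $[0, T]$, with observed layer depths $t_1, \dots, t_K$ (all symbols here are illustrative, not taken from the paper).

```latex
\log p\big(\{t_k\}_{k=1}^{K}\big)
  = \sum_{k=1}^{K} \log \lambda(t_k) \;-\; \int_{0}^{T} \lambda(t)\,\mathrm{d}t
```

The sum runs over the set of events, so any reordering of $t_1, \dots, t_K$ gives the same value. This is the sense in which such a loss can support arbitrary groupings of surfaces into layers.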

12. 【2605.28691】OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Link: https://arxiv.org/abs/2605.28691

Authors: Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers achieve, Diffusion Transformers, Transformers achieve strong, video generation quality, strong video generation

Abstract:Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64$\times$ single-GPU speedup and over 1.52$\times$ eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69$\times$ and 2.27$\times$ speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

13. 【2605.28630】EntroAD: Structural Entropy-Guided Prompt Adaptation for Zero-Shot Anomaly Detection

Link: https://arxiv.org/abs/2605.28630

Authors: Xinyu Zhao, Qingyun Sun, Jiayi Luo, Jianxin Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: aims to detect, Zero-Shot Anomaly Detection, Anomaly Detection, unseen domains, Anomaly

Abstract:Zero-Shot Anomaly Detection (ZSAD) aims to detect anomalies in unseen domains without target-domain adaptation. Recent CLIP-based methods have shown promising performance by leveraging prompt learning and visual-text alignment. However, most existing approaches rely on a single adaptation pathway, which may be insufficient for heterogeneous anomaly patterns across domains. In practice, anomalies exhibit vastly different characteristics, ranging from salient, localized structural disruptions to subtle, diffuse, and irregular variations. To address this challenge, we propose EntroAD, a structural entropy-guided zero-shot anomaly detection framework. Unlike previous methods, EntroAD introduces a dynamic routing mechanism to process different types of anomalies with specialized adaptation strategies. Specifically, we estimate patch-level structural entropy from self-attention-induced patch relations and use it as a proxy for relational uncertainty to guide anomaly-aware token routing. Based on this routing signal, we construct anomaly-aware routed tokens to better capture anomaly cues with different structural characteristics. We further introduce a confidence-aware dual-branch prompt adaptation module to stabilize visual-text alignment while preserving CLIP's transferable prior. Extensive experiments on 10 industrial and medical benchmarks show that EntroAD achieves state-of-the-art performance in challenging cross-dataset ZSAD settings.
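As a rough illustration of what a structural entropy over patch relations might look like, the sketch below computes the per-node terms of the one-dimensional structural entropy of a weighted graph, where each node contributes $-(d_i / \mathrm{vol}) \log_2 (d_i / \mathrm{vol})$ for weighted degree $d_i$ and total volume $\mathrm{vol}$. This is the standard graph-theoretic definition, used here as an assumption; the abstract does not state the exact estimator, and the affinity matrix is a hypothetical stand-in for attention-derived patch relations.

```python
import math


def structural_entropy_terms(affinity):
    """Per-node terms of the one-dimensional structural entropy of a weighted graph."""
    degrees = [sum(row) for row in affinity]
    volume = sum(degrees)
    return [
        -(d / volume) * math.log2(d / volume) if d > 0 else 0.0
        for d in degrees
    ]


# Toy symmetric affinity between 4 patches (hypothetical values, no self-loops).
affinity = [
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
terms = structural_entropy_terms(affinity)
print(terms)       # one entropy contribution per patch
print(sum(terms))  # total one-dimensional structural entropy of the graph
```

A per-patch score of this kind could serve as a routing signal, with high and low values sent to different adaptation branches. That mapping is a guess at the spirit of the method, not a description of EntroAD itself.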

14. 【2605.28619】A Multiscale Kinetic Framework for Image Segmentation: From Particle Systems to Continuum Models

Link: https://arxiv.org/abs/2605.28619

Authors: Horacio Tettamanti, Giulia Guicciardi, Mattia Zanella

Categories: Computer Vision and Pattern Recognition (cs.CV); Adaptation and Self-Organizing Systems (nlin.AO)

Keywords: consensus-based image segmentation, multiscale kinetic framework, consensus-based image, multiscale kinetic, present a multiscale

Notes: 26 pages, 34 figures

Abstract:In this work, we present a multiscale kinetic framework for consensus-based image segmentation. By interpreting an image as a system of interacting particles, each pixel is characterised by its spatial position and an internal feature encoding color information. We introduce a coupled interaction scheme governing the evolution of particles in both position and feature spaces, from which we derive a kinetic formulation for the particle density in the space-feature domain combining transport, aggregation, and diffusion effects. Furthermore, through a suitable scaling, we obtain a first-order macroscopic model describing the evolution of the fraction of pixels carrying information on the fraction of pixels having a certain feature. Based on this reduced-complexity model, we present a data-oriented approach where we make use of particle-based optimisation techniques for the accurate segmentation of images. Numerical tests show the effectiveness of the proposed framework and its robustness under different noise conditions.

15. 【2605.28615】Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

Link: https://arxiv.org/abs/2605.28615

Authors: Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: covering attribute bindings, accurately reflect complex, object relationships, covering attribute, attribute bindings

Abstract:Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) remains challenging. To address this, we propose BiDPO, a framework to enhance a T2I model's capability for compositional text-to-image generation. We begin by introducing a carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strict quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which proves highly effective at improving the models' ability to follow complex text prompts during generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.
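For readers who have not met DPO, the sketch below shows the standard preference loss for a single (preferred, rejected) pair. It is the generic language-model form of the objective, written over scalar log-probabilities. It is neither Diffusion DPO, which works with denoising errors, nor the bimodal extension proposed in the paper, and the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities
    under the trained policy and under a frozen reference model."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities: the policy favors the preferred sample
# more strongly than the reference model does.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)
print(improved < neutral)
```

The loss falls only when the policy widens the gap between preferred and rejected samples relative to the reference, which is what keeps the fine-tuned model anchored to its starting point.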

16. 【2605.28609】JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models

Link: https://arxiv.org/abs/2605.28609

Authors: Jiachen Qian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detect image tampering, provide natural-language explanations, recently been developed, developed to detect, detect image

Notes: 37 pages, 6 figures. Includes supplementary material

Abstract:Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model's binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.

17. 【2605.28605】Internally Referenced Low-Light Enhancement

Link: https://arxiv.org/abs/2605.28605

Authors: Peiyuan He, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Self-supervised low-light image, external paired data, Self-supervised low-light, low-light image enhancement, Internally Referenced LLIE

Abstract:Self-supervised low-light image enhancement (LLIE) is highly appealing as it eliminates the reliance on external paired data. However, the lack of external references causes networks to struggle with decoupling entangled illumination, delicate textures, and amplified noise. To resolve this challenge, we propose an Internally Referenced LLIE framework that extracts reliable physical and structural references from the degraded input image itself. First, we introduce a local exposure-simulated scheme to extract a low-frequency pseudo ground-truth. This serves as an internal physical reference to guide global illumination estimation and correct color casts. Second, we propose a dual-domain preservation strategy with spatial and spectral constraints to construct internal structural references. Specifically, an Illumination-Aligned Perceptual loss preserves global structures under illumination shifts, while a Shift-Invariant Spectral Correlation loss captures fine-grained local structures and suppresses high-frequency noise. Finally, we propose a Gain-Adaptive Feature Modulation (GAFM) mechanism to address highly spatially-variant residual noise. By transforming the self-estimated illumination map into an internal spatial gain prior, GAFM dynamically guides a blind-spot network for spatially-aware denoising. Extensive experiments demonstrate that our method achieves state-of-the-art performance, delivering superior noise suppression and textural fidelity. Code will be publicly released at this https URL.

18. 【2605.28604】Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Link: https://arxiv.org/abs/2605.28604

Authors: Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang, Zheng Wang, Xin Xu, Mang Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: automated video editing, intelligent surveillance, Identifying key individuals, scenes is essential, essential for applications

Abstract:Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at this https URL.

19. 【2605.28587】Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation

Link: https://arxiv.org/abs/2605.28587

Authors: Yang Gao, Wuyang Li, Po-Chien Luan, Alexandre Alahi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safe autonomous driving, environments is essential, autonomous driving, essential for safe, safe autonomous

Notes: CVPR 2026

Abstract:Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing weakly supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under weak supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding. The code is publicly available: this https URL

20. 【2605.28551】Resolution-free neural surrogates for geometric parameterization and mapping with spatially varying fields

Link: https://arxiv.org/abs/2605.28551

Authors: Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: require computing spatial, spatial transformations induced, computing spatial transformations, imaging problems require, problems require computing

Abstract:Many imaging problems require computing spatial transformations induced by spatially varying intensity, feature, or density fields. Canonical examples include distortion correction, deformable image registration, atlas-based segmentation, and deformation-driven image analysis. These tasks can be formulated as geometric mapping problems in which the transformation is constrained to preserve local structure, control boundary behavior, or regulate angular distortion. Such formulations typically lead to variational models, diffusion processes, or elliptic partial differential equations. However, repeatedly solving high-resolution systems becomes computationally expensive when the underlying parameter fields vary across instances. In this work, we propose a resolution-free neural surrogate for geometric parameterization and mapping problems. Given a spatially varying parameter field $p:\Omega\to\mathbb{R}^m$ and query locations $\{x_i\}_{i=1}^N\subset\Omega$, the model predicts mapped locations $\{u(x_i)\}_{i=1}^N$ on arbitrary structured or unstructured point sets. To avoid dependence on a fixed grid, we use a multi-resolution geometric encoding strategy that conditions the network on coordinate-augmented samples of the parameter field. The model is trained without labeled solution data by enforcing geometry-aware constraints derived from variational energies, diffusion-based density equalization, and quasi-conformal theory. Experimental results on quasi-conformal mapping and density-equalizing mapping problems are presented to demonstrate the effectiveness of our proposed method.

21. 【2605.28548】GEM: Generative Supervision Helps Embodied Intelligence

Link: https://arxiv.org/abs/2605.28548

Authors: Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated impressive performance, Generative-supervised Embodied vision-language, generalization in robotics, demonstrated impressive, impressive performance

Comments: Project Page: [this https URL](https://zhaorw02.github.io/GEM/)

Abstract:Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at this https URL

22. 【2605.28544】DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Link: https://arxiv.org/abs/2605.28544

Authors: Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: important basis, Pretrained foundation models, driving, Pretrained foundation, Pretrained

Abstract:Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

23. 【2605.28495】Janus-LoRA: A Balanced Low-Rank Adaptation for Continual Learning

Link: https://arxiv.org/abs/2605.28495

Authors: Cheng Chen, Pengpeng Zeng, Yuyu Guo, Lianli Gao, Hengtao Shen, Jingkuan Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paradigm for Continual, Low-Rank Adaptation, Continual Learning, promising paradigm, Adaptation

Comments: 9 pages, International Conference on Machine Learning

Abstract:Low-Rank Adaptation (LoRA) has emerged as a promising paradigm for Continual Learning. It independently updates its low-rank factors ($A$ and $B$), creating a composite update to the full weight matrix through their interaction. To prevent catastrophic forgetting, this update should remain orthogonal to the task-specific subspace that contains previously learned knowledge. However, we identify that this composite update systematically violates this orthogonality, reintroducing interference and undermining stability. Furthermore, naively enforcing this orthogonality compromises plasticity, disrupting the delicate stability-plasticity trade-off. To resolve these issues, we propose Janus-LoRA, a framework that restores this balance through two novel components. Specifically, we first introduce Gradient Rectification, a closed-form solution that mathematically decouples LoRA's factor updates, enforcing orthogonality against the historical knowledge subspace identified by an efficient Online Estimation. Next, to enhance plasticity, we introduce a Decoupled Margin Loss that promotes feature-level separation by pushing new feature representations away from old ones, thus creating distinct, low-interference regions for new learning. Comprehensive experiments on challenging benchmarks demonstrate that by harmonizing parameter-level orthogonality with feature-level separation, Janus-LoRA achieves a superior balance and establishes new state-of-the-art performance.
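
The orthogonality requirement described in this abstract can be illustrated with a plain subspace projection. This is a minimal sketch, not the paper's Gradient Rectification (the abstract gives no formula for it): it only shows the generic step of removing the part of an update that lies inside a subspace of previously learned directions. The function name `project_out` and the toy vectors are assumptions made for illustration.

```python
def project_out(update, basis):
    """Remove from `update` its component inside span(basis).

    Assumption of this sketch: `basis` is a list of orthonormal vectors
    describing the subspace that holds earlier knowledge.
    """
    residual = list(update)
    for u in basis:
        # Coefficient of the residual along the basis direction u.
        coeff = sum(r * ui for r, ui in zip(residual, u))
        # Subtract that component so the residual is orthogonal to u.
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return residual


# Toy example: the "old knowledge" subspace is the first coordinate axis.
old_subspace = [[1.0, 0.0, 0.0]]
update = [0.5, -2.0, 3.0]
rectified = project_out(update, old_subspace)
```

After projection, `rectified` has no component along the old direction, which is the property the abstract says a naive LoRA composite update fails to keep.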

24. 【2605.28491】DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

Link: https://arxiv.org/abs/2605.28491

Authors: Kaiyang Ji, Bingsheng Qian, Binghuan Wu, Kangyi Chen, Ye Shi, Jingya Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including tempo shifts, audio-responsive character control, generate coherent full-body, coherent full-body motion, interactive frame rates

Comments: Accepted by ICML 2026

Abstract:We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.

25. 【2605.28490】SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Link: https://arxiv.org/abs/2605.28490

Authors: Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: localizes referred objects, grounding localizes referred, scene from natural, natural language, localizes referred

Abstract:3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.

26. 【2605.28477】SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

Link: https://arxiv.org/abs/2605.28477

Authors: Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: monocular sequences relies, Self-supervised depth estimation, depth, estimation from monocular, joint learning

Comments: Accepted by IEEE RA-L 2026

Abstract:Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research done to improve the depth network, efforts on the pose remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the estimated scene scales by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at this https URL .

27. 【2605.28459】REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection

Link: https://arxiv.org/abs/2605.28459

Authors: Jun Zhou, Bingwen Hu, Yaxiong Wang, Zhedong Zheng, Yongzhen Wang, Yuchen Zhang, Ping Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localize tampered regions, Multimodal manipulation detection, imperceptible manipulation traces, simultaneously identify forged, memorizing isolated artifacts

Comments: 11 pages, 3 figures

Abstract:Multimodal manipulation detection aims to simultaneously identify forged image-text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL (Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image-text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables training-free domain adaptation by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at this https URL.

28. 【2605.28456】Diffusion Large Language Models for Visual Speech Recognition

Link: https://arxiv.org/abs/2605.28456

Authors: Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Keywords: Existing Visual Speech, Visual Speech Recognition, Existing Visual, Speech Recognition, Visual Speech

Comments: Code: [this https URL](https://github.com/JeongHun0716/dllm-vsr)

Abstract:Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. We further observe a performance gap with oracle-length decoding, which assumes access to the true transcript length, indicating that reducing target-length uncertainty can improve DLLM-based VSR. To reduce this gap, we develop length-guided candidate decoding, which uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state-of-the-art WER of 19.5% on LRS3 using only its labeled training data.
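
The confidence-based unmasking described in this abstract can be sketched as a small decoding loop. This is a toy illustration under stated assumptions, not the DLLM-VSR implementation: `predict` is a hypothetical callable that returns a `(token, confidence)` pair for every position, and `toy_predict` stands in for a real model with fixed guesses.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def confidence_unmask(predict, length, per_step=1):
    """Fill a fully masked sequence, committing the most confident positions first.

    `predict(tokens)` is a hypothetical model call returning one
    (token, confidence) pair per position for the current partial sequence.
    """
    tokens = [MASK] * length
    order = []  # positions in the order they were committed
    while any(t is MASK for t in tokens):
        proposals = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Highest-confidence masked positions are committed first.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = proposals[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens):
    # Stand-in for a real model: fixed guesses with fixed confidences.
    guesses = ["see", "you", "soon"]
    confidences = [0.4, 0.9, 0.6]
    return list(zip(guesses, confidences))


decoded, commit_order = confidence_unmask(toy_predict, length=3)
```

In a real decoder the confidences would change as committed tokens provide more context; the toy predictor keeps them fixed so the commit order is easy to follow.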

29. 【2605.28450】BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers

Link: https://arxiv.org/abs/2605.28450

Authors: Jungwook Seo, Yoonsik Park, Changmin Lee, Sungyong Baik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Web power image, web services, Web data, Web, content moderation

Comments: Accepted to The Web Conference 2026 (formerly WWW) as an Oral presentation

Abstract:Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. However, the raw Web data often contain spurious correlations and social biases, and neural networks are known for their tendency to learn biases present in data. This can reinforce unfairness in web services and the web data, leading to a vicious cycle. In the context of image classification, networks learn bias attributes for a specific class when a majority of images contain the same attribute only for a given class. Hence, training a fair and debiased classifier from a biased dataset demands handling an imbalance problem between a majority of images with bias attributes (bias-aligned samples) and a minority without (bias-conflict samples). In this work, we introduce BiasEdit, a modular framework that automatically detects bias attributes from the original dataset and edits them to construct a debiased dataset. Specifically, BiasEdit first detects unknown bias attributes via statistical dependence and mutual information analysis of visual-linguistic representations, and then explicitly edits those attributes using text-guided image editing to generate realistic bias-conflict samples. Unlike prior works that assume known bias attributes or rely on synthetic mixing, our method operates without manual annotations and can leverage off-the-shelf vision-language and editing models. BiasEdit addresses a fundamental challenge in Web-sourced visual AI, mitigating dataset-induced bias and achieving state-of-the-art debiasing performance even when training data are fully biased.

30. 【2605.28442】Self-Supervised Online Robot-Agnostic Traversability Estimation for Open-World Environments

Link: https://arxiv.org/abs/2605.28442

Authors: Julia Hindel, Simon Bultmann, Houman Masnavi, Daniele Cattaneo, Abhinav Valada

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Self-supervised online traversability, Self-supervised online, efficient trajectories, continuously learn, behavior toward safe

Comments: 14 pages, 16 figures

Abstract:Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptive and inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain this http URL mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategy that preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of approximately 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.

31. 【2605.28441】Bayesian Gated Non-Negative Contrastive Learning

Link: https://arxiv.org/abs/2605.28441

Authors: Peng Cui, Jiahao Zhang, Lijie Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Contrastive Learning, self-supervised representation learning, remain highly entangled, revolutionized self-supervised representation, latent representations remain

Comments: Accepted by ICML 2026

Abstract:While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety-critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as "blue sky", are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we propose BayesNCL (Bayesian Gated Non-Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task-irrelevant, high-frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Experimental results on ImageNet-100 demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state-of-the-art baselines, yielding highly interpretable representations without compromising downstream task performance. Code is available at this https URL.

32. 【2605.28428】Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

Link: https://arxiv.org/abs/2605.28428

Authors: Jungwook Seo, Minjeong Kim, Younkwan Lee, Seungho Shin, Sungyong Baik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Detecting subtle visual, images remains challenging, subtle visual anomalies, Detecting subtle, query patch

Comments: Accepted to CVPR 2026

Abstract:Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo, that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query-to-normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve it in closed form. In particular, we do not use the optimized features themselves: the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as an anomaly measure.

33. 【2605.28422】VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Link: https://arxiv.org/abs/2605.28422

Authors: Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun, Yuxiang Cai, Jianwei Yin, Yankai Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: continuous hidden states, medical VQA, explicit tokens, avoiding the language, continuous hidden

Abstract:Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.

34. 【2605.28401】EgoRelight: Egocentric Human Capture and Illumination Recovery for Relightable and Photoreal Avatar Rendering

Link: https://arxiv.org/abs/2605.28401

Authors: Jianchun Chen, Yinda Zhang, Rohit Pandey, Thabo Beeler, Marc Habermann, Christian Theobalt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mixed Reality, virtual humans blend, humans blend indistinguishably, virtual surroundings, headsets promise

Abstract:Mixed Reality (MR) headsets promise a future of immersive telepresence where virtual humans blend indistinguishably into real or virtual surroundings. Achieving this vision requires a method for capturing a user's motion, estimating appearance under novel lighting, and understanding the environment - all from the constrained viewpoint of a head-mounted display (HMD). Existing approaches treat these as isolated problems: they either focus on driving avatars with baked-in lighting or rely on studio setups for relighting. In this paper, we present EgoRelight, a holistic framework for egocentric telepresence that simultaneously captures full-body human performance, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps from a single HMD. First, to ensure motion and surface reconstruction, we propose an egocentric perception module that leverages stereo down-facing cameras to extract dense depth maps, which serve as geometric control signals to drive a mesh-based avatar. Second, we introduce a novel neural appearance model that learns to synthesize view-dependent specular and view-independent diffuse shading separately. By employing a specialized ray-sampling strategy, our model generalizes to unseen illumination without relying on restrictive analytical BRDF priors. Third, we enable seamless avatar integration into the physical world via a test-time inverse rendering process, which recovers an HDR environment map by matching the pre-trained avatar's appearance to live egocentric camera observations. We demonstrate our system through a social telepresence application, where remote users are coherently relit according to their physical environment. Extensive experiments show that our components and the integrated system significantly outperform state-of-the-art baselines in geometric accuracy and rendering as well as relighting fidelity.

35. 【2605.28397】Adaptive Temporal Gating of Longitudinal Magnetic Resonance Imaging for Alzheimer's Prediction

Link: https://arxiv.org/abs/2605.28397

Authors: Alireza Moayedikia, Sara Fin, Alicia Troncoso Lora, Uffe Kock Wiil

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mild Cognitive Impairment, Cognitive Impairment, Mild Cognitive, Alzheimer Disease Neuroimaging, early intervention

Abstract:Predicting conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is critical for early intervention. Current deep learning paradigms predominantly rely on cross-sectional structural MRI, neglecting prognostic value in patient-specific anatomical trajectories. We introduce the Temporal Adaptive Fusion Network (TAF-Net), a hybrid CNN-Transformer architecture that models paired longitudinal 3D MRI scans. Central to TAF-Net is a Temporal Fusion Module governed by an Adaptive Temporal Gate, which learns patient-specific weightings to synthesize three spatiotemporal representations: explicit structural change, region-to-region temporal cross-attention, and bilateral feature concatenation. Evaluated on the Alzheimer's Disease Neuroimaging Initiative cohort for three-year MCI-to-AD conversion prediction, TAF-Net achieved the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching multimodal methods requiring PET, CSF, or genetic data. The architecture exhibited exceptional data efficiency, matching baseline performance with a fraction of training data. Ablation studies demonstrate that longitudinal fusion improves discrimination while reducing predictive variance by 48% compared to single-timepoint evaluation. Interpretability analyses reveal spatial attention aligned with established AD pathology in the medial temporal lobe and ventricles, while the gating mechanism prioritizes explicit volumetric change with strong positive correlation to conversion risk.
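
The idea of a gate that weights three temporal representations can be sketched as a softmax-weighted sum. This is a minimal illustration, assuming a softmax gate and equal-length feature vectors; the abstract does not specify TAF-Net's actual gate, and the names `softmax` and `gated_fusion` as well as the toy features are hypothetical.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(branches, gate_logits):
    """Blend feature vectors with per-sample gate weights.

    `branches` holds one feature vector per representation (here: change,
    cross-attention, concatenation); `gate_logits` would come from a learned
    gate in a real model and is supplied by hand in this sketch.
    """
    weights = softmax(gate_logits)
    dim = len(branches[0])
    fused = [sum(w * b[k] for w, b in zip(weights, branches)) for k in range(dim)]
    return fused, weights


# Toy features for the three representations; equal logits give equal weights.
branches = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
fused, weights = gated_fusion(branches, gate_logits=[0.0, 0.0, 0.0])
```

Raising one logit shifts weight toward that branch, which is how a per-patient gate could favor, for example, the explicit structural-change representation.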

36. 【2605.28394】Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization

Link: https://arxiv.org/abs/2605.28394

Authors: Gaurav Rai, Ojaswa Sharma

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: visual communication, effective medium, medium for visual, motion, hand-drawn sketches

Abstract:Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
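
Linear blend skinning, which this abstract uses to propagate skeletal transformations to the mesh, moves each vertex to a weighted average of where every bone's rigid transform would send it. The sketch below shows the standard formula in 2D for brevity (the paper works in 3D); the helper names and toy bones are assumptions for illustration.

```python
import math


def rigid_2d(angle, tx, ty):
    """Return a 2x3 rigid transform: rotation by `angle`, then translation."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, tx), (s, c, ty))


def apply_transform(T, p):
    x, y = p
    return (T[0][0] * x + T[0][1] * y + T[0][2],
            T[1][0] * x + T[1][1] * y + T[1][2])


def linear_blend_skinning(vertex, bone_transforms, weights):
    """Blend the per-bone transformed positions of one vertex.

    `weights` are the skinning weights of this vertex and should sum to 1.
    """
    x = y = 0.0
    for T, w in zip(bone_transforms, weights):
        px, py = apply_transform(T, vertex)
        x += w * px
        y += w * py
    return (x, y)


# Toy rig: one bone stays put, the other translates upward by 2 units.
bones = [rigid_2d(0.0, 0.0, 0.0), rigid_2d(0.0, 0.0, 2.0)]
skinned = linear_blend_skinning((1.0, 0.0), bones, weights=[0.5, 0.5])
```

A vertex influenced equally by both bones ends up halfway between the two transformed positions. Because the blend is a plain weighted sum, the whole operation stays differentiable with respect to the bone transforms.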

37. 【2605.28392】Bound-Constrained Sparse Representation for Electrical Impedance Tomography

Link: https://arxiv.org/abs/2605.28392

Authors: Chun Zhang, Dong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: electrical impedance tomography, improving conductivity estimation, bound-constrained sparse representation, framework for electrical, impedance tomography

Abstract:This study proposes a bound-constrained sparse representation (BC-SR) framework for electrical impedance tomography (EIT), aimed at improving conductivity estimation without explicit regularization. BC-SR adopts a representation-driven strategy, generating conductivity from low-dimensional latent variables via an implicit composite parameterization. Structural priors are embedded using a truncated graph-Laplacian basis, while a bound-preserving nonlinear mapping enforces admissible conductivity ranges and improves conditioning through implicit gradient modulation. The approach ensures robust convergence, even under noisy or incomplete data. Extensive validation on 2D/3D simulations, tank experiments, and in-vivo lung data shows that BC-SR improves physical consistency and structural fidelity, offering enhanced robustness compared to traditional methods. Additionally, BC-SR enables 3D time-difference EIT reconstruction, offering improved spatial resolution and a more coherent representation of 3D conductivity distributions, particularly for in-vivo lung data. This suggests potential for improved performance in EIT, particularly in clinical applications for respiratory monitoring.
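
The abstract does not say which bound-preserving nonlinear mapping BC-SR uses. As a hedged illustration of the general idea, the sketch below assumes a scaled sigmoid, which maps any unconstrained latent value into a conductivity range and rescales gradients through the chain rule. The function names and the bounds 0.1 and 2.0 are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bounded_conductivity(z, lower, upper):
    """Map an unconstrained latent z into [lower, upper] (assumed sigmoid form)."""
    return lower + (upper - lower) * sigmoid(z)

def bounded_conductivity_grad(z, lower, upper):
    """d(conductivity)/dz: the factor that multiplies any upstream gradient."""
    s = sigmoid(z)
    return (upper - lower) * s * (1.0 - s)

# Assumed admissible range for the example.
LOWER, UPPER = 0.1, 2.0
values = [bounded_conductivity(z, LOWER, UPPER) for z in (-50.0, 0.0, 50.0)]
grad_at_zero = bounded_conductivity_grad(0.0, LOWER, UPPER)
```

With a mapping of this kind the optimizer works on `z` without any explicit projection step, and the derivative shrinks as the conductivity approaches either bound, which is one plausible reading of "implicit gradient modulation".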

38. 【2605.28348】Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

Link: https://arxiv.org/abs/2605.28348

Authors: Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object categories expressed, leveraging high-level semantic, high-level semantic object, semantic object categories, recently achieved strong

Comments: Accepted at the 2026 IEEE International Conference on Image Processing (ICIP 2026)

Abstract:Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.

39. 【2605.28331】Transfer learning RGB models to hyperspectral images with trainable tensor decompositions

Link: https://arxiv.org/abs/2605.28331

Authors: Mariette Schönfeld, Laurens Devos, Wannes Meert, Hendrik Blockeel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision networks, models' general filters, large vision, specializing their models', models' general

Abstract:Transfer learning makes it possible to use large vision networks on a variety of domains, by specializing their models' general filters to new tasks. However, these networks assume the input images to have 3 input channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image, or the model. This work proposes a novel approach that preserves the image and spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets, while retaining the spatial patterns of the original filter. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.

40. 【2605.28324】Inpainting-Style Conditional Diffusion for Multivariable Time Series Forecasting

Link: https://arxiv.org/abs/2605.28324

Authors: Kourosh Kiani, S.M. Muyeen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Probabilistic Models, Denoising Diffusion Probabilistic, framework for multivariable, multivariable time-series solar, conditional diffusion-based framework

Abstract:In this paper, we propose a novel conditional diffusion-based framework for multivariable time-series solar power forecasting. The proposed method reformulates temporal PV data as structured two-dimensional representations (images) using a sliding-window patch construction, enabling the application of Denoising Diffusion Probabilistic Models (DDPM) within a unified spatiotemporal learning paradigm. A key contribution of this work is the formulation of solar forecasting as an inpainting problem, where future time steps are treated as missing regions to be reconstructed. This is achieved through a mask-based conditional diffusion mechanism, in which historical observations are preserved as conditioning context while the target (future) region is progressively corrupted and subsequently recovered via reverse diffusion. The model learns to generate coherent future sequences conditioned on observed data, effectively performing time-series inpainting. To fully utilize all available features and ensure compatibility with U-Net architectural constraints, a zero-padding strategy is introduced to construct fixed-size inputs. The model is trained using a supervised denoising objective to predict injected noise, enabling accurate iterative reconstruction during the reverse process. Extensive experiments conducted on benchmark PV dataset, including GEFCom2014, demonstrate that the proposed approach achieves high forecasting accuracy, particularly for short-term horizons. The results highlight the effectiveness of integrating diffusion-based generative modeling with an inpainting formulation for robust, flexible, and high-fidelity solar power forecasting.
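
The mask-based conditioning described in this abstract can be illustrated with the standard DDPM forward step applied only to the masked (future) region. The sketch below is a simplification under stated assumptions: it uses a 1D window instead of the paper's 2D patches, and the function name `noise_future_only` and the example values are hypothetical.

```python
import math
import random

def noise_future_only(window, mask, alpha_bar, rng):
    """Corrupt only the target positions with the DDPM forward step.

    mask[i] == 1 marks a future step to be reconstructed, 0 marks history that
    is kept clean as conditioning context. alpha_bar is the cumulative noise
    schedule value at the chosen diffusion timestep.
    """
    noisy = []
    for value, is_target in zip(window, mask):
        if is_target:
            eps = rng.gauss(0.0, 1.0)
            noisy.append(math.sqrt(alpha_bar) * value
                         + math.sqrt(1.0 - alpha_bar) * eps)
        else:
            noisy.append(value)
    return noisy

# Assumed toy window: two observed steps followed by two steps to forecast.
rng = random.Random(0)
window = [0.2, 0.4, 0.6, 0.8]
mask = [0, 0, 1, 1]
noisy = noise_future_only(window, mask, 0.5, rng)
```

A denoiser trained on inputs like `noisy` would be asked to predict the injected noise in the masked positions, while the untouched history entries act as the conditioning signal during reverse diffusion.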

41. 【2605.28312】EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation

Link: https://arxiv.org/abs/2605.28312

Authors: Arianna Alonso Bizzi, Fernando Cladera, C. J. Taylor

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Event-based vision sensors, vision sensors offer, event-based motion estimation, sensors offer asynchronous, low-latency robotic perception

Comments: 10 pages, 5 figures. Accepted to the IEEE ICRA 2026 Workshop on Challenges and Opportunities of Neuromorphic Field Robotics and Automation

Abstract:Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm's density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
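
To make the idea of time-binned 1-bit occupancy and parallel velocity hypotheses more concrete, here is a hedged single-axis sketch using only integer operations. It is not the authors' datapath: the overlap-count scoring rule, the per-bin (rather than per-pixel) estimate, and the names `bin_events` and `best_velocity` are assumptions made for illustration.

```python
def bin_events(events, bin_duration, width):
    """Group (timestamp, x) events into fixed-duration bins of 1-bit occupancy.

    Each bin is stored as an integer whose bit x is set if pixel x fired.
    """
    bins = {}
    for t, x in events:
        if 0 <= x < width:
            index = t // bin_duration
            bins[index] = bins.get(index, 0) | (1 << x)
    return bins

def best_velocity(prev_bits, curr_bits, hypotheses, width):
    """Score each velocity hypothesis (pixels per bin) by shift, AND, and count."""
    full = (1 << width) - 1
    best, best_score = None, -1
    for v in hypotheses:
        shifted = (prev_bits << v) if v >= 0 else (prev_bits >> -v)
        score = bin(shifted & curr_bits & full).count("1")
        if score > best_score:
            best, best_score = v, score
    return best, best_score

# Assumed toy stream: three pixels fire in bin 0 and reappear two pixels
# to the right in bin 1 (timestamps in microseconds, 1 ms bins).
events = [(100, 3), (200, 4), (900, 7), (1100, 5), (1500, 6), (1900, 9)]
bins = bin_events(events, 1000, 16)
velocity, score = best_velocity(bins[0], bins[1], [-2, -1, 0, 1, 2], 16)
```

Every step here is a shift, a bitwise AND, a population count, or a comparison, which is the kind of logic that maps to shift registers, counters, and comparators without dividers or floating-point units.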

42. 【2605.28272】EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

Link: https://arxiv.org/abs/2605.28272

Authors: Bohong Chen, Yumeng Li, Yinglin Xu, Youyi Zheng, Yanlin Weng, Kun Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual assistants, Real-time synthesis, pivotal component, component for next-generation, Real-time

Comments: SIGGRAPH 2026; Project Page: [this https URL](https://robinwitch.github.io/EchoAvatar-Page)

Abstract:Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explored Reinforcement Learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with stream audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art realtime baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at this https URL.

43. 【2605.28271】LV-OSD: Language-Vision-Complementary Open-Set Object Detection

Link: https://arxiv.org/abs/2605.28271

Authors: Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Object detection, open-set object detection, image prompts, computer vision, prompts

Abstract:Object detection is an important task in computer vision, which aims to detect the objects of interest through the given category list or query images. In this work, we propose a new problem of language-visual-complementary open-set object detection (LV-OSD), i.e., using the flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts in testing. Extensive experimental results verify our problem formulation's reasonability and our method's effectiveness. Prompts and code will be released publicly.

44. 【2605.28270】Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Link: https://arxiv.org/abs/2605.28270

Authors: Leonhard Sommer, Emil Akopyan, Adam Kortylewski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image remains challenging, remains challenging, Estimating, real-world image remains, single real-world image

Abstract:Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories - two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at this https URL.

45. 【2605.28261】MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations

Link: https://arxiv.org/abs/2605.28261

Authors: Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu, Tianyuan Yao, Yuechen Yang, Yanfan Zhu, Junlin Guo, Gelei Xu, Haichun Yang, Yuankai Huo, Mert R. Sabuncu, Yihe Yang, Ruining Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: kidney functional units, pathology datasets provide, kidney functional, functional units, units is essential

Abstract:Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at this https URL.

46. 【2605.28258】GUI Agents for Continual Game Generation

Link: https://arxiv.org/abs/2605.28258

Authors: Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: game generation, GUI agent, game, treat game generation, improving game generation

Abstract:Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass-rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at this https URL.

47. 【2605.28257】Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Link: https://arxiv.org/abs/2605.28257

Authors: Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental to robotics, object, correspondence, category-level, Abstract

Comments: 14 pages, 4 figures. Data and code are publicly available at [this https URL](https://github.com/GenIntel/HouseCorr3D)

Abstract:Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space -- predicting, from a single image, 3D locations that remain consistent across instances within a category -- and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at this https URL.

48. 【2605.28241】PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

Link: https://arxiv.org/abs/2605.28241

Authors: Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng, Zhi Gao, ZhuBohong, Di Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: research remains largely, remains largely centered, Point cloud quality, research remains, cloud quality plays

Abstract:Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.

49. 【2605.28239】Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

Link: https://arxiv.org/abs/2605.28239

Authors: Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang, Zhen Cui, Chunyan Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Semi-supervised referring expression, unlabeled image-text pairs, exploiting unlabeled image-text, Semi-supervised referring, achieve precise pixel-level

Comments: 24 pages, 13 figures

Abstract:Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

50. 【2605.28237】POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

Link: https://arxiv.org/abs/2605.28237

Authors: Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Points of Interest, driven by Points, precise POI remains, remains a critical, fundamentally driven

Comments: 25 pages, 9 figures

Abstract:Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks of POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scene. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 $m^{2}$ in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework where a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

51. 【2605.28234】Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

Link: https://arxiv.org/abs/2605.28234

Authors: Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning-based radio map, radio map estimation, UAV-assisted wireless sensing, Learning-based radio, map estimation

Abstract:Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.

52. 【2605.28230】Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Link: https://arxiv.org/abs/2605.28230

Authors: Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern video generative, produce visually impressive, frequently violate basic, generative models produce, models produce visually

Abstract:Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring, and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
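
The percentages in parentheses above are relative gains, not absolute point differences. A short check in Python reproduces them from the reported before and after scores; the function name `relative_gain_percent` is just an illustrative helper.

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

# Scores as reported in the abstract for TurboWan2.2.
physics_iq_gain = relative_gain_percent(32.2, 37.5)
videophy2_gain = relative_gain_percent(45.6, 55.0)
```

Rounded to one decimal place these give 16.5 and 20.6, matching the abstract. The corresponding absolute gains are 5.3 and 9.4 points.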

53. 【2605.28229】VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

Link: https://arxiv.org/abs/2605.28229

Authors: Rui Lin, Chuanming Wang, Huadong Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large-scale Vision-Language Models, adapting large-scale Vision-Language, Vision-Language Models, pre-training technologies, adapting large-scale

Comments: CVPR2026 camera ready

Abstract:With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. To achieve superior performance, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy among recent advances. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at this https URL.

54. 【2605.28217】A Patient-Specific Pulmonary Arterial Tree Digital Twin to Extract Pulmonary Embolism Biomarkers

链接https://arxiv.org/abs/2605.28217

作者:Morgane des Ligneris,Nathan Painchaud,Allan Serva,Laurent Bertoletti,Pierre Croisille,Carole Frindel,Odyssée Merveille

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:acute cardiovascular syndrome, cardiovascular syndrome, acute cardiovascular, Qanadli and Mastora, Pulmonary embolism

备注: 11 pages + 2 pages of supplementary materials. Submitted to special issue of JBHI

点击查看摘要

Abstract:Pulmonary embolism, the obstruction of a pulmonary artery by a blood clot, is one of the leading causes of acute cardiovascular syndrome. In clinical practice, therapeutic decisions after diagnosis via computed tomography pulmonary angiography rely on risk stratification, which assigns 30-day mortality risk to one of three categories. This stratification depends on the right-to-left ventricular diameter ratio and blood levels of two cardiac enzymes. However, blood biomarkers are not always available in emergency settings, and manual calculation of established severity scores (such as Qanadli and Mastora) is time-consuming and rarely performed in routine clinical practice. This study introduces an automated pipeline that models a directed graph representation of the pulmonary arterial tree, labeling its hierarchical structure and characterizing pulmonary embolism. The pipeline derives image-based biomarkers, including local artery-level features (morphological information, hierarchical position, clot volume, and resulting obstruction) and global patient-level biomarkers such as automatically calculated severity scores (Qanadli and Mastora) and the total embolic volume distribution by lobes and hierarchical levels. Using artificial-intelligence-generated binary masks of arteries, emboli, lungs, and lobes, it creates a patient digital twin of the arterial structure. Validation of the pipeline through comparison to an existing pipeline, anatomical expectations, and manual severity score calculations demonstrates the pipeline's ability to automatically generate anatomically accurate digital twins and severity scores with strong agreement. This supports the potential of these image-derived biomarkers to automatically provide rapid, precise information on thrombotic burden and spatial clot distribution.

55. 【2605.28176】From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment

链接https://arxiv.org/abs/2605.28176

作者:Francisco Bérchez-Moreno,Riccardo Rosati,Maria Chiara Fiorentino,Víctor M. Vargas,Edoardo Cipolletta,Emilio Filippucci,Luca Romeo,Pedro A. Gutiérrez,César Hervás-Martínez

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Pyrophosphate Deposition Disease, Calcium Pyrophosphate Deposition, Background and objective, Conventional Deep Learning, Deep Learning

备注

点击查看摘要

Abstract:Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren-Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
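
The idea of replacing a one-hot target with a unimodal distribution centred on the annotated grade can be sketched as follows. This is only an illustration of the triangular case: the `width` parameter, the normalisation, and the function names are assumptions made here, not the paper's exact formulation.

```python
import math
from typing import List

def triangular_soft_label(grade: int, num_classes: int, width: int = 2) -> List[float]:
    """Unimodal target peaked at `grade`, decaying linearly to zero `width` classes away.

    `width` is a hypothetical knob for this sketch, not a parameter from the paper.
    """
    weights = [max(0.0, 1.0 - abs(j - grade) / width) for j in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cross_entropy(predicted: List[float], target: List[float]) -> float:
    """Cross-entropy between predicted class probabilities and a soft target."""
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0.0)

# Five ordinal grades (0 to 4), annotated grade 2.
target = triangular_soft_label(grade=2, num_classes=5)
print(target)  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Unlike a one-hot target, this spreads some probability onto the neighbouring grades, so predicting an adjacent grade costs less than predicting a distant one.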

56. 【2605.28174】FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

链接https://arxiv.org/abs/2605.28174

作者:Jorge L. Rodriguez,Victor Angulo Morales,Areej Alwahas,Mariana Elias Lara,Fida Mohammad Thoker,Kasper Johansen,Bernard Ghanem,Fernando T. Maestre,Matthew F. McCabe

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:current approaches depend, large pretraining datasets, remote sensing representations, transferable remote sensing, Foundation models offer

备注: 29 pages, 9 figures

点击查看摘要

Abstract:Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.

57. 【2605.28173】MangaFlow: An End-to-End Agentic Framework for Controllable Story to Manga Generation

链接https://arxiv.org/abs/2605.28173

作者:Muyao Wang,Zeke Xie,Yanhao Chen,Lixin Xiu,Hideki Nakayama

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:structured visual storytelling, visual storytelling task, requires story decomposition, storytelling task, task that requires

备注

点击查看摘要

Abstract:End-to-end manga generation is a structured visual storytelling task that requires story decomposition, recurring character and scene grounding, page layout design, panel rendering, page composition, and lettering. However, existing generative models often perform direct page synthesis, entangling these factors in a single visual output and limiting precise control over layout geometry, visual references, and cross-panel consistency. To address these limitations, we propose MangaFlow, an agentic framework for controllable long-form manga generation that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement. By treating layout and visual references as explicit intermediate variables, MangaFlow enables both simple text-to-manga generation and more precise user-controlled manga creation. This design exposes layout, visual assets, and lettering as editable intermediate controls for refining panel geometry, references, and text placement. To support long-form consistency, MangaFlow introduces a story section memory that links section descriptions with corresponding character, scene, and object references for reuse across panels. We further present a meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

58. 【2605.28167】DebFilter: Eradicating Biases Stashed in Value

链接https://arxiv.org/abs/2605.28167

作者:Seung Hyuk Lee,Songkuk Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:score-based generative models, pretrained vision-language models, multi-step denoising process, denoising process guided, text embeddings extracted

备注: 8 pages, 7 figures, supplementary material included, CVPR 2026

点击查看摘要

Abstract:Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases (such as those related to gender and age) that are subsequently propagated and amplified through the guidance mechanism. Combined with the model's training on large-scale datasets that are imbalanced with respect to these bias-related concepts, this often leads to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model's error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to a slice of the guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.

59. 【2605.28161】MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment

链接https://arxiv.org/abs/2605.28161

作者:Shurui Xu,Siqi Yang,Weiping Ding,Hui Wang,Mengzhen Fan,Yuyu Sun,Shuyan Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:injuries requires radiologists, meniscus injuries requires, integrate volumetric MRI, volumetric MRI evidence, produce structured diagnostic

备注: Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2026 (Oral Presentation)

点击查看摘要

Abstract:Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at this https URL.

60. 【2605.28157】Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning

链接https://arxiv.org/abs/2605.28157

作者:Po-Lun Chwang,Po-Yu Chang,Wen-Liang Lin,Tung-Sheng Wu,Min-Ching Wang,Yun-Chien Cheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer-aided diagnosis, system for detecting, molar-incisor hypomineralization, intraoral photographs, CAD

备注

点击查看摘要

Abstract:This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.

61. 【2605.28151】A novel ordinal multi-view aggregation scheme for oak defoliation

链接https://arxiv.org/abs/2605.28151

作者:Francisco Bérchez-Moreno,Ricardo Enrique Hernández-Lambraño,David Guijo-Rubio,Víctor Manuel Vargas,Francisco José Ruiz-Gómez,Juan Carlos Fernández,Pablo González-Moreno

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:biotic stressors threatens, Forest decline driven, making accurate monitoring, threatens ecosystem functioning, stressors threatens ecosystem

备注

点击查看摘要

Abstract:Forest decline driven by climate and biotic stressors threatens ecosystem functioning, making accurate monitoring of tree health essential. In this work, we address tree defoliation estimation as an ordinal classification problem using ground-level imagery. We propose a novel multi-view ensemble framework that aggregates predictions from Convolutional Neural Networks (CNNs) trained on different perspectives of individual trees (north, south, and crown). This approach leverages complementary visual information while preserving modelling consistency through a homogeneous ensemble design. A comprehensive evaluation is conducted by comparing multiple ordinal classification methods and analysing the contribution of each view and their combinations. Results show that modelling the ordinal structure of defoliation levels improves performance over nominal approaches, while the proposed multi-view ensemble consistently outperforms single-view and pairwise configurations. In particular, the three-view ensemble achieves the most robust and accurate predictions across all evaluation metrics. These findings highlight the potential of combining Deep Learning (DL), Ordinal Classification (OC), and multi-view aggregation for scalable, consistent, and objective forest health assessment in complex ecosystems such as Mediterranean dehesas.
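
One simple way to aggregate per-view predictions is to average the class probabilities produced by each view's model and take the most probable defoliation level. The sketch below assumes that scheme purely for illustration; the abstract does not specify the paper's actual aggregation rule, and the probabilities are made-up numbers, not results.

```python
from typing import Dict, List

def aggregate_views(view_probs: Dict[str, List[float]]) -> List[float]:
    """Average class-probability vectors across views (assumed aggregation rule)."""
    num_views = len(view_probs)
    num_classes = len(next(iter(view_probs.values())))
    return [
        sum(probs[c] for probs in view_probs.values()) / num_views
        for c in range(num_classes)
    ]

# Hypothetical outputs of three per-view CNNs over three ordinal defoliation levels.
view_probs = {
    "north": [0.1, 0.6, 0.3],
    "south": [0.2, 0.5, 0.3],
    "crown": [0.1, 0.3, 0.6],
}
mean_probs = aggregate_views(view_probs)
predicted_level = mean_probs.index(max(mean_probs))
print(predicted_level)  # 1: two of the three views favour the middle level
```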

62. 【2605.28137】No Safe Dose: How Training Data Drives Unsafe Image Generation

链接https://arxiv.org/abs/2605.28137

作者:Felix Friedrich,Lukas Helff,Niharika Hegde,Patrick Schramowski,Kristian Kersting

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:ingest unsafe content, inevitably ingest unsafe, trained on large-scale, inevitably ingest, large-scale data

备注

点击查看摘要

Abstract:Text-to-image models trained on large-scale data often inevitably ingest unsafe content. While input-output amplification has been observed, it remains unclear whether and how training data composition directly drives model output safety, or whether other factors are responsible. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). Then we generate images with the resulting models, and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates other components, e.g., the frozen text encoder, as a residual safety risk. A text encoder ablation confirms this: SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPscore and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.

63. 【2605.28136】SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

链接https://arxiv.org/abs/2605.28136

作者:Toomas Tahves,Mauro Bellone,Junyi Gu,Raivo Sell

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:multi-modal datasets lack, Zenseact Open Dataset, datasets lack pixel-level, autonomous driving, essential for autonomous

备注

点击查看摘要

Abstract:Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.

64. 【2605.28132】Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

链接https://arxiv.org/abs/2605.28132

作者:Haozhan Shen,Tiancheng Zhao,Kangjia Zhao,Jianwei Yin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video Generation Models, Spatial intelligence requires, intelligence requires visual, objects and geometric, geometric structure

备注: Code is here: [this https URL](https://github.com/om-ai-lab/Probing-VLM-VGM)

点击查看摘要

Abstract:Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in frozen representations from the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at this https URL.

65. 【2605.28125】CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes

链接https://arxiv.org/abs/2605.28125

作者:Vladislav Polianskii,Elijs Dima,Isabel Salmerón Marazuela,Gergő László Nagy,Sigurdur Sverrisson,Volodya Grancharov

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Neural Radiance Field, current Neural Radiance, Radiance Field, Neural Radiance, applications demand photorealism

备注

点击查看摘要

Abstract:Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures, requirements that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to unbounded scenes with multiple regions of interest, improving robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-induced variations. Results indicate superior performance of the proposed pipeline over the baseline NeRF models and established Structure from Motion (SfM) - Multi-View Stereo (MVS) solutions.

66. 【2605.28119】ST-ColoNet: Spatio-Temporal Colon Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning

链接https://arxiv.org/abs/2605.28119

作者:Ziyi Wang,Zhengjie Zhang,Jingsheng Gao,Dahong Qian,Suncheng Xiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:existing automatic recognition, Colo-segment recognition, leading to poor, key requirement, existing automatic

备注

点击查看摘要

Abstract:Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods use only colonoscopy images without fully exploiting temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle this problem, we curate and release a labeled dataset specifically for the task of colo-segment recognition. In addition, we propose a two-stage deep learning-based framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, a substantial improvement over prior methods.

67. 【2605.28100】Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

链接https://arxiv.org/abs/2605.28100

作者:Arthur Dérédel,Carlos Crispim-Junior,Pierre Lemaire,Johan Berthet,Laure Tougne Rodet

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:disastrous natural hazards, climate change aggravates, aggravates environmental uncertainties, natural hazards, era where climate

备注: Preprint, 19 pages, 8 figures

点击查看摘要

Abstract:In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming crucial to mitigate the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably linked to extreme shape and lighting variations. Overcoming those issues is essential to deploy them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components and assess their applicability to this context. To that end, we introduce the new dataset SeracFallDet, which contains serac fall annotations and has been thoroughly annotated to meet the latter demand. Through generalization experiments, we demonstrate that dense and semi-dense feature matching methods, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both tasks. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

68. 【2605.28091】Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

链接https://arxiv.org/abs/2605.28091

作者:Niantong Li,Guangzheng Hu,Weixu Qiao,Ying Ba,Qichen Hong,Shijun Shen,Jinlin Wang,Fan Zhou,Jianye Kang,Xin Shang,Ziyi He,Wei Wang,Dalin Li,Jiahao Li,Jie Zhang,Kaiyuan Gao,Kun Yan,Lihan Jiang,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiao Xu,Xiaoyue Chen,Yuxiang Chen,Yan Shu,Yanran Zhang,Yilei Chen,Yixian Xu,Zekai Zhang,Zhendong Wang,Zihao Liu,Zikai Zhou,Hongzhu Shi,Yi Wang,Bing Zhao,Hu Wei,Lin Qu,Chenfei Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:simple text-image alignment, longer satisfy users', satisfy users' pressing, users' pressing demands, genuine creative expression

备注

点击查看摘要

Abstract:Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

69. 【2605.28083】VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

链接https://arxiv.org/abs/2605.28083

作者:Jiyuan Fu,Kaixun Jiang,Jingkai Jia,Zhaoyu Chen,Xueyao Chen,Lingyi Hong,Shuyong Gao,Chenzhi Tan,Dingkang Yang,Wenqiang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:powerful generalist policies, patches significantly hinders, adversarial patches significantly, generalist policies, safety-critical domains

备注

点击查看摘要

Abstract:While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

70. 【2605.28056】CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

链接https://arxiv.org/abs/2605.28056

作者:He Feng,Yongjia Ma,Donglin Di,Lei Fan,Tonghua Su

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved substantial visual, lip synchronization, motion accuracy, achieved substantial, eye region

备注

点击查看摘要

Abstract:Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.

71. 【2605.28051】Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

Link: https://arxiv.org/abs/2605.28051

Authors: Landi He, Mingde Yao, Shawn Young, Lijian Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: removing redundant visual, redundant visual tokens, Vision-Language Models, redundant visual, Visual token pruning

Abstract:Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
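
The two phases described above (noise-based throttling during training, hard thresholding at inference) can be illustrated with a minimal sketch. This is not the DiffPrune implementation: the mixing rule `sqrt(score) * token + sqrt(1 - score) * noise`, the function names `throttle` and `prune`, and the threshold value are all assumptions chosen for illustration, and the mix is only variance-preserving if the token features have roughly unit variance.

```python
import math
import random


def throttle(token, score, rng):
    """Mix a token with Gaussian noise (assumed rule, not the paper's exact one).

    A score near 1 keeps the token almost intact; a score near 0 replaces it
    with noise. The squared coefficients sum to 1, so the variance is preserved
    when the token features have unit variance.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    keep = math.sqrt(score)
    drop = math.sqrt(1.0 - score)
    return [keep * x + drop * rng.gauss(0.0, 1.0) for x in token]


def prune(tokens, scores, threshold=0.5):
    """Inference-time hard thresholding: keep tokens scoring above the threshold."""
    return [t for t, s in zip(tokens, scores) if s > threshold]


rng = random.Random(0)
tokens = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]   # toy visual tokens
scores = [0.9, 0.2, 0.7]                          # toy importance scores

noisy = [throttle(t, s, rng) for t, s in zip(tokens, scores)]
print(noisy)
print(prune(tokens, scores))
```

Because the noisy token is a smooth function of the score, gradients can reach the score directly, which is the property the abstract contrasts with surrogate-gradient selection.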

72. 【2605.28036】Stay Fair! Ensuring Group Fairness in Diffusion Models Across Guidance Scales

Link: https://arxiv.org/abs/2605.28036

Authors: Myeongsoo Kim, Eunji Kim, Minwoo Chae, Sangwoo Mo

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: models steer conditional, guidance, steer conditional generation, alignment and diversity, Diffusion models steer

Comments: 28 pages, 18 figures

Abstract:Diffusion models steer conditional generation with a tunable guidance scale to trade off prompt alignment and diversity. However, existing debiasing techniques are optimized for a single scale, degrading fairness when users adjust this parameter. We trace this behavior to a previously overlooked source by decomposing total bias into two components: a model bias and a guidance bias. While prior work primarily targets the former, we show that the guidance bias grows monotonically with the guidance scale, eventually dominating the high-guidance regimes users prefer. To address this, we extend Strong Demographic Parity to guidance and derive a condition under which the target distribution retains its group ratio across guidance scales. We propose StayFair, which leverages this condition to design fair guidance algorithms in both regimes. For classifier guidance, it equalizes the classifier's output distributions across groups; for classifier-free guidance, it shifts the null embedding by a prompt-dependent offset. Because StayFair modifies only the guidance step, it is orthogonal to model debiasing and can be layered onto existing fair diffusion models to extend their fairness across guidance scales. Across class-conditional and text-to-image generation, StayFair decouples fairness from the guidance scale without sacrificing image quality.
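
For readers unfamiliar with the guidance scale mentioned above, here is a minimal sketch of standard classifier-free guidance on toy vectors. It uses the common convention `eps = eps_uncond + scale * (eps_cond - eps_uncond)`; other papers parameterize the scale differently, and the StayFair null-embedding offset itself is not implemented here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates past the conditional prediction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0, -1.0]   # toy prediction for the null (empty) prompt
eps_cond = [1.0, 1.0, 0.0]      # toy prediction for the actual prompt

for scale in (0.0, 1.0, 3.0):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

The guided prediction moves further from the unconditional one as the scale grows, which is consistent with the abstract's claim that a guidance-dependent bias component increases with the scale.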

73. 【2605.28023】VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Link: https://arxiv.org/abs/2605.28023

Authors: Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: capture visual content, visual content faithfully, omission and hallucination, Visual captioning requires, content faithfully

Comments: 28 pages, 8 figures

Abstract:Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

74. 【2605.28018】Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

Link: https://arxiv.org/abs/2605.28018

Authors: Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: UAV tracking, asymmetric UAV tracking, UAV tracking framework, weakens feature representation, reduce computation

Comments: CVPR2026 Highlight

Abstract:Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: this https URL

75. 【2605.28016】Enhancing Ultra-low-field MRI with Segmentation-guided Adversarial Learning

Link: https://arxiv.org/abs/2605.28016

Authors: James Grover, Andrew Phair, Michael Ferraro, David E.J. Waddington

Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: poor image quality, MRI offers portable, ULF Enhancement Challenge, image quality, offers portable

Abstract:Ultra-low-field (ULF) MRI offers portable and low-cost imaging but suffers from poor image quality. To address this, we present our submission to the 2025 ULF Enhancement Challenge (ULF-EnC), where the goal is to synthesise high-field-like MRIs from 64 mT scans. Our pipeline enhances ULF MRI through a combination of anatomical conditioning and model ensembling. We first generate tissue segmentation priors using a Swin UNETR trained solely on challenge-provided data. These priors condition two independent enhancement networks, a CycleGAN and a transformer-based residual enhancement model (T-REX), each trained to synthesise 3 T-like MRIs. Outputs from both models are combined using a weighted average. Our approach produces enhanced MRIs that were comparable to high-field scans both quantitatively and qualitatively.
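
The final fusion step, a weighted average of two model outputs, can be sketched as follows. The function name `weighted_average`, the flat-list image representation, and the weight value are illustrative assumptions; the abstract does not state the weights the authors used.

```python
def weighted_average(output_a, output_b, weight_a=0.5):
    """Blend two same-sized outputs voxel by voxel.

    weight_a is the weight given to output_a; output_b receives 1 - weight_a.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(output_a) != len(output_b):
        raise ValueError("outputs must have the same number of voxels")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(output_a, output_b)]


cyclegan_out = [0.0, 1.0, 0.5]   # toy intensities standing in for one model's output
trex_out = [1.0, 1.0, 0.0]       # toy intensities standing in for the other model's output

print(weighted_average(cyclegan_out, trex_out, weight_a=0.25))
```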

76. 【2605.28011】Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras

Link: https://arxiv.org/abs/2605.28011

Authors: Yudai Washida, Yuto Kase, Kai Ishibe, Ryoma Yasuda, Sakiko Hashimoto

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Quantifying impact phenomena, systems involve trade-offs, conventional measurement systems, measurement systems involve, Quantifying impact

Comments: 24 pages, 5 figures

Abstract:Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method using two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed in an integrated manner within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The bias (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
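
The bias and confidence-interval figures above come from a Bland-Altman analysis. A minimal sketch of that computation is given below, under two stated assumptions: the measurements are made-up toy values (not data from the paper), and the interval uses the normal approximation `z = 1.96` rather than a t-distribution critical value, which would be slightly wider for small samples.

```python
import math
import statistics


def bland_altman(method_a, method_b, z=1.96):
    """Return the bias, its approximate 95% CI, and the 95% limits of agreement."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    sem = sd / math.sqrt(n)           # standard error of the bias
    return {
        "bias": bias,
        "bias_ci": (bias - z * sem, bias + z * sem),
        "limits_of_agreement": (bias - z * sd, bias + z * sd),
    }


# Hypothetical impact times in ms for the two measurement systems.
event_camera = [10.0, 12.0, 9.0, 13.0]
high_speed = [9.0, 10.0, 8.0, 11.0]

result = bland_altman(event_camera, high_speed)
print(result)
```

A bias CI that excludes zero indicates a systematic offset between the two methods, while the limits of agreement describe how far individual trials may differ.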

77. 【2605.27990】Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

Link: https://arxiv.org/abs/2605.27990

Authors: Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: conditions diffusion priors, Diffusion posterior sampling, sampling conditions diffusion, hand-tuned guidance weights, posterior sampling conditions

Comments: Code: [this https URL](https://github.com/Seunghyeok0715/CLAMP)

Abstract:Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss-Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

78. 【2605.27978】ABot-OCR Technical Report

Link: https://arxiv.org/abs/2605.27978

Authors: Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single forward pass, page image directly, clean Markdown, vision-language model, forward pass

Comments: 21 pages, 11 figures, technical report

Abstract:We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

79. 【2605.27962】Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective

Link: https://arxiv.org/abs/2605.27962

Authors: Cong Xu, Pu Luo, Yumei Li, Boyou Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: outdoor scenes degraded, targets semantic segmentation, weather conditions, paper describes, describes our approach

Abstract:This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap: models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points, less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.

80. 【2605.27960】Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Link: https://arxiv.org/abs/2605.27960

Authors: Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, complex background clutter, Multimodal Large, interpret images accurately

Abstract:Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

81. 【2605.27959】ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Link: https://arxiv.org/abs/2605.27959

Authors: Guannan Lv, Ren Nie, Hongjian Dou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Abstract:Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

82. 【2605.27952】Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

Link: https://arxiv.org/abs/2605.27952

Authors: Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Visual odometry, augmented reality, fundamental component, component in robotics, robotics and augmented

Comments: Submitted

Abstract:Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%-80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.

83. 【2605.27950】Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

Link: https://arxiv.org/abs/2605.27950

Authors: Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu, Edward Sazonov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: behavior change receptivity, behavior change, change receptivity, healthier eating habits, Accurately assessing dietary

Abstract:Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data included egocentric image sequences captured during eating and paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information contained any relevancy to these responses, we evaluated a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

84. 【2605.27938】SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images

Link: https://arxiv.org/abs/2605.27938

Authors: Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabled impressive, semantic, object models, deformable, SEMAGIC

Abstract:Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
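
The PCK@0.1 figure reported above is the percentage of correct keypoints at a 10% tolerance. A minimal sketch of the metric follows, assuming the common convention that a predicted keypoint is correct when it lies within `alpha * max(height, width)` of the object bounding box from the ground truth. The function name `pck` and the coordinates are illustrative, and the paper's exact evaluation protocol may differ.

```python
import math


def pck(pred, gt, bbox_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(bbox side) of ground truth."""
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Toy keypoints in pixels; the bounding box is 100 x 80, so the threshold is 10 px.
pred = [(10.0, 10.0), (50.0, 50.0), (90.0, 90.0)]
gt = [(12.0, 10.0), (50.0, 70.0), (90.0, 95.0)]

print(pck(pred, gt, bbox_size=(100.0, 80.0)))
```

Here the errors are 2, 20, and 5 pixels, so two of the three keypoints count as correct.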

85. 【2605.27932】When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Link: https://arxiv.org/abs/2605.27932

Authors: Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: remain poorly understood, implications remain poorly, reasoning is emerging, poorly understood, large vision-language models

Comments: 17 pages, 6 figures, 7 tables

Abstract:Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

86. 【2605.27927】Structure-Guided Visual Perturbation Neutralization for LVLMs

Link: https://arxiv.org/abs/2605.27927

Authors: Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng, Zhenhong Zhou, Fanyu Meng, Li Sun, Sen Su

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Vision Language, inputs enable Large, enable Large Vision, pixel-level attack surface, Vision Language Models

Abstract:Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source on this https URL.

87. 【2605.27924】SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

Link: https://arxiv.org/abs/2605.27924

Authors: Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text-driven image editing, image manipulation localization, manipulations requires image, requires image manipulation, Text-driven image

Abstract:Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We'll release the full codebase as soon as the paper is accepted.

88. 【2605.27923】Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

Link: https://arxiv.org/abs/2605.27923

Authors: Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)

Keywords: Support Vector Machine, Classical Support Vector, classical machine learning, Quantum Support Vector, Convolutional Neural Network

Abstract:The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200-500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

89. 【2605.27920】Rethinking Video-Language Model from the Language Input Perspective

Link: https://arxiv.org/abs/2605.27920

Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, Video-Language Models, language models, models, wave of large

Comments: Published in AAAI 2026

Abstract:Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by the specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive. 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. It is observed that given a video input, texts with similar semantics but different templates lead to various performances. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as the plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

90. 【2605.27916】OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.27916

Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, shown great potential

Abstract:The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

91. 【2605.27900】Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

Link: https://arxiv.org/abs/2605.27900

Authors: Yuting Ma, Lechao Cheng, Xiaohua Xu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pre-trained Vision-Language Models, Vision-Language Models, global task adaptation, task adaptation, pre-trained Vision-Language

Comments: This work has been accepted by ICML 2026

Abstract:Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging its strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic update and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning, where a supervised fine-tuning stage enables rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.

92. 【2605.27894】Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

Link: https://arxiv.org/abs/2605.27894

Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse computer vision, computer vision applications, demonstrated impressive multi-modal, impressive multi-modal reasoning, multi-modal reasoning capabilities

Comments: Published in AAAI 2026

Abstract:Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. However, real-world VLM applications might face challenges due to deactivated sensors (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boast training generalization-ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process the incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

93. 【2605.27893】SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

Link: https://arxiv.org/abs/2605.27893

Authors: Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Foundation, Foundation Models, impressive representational capabilities, demonstrated impressive representational

Abstract:Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

94. 【2605.27891】SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Link: https://arxiv.org/abs/2605.27891

Authors: Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: video fundamentally determines, fundamentally determines, determines its perceptual, video, narrative quality

Abstract:The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

95. 【2605.27885】Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

Link: https://arxiv.org/abs/2605.27885

Authors: Takuya Murakawa, Toru Tamaki

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Question Answering, adapt Vision-Language Models, Vision-Language Models, domains for Video, Question Answering

Comments: This paper serves as the technical report for the 1st Cross-Domain EgoCross Challenge @ EgoVis Workshop, CVPR 2026

Abstract:Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) -- a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.

96. 【2605.27884】A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning

Link: https://arxiv.org/abs/2605.27884

Authors: Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intelligent transportation systems, City-wide traffic forecasting, entire urban network, City-wide traffic, route guidance

Comments: 22 pages (double column), 7 tables, 11 figures

Abstract:City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.

97. 【2605.27852】ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

Link: https://arxiv.org/abs/2605.27852

Authors: Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently achieved remarkable, achieved remarkable success, diverse phenomena traditionally, visual effects, rendering processes

Abstract:Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- under a single model and achieves approximately 4-9× lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ~493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.

98. 【2605.27843】A self-supervised learning approach to deep filter banks for texture recognition

Link: https://arxiv.org/abs/2605.27843

Authors: Joao B. Florindo, Lucas O. Lyra, Antonio E. Fabris

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training frequently found, real-world applications, important challenge, limited amount, training frequently

Abstract:An important challenge in texture recognition is the limited amount of data for training frequently found in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is the use of a pretraining stage where the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is compacted within a delimited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, here we propose a framework where the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods in different texture databases, confirming its potential both in terms of classification accuracy and computational complexity.

99. 【2605.27823】Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Link: https://arxiv.org/abs/2605.27823

Authors: Xiang Fang, Wanlong Fang

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Language, bypass safety mechanisms, exploit semantic ambiguities, Language Models

Comments: Published in AAAI 2026

Abstract:Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85% while maintaining negligible impact on model performance. The framework's computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security on novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

100. 【2605.27817】Turning Video Models into Generalist Robot Policies

Link: https://arxiv.org/abs/2605.27817

Authors: Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: promising robotics backbone, robotics backbone, capable of generating, promising robotics, depict the completion

Comments: project page: [this https URL](https://vera.csail.mit.edu)

Abstract:Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

101. 【2605.27816】Pattern Recognition Tasks with Personalized Federated Learning

Link: https://arxiv.org/abs/2605.27816

Authors: Md. Arifur Rahman, Isha Das, Mushfiqur Rahman Abir, B. M. Taslimul Haque, Abdullah Al Noman, Abir Ahmed, Md. Jakir Hossen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Personalized Federated Learning, tailors Machine Learning, personalized model updates, standard Federated Learning, Federated Learning

Comments: Comprehensive comparative analysis of 7 Personalized Federated Learning algorithms across MNIST, SignMNIST, and Digit5 datasets. The paper presents detailed methodology, workflow architecture, experimental evaluation, and privacy-preserving AI analysis for distributed intelligent systems, secure collaborative learning, and critical infrastructure applications

Abstract:Personalized Federated Learning (PFL) constitutes a novel paradigm that tailors Machine Learning (ML) models to individual clients, thereby furnishing personalized model updates whilst upholding stringent data privacy principles. Diverging from conventional standard Federated Learning (FL) approaches, PFL adapts models to distinct client data distributions, engendering heightened levels of accuracy, customization, and data security, all while minimizing communication overhead. This methodology proves particularly salient in contexts marked by pattern recognition tasks reliant upon heterogeneous data sources and underpinned by paramount privacy apprehensions. In the present research endeavor, this article undertakes a comprehensive comparative analysis of seven distinct PFL algorithms deployed across three diverse datasets, namely MNIST, SignMNIST, and Digit5. The overarching objective entails ascertaining the preeminent PFL algorithm, within the framework of pattern recognition tasks, through a rigorous evaluation anchored in metrics encompassing Accuracy, Precision, Recall, and F1 Score. Concurrently, an in-depth scrutiny of these PFL algorithms is conducted, elucidating their operative workflows, advantages, and limitations. Through empirical investigation, the findings evince that APPLE, FedGC, and FedProto emerge as stalwart contenders, consistently furnishing superior performance across the spectrum of assessed datasets, while acknowledging the contextual specificity of alternative algorithms and the potential for iterative refinement to realize optimal outcomes.

102. 【2605.27813】Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Link: https://arxiv.org/abs/2605.27813

Authors: Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: models generate images, internal neural layers, neural layers produce, diffusion models generate, layers produce trajectories

Abstract:Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.
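
The residualization step described in this abstract (fit a predictor between neighboring timesteps, then keep the initial activation plus the unexplained residuals) can be illustrated with a minimal sketch. It is not the authors' implementation: to stay dependency-free it assumes a single least-squares scalar gain per timestep pair instead of a full linear map, and the names `fit_gain`, `residualize`, and `reconstruct` are hypothetical.

```python
def fit_gain(xs, ys):
    """Least-squares scalar a minimizing sum ||y - a*x||^2 over paired vectors.

    Simplifying assumption: a scalar gain stands in for a full linear map.
    """
    num = sum(x * y for xv, yv in zip(xs, ys) for x, y in zip(xv, yv))
    den = sum(x * x for xv in xs for x in xv)
    return num / den if den else 0.0


def residualize(trajs):
    """Encode each trajectory as [x_0, r_1, ..., r_{T-1}].

    trajs: list of trajectories, each a list of T activation vectors.
    r_{t+1} = x_{t+1} - gain_t * x_t is the part not explained by the
    fitted predictor between timestep t and t+1.
    """
    steps = len(trajs[0])
    gains = []
    for t in range(steps - 1):
        xs = [tr[t] for tr in trajs]
        ys = [tr[t + 1] for tr in trajs]
        gains.append(fit_gain(xs, ys))
    encoded = []
    for tr in trajs:
        rep = [list(tr[0])]
        for t in range(steps - 1):
            rep.append([y - gains[t] * x for x, y in zip(tr[t], tr[t + 1])])
        encoded.append(rep)
    return gains, encoded


def reconstruct(gains, rep):
    """Map a residualized representation back into activation space."""
    traj = [list(rep[0])]
    for t, g in enumerate(gains):
        traj.append([g * x + r for x, r in zip(traj[-1], rep[t + 1])])
    return traj


# Toy data: two trajectories of three 2-d activations each.
toy = [
    [[1.0, 2.0], [2.0, 4.0], [1.0, 2.0]],
    [[0.5, -1.0], [1.0, -2.0], [0.5, -1.0]],
]
toy_gains, toy_encoded = residualize(toy)
```

In this toy case every step is exactly a scaling of the previous one, so the residuals vanish and only the initial activation carries information; a sparse autoencoder trained on such a representation would see only what the fitted dynamics fail to predict.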

103. 【2605.27800】CuriosAI Submission to the CASTLE Challenge at EgoVis 2026

Link: https://arxiv.org/abs/2605.27800

Authors: Yuto Kanda, Hayato Tanoue, Takayuki Hori

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-view egocentric video, synchronized multi-view egocentric, multiple-choice questions, hours of synchronized, egocentric video

Comments: The 4th place solution for the CASTLE Challenge at the CVPR EgoVis Workshop 2026

Abstract:CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.

104. 【2605.27764】Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

Link: https://arxiv.org/abs/2605.27764

Authors: Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: couple large language, ground complex language, complex language expressions, large language models, Recent segmentation models

Abstract:Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often at the intent-level, which includes the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

105. 【2605.27761】AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

Link: https://arxiv.org/abs/2605.27761

Authors: Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

Subjects: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Keywords: leaving real-world closed-source, GUI foundation models, applications largely unevaluated, GUI foundation, spurred numerous evaluation

Comments: 11 pages, 6 figures. Preprint

Abstract:The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

106. 【2605.27750】Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Link: https://arxiv.org/abs/2605.27750

Authors: Antonia Karamolegkou,Nicolas Angleraud,Benoît Sagot,Thibault Clérice

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Recent work, optical character recognition, Ancient Greek critical, low-resource Ancient Greek, producing plausible Greek

Comments:

Click to view abstract

Abstract:Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
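
The idea of comparing image-conditioned and image-free decoding distributions can be illustrated with a small sketch. The abstract does not name the exact measure, so the per-token KL divergence used here is an assumption, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def grounding_scores(logits_with_image, logits_text_only):
    """One score per decoding step: how far the image shifts the next-token distribution.

    A score near zero suggests the token would have been produced from the
    language prior alone (assumed interpretation, not the paper's definition).
    """
    return [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_with_image, logits_text_only)
    ]

# Toy logits over a 3-token vocabulary for two decoding steps (made-up numbers).
with_image = [[2.0, 0.1, 0.1], [0.5, 0.4, 0.3]]
text_only = [[0.1, 2.0, 0.1], [0.5, 0.4, 0.3]]
scores = grounding_scores(with_image, text_only)
print(scores)
```

In this toy case the first token depends strongly on the image, while the second token has identical distributions with and without it, which is the pattern the abstract describes as prior reliance.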

107. 【2605.27748】Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

Link: https://arxiv.org/abs/2605.27748

Authors: Niccolò Ferrari,Oligert Osmani,Evelina Lamma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: defects are rare, system design, unavailable during system, visual anomaly detection, standard Euclidean geometry

Comments: 57 pages, 7 figures

Click to view abstract

Abstract:Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986.
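
The claim that Euclidean nearest-neighbour search after whitening implements Mahalanobis retrieval can be checked with a minimal pure-Python sketch. The ridge term, helper names, and toy data are assumptions for illustration; the paper's incremental dimensionality reduction, online covariance estimation, and memory-bank construction are not shown.

```python
import math

def mean_and_cov(points, ridge=1e-3):
    """Sample mean and ridge-regularised covariance of d-dimensional points."""
    n, d = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge  # regularisation keeps the matrix positive definite
    return mu, cov

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def whiten(x, mu, L):
    """Solve L z = x - mu by forward substitution."""
    z = []
    for i in range(len(x)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((x[i] - mu[i] - s) / L[i][i])
    return z

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Strongly correlated "normal" features (made-up numbers).
normal = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
mu, cov = mean_and_cov(normal)
L = cholesky(cov)
x, y = (1.0, 3.0), (3.0, 3.0)
d_whitened = euclidean(whiten(x, mu, L), whiten(y, mu, L))
print(euclidean(x, y), d_whitened)
```

Because the whitened difference is L^{-1}(x - y), its squared norm equals (x - y)^T Σ^{-1} (x - y), so any off-the-shelf Euclidean index applied to whitened embeddings ranks neighbours by Mahalanobis distance.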

108. 【2605.27737】Bounded-Compute Multimodal Regression for Product-Rating Prediction

Link: https://arxiv.org/abs/2605.27737

Authors: William Leach,Ru He,Sizhuo Ma,Yizhen Jia,Min Cao,Jian Wang,Rick Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strict latency budgets, multimodal quality assessment, autoregressive text generation, Efficient VLM challenge, quality assessment

Comments: Accepted to the LoViF Workshop at CVPR 2026. 8 pages, 2 figures

Click to view abstract

Abstract:Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
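
Replacing a language-modeling head with a small regression head can be sketched as follows. This is a toy illustration in plain Python, not the authors' implementation: mean pooling, the ReLU activation, and all weights and sizes are assumptions, since the abstract only states that a two-layer MLP is fed by pooled decoder states.

```python
def mean_pool(hidden_states):
    """Average a [seq_len][hidden_dim] list of decoder states over the sequence axis."""
    seq_len = len(hidden_states)
    return [sum(col) / seq_len for col in zip(*hidden_states)]

def linear(x, weight, bias):
    """y = W x + b, with weight stored as [out_dim][in_dim]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def rating_head(hidden_states, w1, b1, w2, b2):
    """Two-layer MLP on pooled states; returns one scalar instead of generated text."""
    pooled = mean_pool(hidden_states)
    return linear(relu(linear(pooled, w1, b1)), w2, b2)[0]

# Toy example: 3 tokens, hidden size 2, MLP width 2 (all numbers are made up).
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 0.5]], [1.0]
score = rating_head(states, w1, b1, w2, b2)
print(score)
```

The design point is that one forward pass yields the score directly, so inference cost no longer depends on how many output tokens an autoregressive decoder would have generated.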

109. 【2605.27736】Explicit Critic Guidance for Aligning Diffusion Models

Link: https://arxiv.org/abs/2605.27736

Authors: Zhengyang Liang,Qihang Zhang,Ceyuan Yang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Online reinforcement learning, Online reinforcement, aligning diffusion models, non-differentiable objectives, reinforcement learning

Comments:

Click to view abstract

Abstract:Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.

110. 【2605.27726】Asynchronous Remote Sensing Time-Series Fusion for Cloud Removal and Anytime Reconstruction

Link: https://arxiv.org/abs/2605.27726

Authors: Forouzan Fallah,Chia Yu Hsu,Wenwen Li,Anna Liljedahl,Yezhou Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Frequent cloud cover, cover severely limits, Earth surface monitoring, cloud cover severely, Earth surface

Comments: CVPR 2026 MORSE Workshop

Click to view abstract

Abstract:Frequent cloud cover severely limits the usability of Sentinel-2 (S2) optical time series for Earth surface monitoring. Sentinel-1 (S1) SAR provides all-weather complementary observations, but practical S1/S2 fusion remains difficult because acquisitions are irregular and asynchronous. Many existing approaches assume temporally aligned inputs (or require external nearest-date matching) and typically restore only observed timestamps, limiting reconstruction under long gaps and preventing on-demand synthesis. We propose AGFlow (Time Aligned Generative Flow Matching), a spatiotemporal flow-matching model for S1/S2 cloud removal and time-series reconstruction with three capabilities: (1) timestamp-conditioned internal alignment that fuses asynchronous S1 and cloudy S2 observations without preprocessing-based pairing; (2) spatiotemporal, context-aware denoising that models spatial structure jointly with temporal dynamics (rather than independent per-pixel time series); and (3) anytime querying, enabling generation of cloud-free S2 frames at both observed and user-specified timestamps within the monitoring window. We evaluate on the RESTORE-DiT benchmark protocol with quantitative metrics, qualitative comparisons, and component ablations. AGFlow notably improves fully missing-frame reconstruction (MAE and RMSE reduce by 16-19% over RESTORE-DiT) and provides reliable reconstructions under persistent gaps, while also yielding competitive cloud removal performance and flexible temporal querying for downstream tasks such as dense vegetation monitoring.

111. 【2605.27696】Structure over Pixels: Learning Variable-Length Visual Programs

Link: https://arxiv.org/abs/2605.27696

Authors: Piotr Wyrwiński,Kacper Dobek,Krzysztof Krawiec

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: visual tokenizers translate, Discrete visual tokenizers, discrete visual tokenizer, tokenizers translate images, providing a natural

Comments:

Click to view abstract

Abstract:Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate--distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

112. 【2605.27686】Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Link: https://arxiv.org/abs/2605.27686

Authors: Kabir Swain,Sijie Han,Daniel Karl I. Weidele,Mauro Martino,Antonio Torralba

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: long token sequences, Transformers process images, flattening space, space and time, time into long

Comments:

Click to view abstract

Abstract:Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
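
The soft write described above, depositing content as a Gaussian-weighted volume around a continuous 3D location, might look roughly like this in plain Python. The scalar content, the fixed `sigma`, the weight normalisation, and the function name are assumptions; a real module would write feature vectors with a tensor library so that the operation stays differentiable.

```python
import math
from itertools import product

def soft_write(grid, location, value, sigma=0.75):
    """Add `value` to every voxel, weighted by a normalised Gaussian centred at `location`.

    grid: nested list [X][Y][Z] of floats, modified in place.
    location: continuous (x, y, z) in voxel coordinates.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    weights = {}
    for idx in product(range(nx), range(ny), range(nz)):
        sq_dist = sum((i - c) ** 2 for i, c in zip(idx, location))
        weights[idx] = math.exp(-sq_dist / (2 * sigma ** 2))
    total = sum(weights.values())
    for (i, j, k), w in weights.items():
        grid[i][j][k] += value * w / total  # weights sum to 1, so total mass is `value`
    return grid

# A 4x4x4 memory that stays the same size however many tokens write into it.
memory = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
soft_write(memory, location=(1.2, 2.0, 0.4), value=1.0)
total_mass = sum(v for plane in memory for row in plane for v in row)
print(total_mass, memory[1][2][0])
```

Since the grid has a fixed shape, the state size is independent of sequence length, which is the decoupling of capacity from input length that the abstract emphasises.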

113. 【2605.27616】Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Link: https://arxiv.org/abs/2605.27616

Authors: Zijian Du,Oleg Rybakov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: efficient low-precision inference, Real-time anomaly segmentation, Real-time anomaly, low-precision inference, demands both high

Comments:

Click to view abstract

Abstract:Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.

114. 【2605.27595】Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

Link: https://arxiv.org/abs/2605.27595

Authors: Partho Ghose,Al Bashir,Prem Raj,Azlan Zahid

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, agricultural imaging applications, rapidly adopted

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations, that is, outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLaVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent); few-shot prompting improved performance up to 86.8 percent, but the models still exhibited false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.

115. 【2605.27590】ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes

Link: https://arxiv.org/abs/2605.27590

Authors: Zihang Cheng,Duanchu Wang,Cheng Li,Jing Huang,Huanzhao Fu,Di Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Remote sensing question, sensing question answering, analysis involves multi-step, Remote sensing, direct semantic prediction

Comments: 14 pages, 5 figures, 4 tables

Click to view abstract

Abstract:Remote sensing question answering (RS-QA) often requires more than direct semantic prediction, especially in large-scale forest scenes where ecological analysis involves multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence. We introduce ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest environments. It represents multimodal NEON forest scenes as ecological hypergraphs, where tree instances, spatial units, semantic groups, and neighborhood relations support higher-order reasoning beyond pairwise scene graphs. An LLM-guided agent then invokes deterministic tools for reading, filtering, expansion, aggregation, comparison, and auditing, producing replayable execution traces and compact evidence records rather than only free-form answers. We further construct ForestTraceQA, an executable benchmark for evaluating ecological QA across diverse task types and reasoning depths. Experiments show that ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, while highlighting execution depth as the main bottleneck for long-horizon ecological QA.

116. 【2605.27589】What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

Link: https://arxiv.org/abs/2605.27589

Authors: Kunlin Cai,Rui Song,Jinghuai Zhang,Kaiyuan Zhang,Pranav Bodapati,Alicia Yu,Fnu Suya,Mohammad Rostami,Jiaqi Ma,Yuan Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video generation models, simulators for tasks, Video generation, world simulators, models

Comments: 38 pages, World Model Benchmark

Click to view abstract

Abstract:Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

117. 【2605.27582】Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Link: https://arxiv.org/abs/2605.27582

Authors: Hongyu Ding,Sizhuo Zhang,Ziming Xu,Jinwen Guo,Hongxiu Liu,Xingzhi Cheng,Zixuan Chen,Haifei Qi,Duo Wang,Hao Xu,Jieqi Shi,Yifan Zhang,Jing Huo,Jian Cheng,Yang Gao,Jiebo Luo

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied navigation requires, Embodied navigation, stream of spatial, spatial actions, Embodied

Comments: Project page: [this https URL](https://xetroubadour.github.io/Uni-LaViRA/)

Click to view abstract

Abstract:Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation. The language action emits a semantic-level directional command and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. Therefore, we present Uni-LaViRA, a unified agentic architecture that extends the same insight to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (Wheeled, Quadruped, Humanoid robot, and a self-built UAV) in a zero-shot manner. Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent's most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent's next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches 60.7% SR on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training-based navigation foundation models that consume millions of samples and thousands of GPU-hours.

118. 【2605.27561】Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Link: https://arxiv.org/abs/2605.27561

Authors: Elena Sergeevna Kozachok,Sergey Sergeevich Seregin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: clinical decision support, dermoscopy clinical decision, Mobile dermoscopy clinical

Comments: 24 pages, 6 figures, 5 tables, 21 references

Click to view abstract

Abstract:Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, with model interpretability and standardised patient routing remaining key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four "Melanoma Day" sessions (Orel, Russia, June 2025 - April 2026). Results. On 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P<0.15 / 0.15-0.50 / >=0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.
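
Two of the reported quantities are easy to reproduce in a short sketch. The wide interval for 5 detected lesions out of 5 is consistent with an exact (Clopper-Pearson) binomial interval, although the abstract does not name the method, so that choice is an assumption. The IoU helper operates on already binarised masks; how the activation maps were thresholded is not stated, and the toy masks are made up.

```python
def exact_ci_all_successes(n, confidence=0.95):
    """Clopper-Pearson interval for the special case of n successes in n trials.

    With no failures, the lower bound has the closed form (alpha/2)^(1/n)
    and the upper bound is 1.
    """
    alpha = 1.0 - confidence
    return (alpha / 2) ** (1.0 / n), 1.0

def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0

# Sensitivity: 5 of 5 malignant lesions flagged.
low, high = exact_ci_all_successes(5)
print(f"{100 * low:.1f}-{100 * high:.1f}%")

# Toy agreement between a binarised attention map and an expert annotation.
attention = [[1, 1, 0], [0, 1, 0]]
expert = [[0, 1, 1], [0, 1, 0]]
overlap = iou(attention, expert)
print(overlap)
```

The lower bound of roughly 48% shows why the authors describe the validation as preliminary: with only five malignant cases, zero false negatives still leaves considerable uncertainty about sensitivity.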

119. 【2605.27495】Representation-Conditioned Diffusion Models for Guided Training Data Generation

Link: https://arxiv.org/abs/2605.27495

Authors: Nithesh Chandher Karthikeyan,Jonas Unger,Gabriel Eilertsen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Data availability remains, deep learning applications, availability remains, remains a critical, critical bottleneck

Comments:

Click to view abstract

Abstract:Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.

120. 【2605.27487】Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Link: https://arxiv.org/abs/2605.27487

Authors: Andrii Ahitoliev,Pavlo Berezin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Handwritten text generation, text generation, leaving open, existing models generalise, Latin scripts

Comments: 16 pages, 7 figures. Submitted to ICTERI 2026

Click to view abstract

Abstract:Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

121. 【2605.27467】Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

Link: https://arxiv.org/abs/2605.27467

Authors: Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional Recurrent Neural, Long Short-Term Memory, Recurrent Neural Networks, discrete time steps, real-world physical processes

Comments: 9 pages, 7 figures, 6 tables. The conference paper will appear in Proceedings of JCSSE 2026

Click to view abstract

Abstract:Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
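
The abstract does not say how the temporal dropout stress test is implemented. A minimal sketch, assuming it means discarding each time step independently with a fixed probability, could look like the following. The function name `temporal_dropout` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def temporal_dropout(sequence, drop_prob, seed=None):
    """Drop each time step independently with probability drop_prob.

    Returns (time_index, value) pairs so that the surviving samples keep
    their original time stamps. This is an illustrative assumption, not
    the paper's protocol.
    """
    rng = random.Random(seed)
    return [(t, x) for t, x in enumerate(sequence) if rng.random() >= drop_prob]


# Example: thin out a 100-step signal by roughly half.
signal = [0.1 * t for t in range(100)]
sparse = temporal_dropout(signal, drop_prob=0.5, seed=0)
```

Keeping the time index next to each surviving value matters for this comparison: a continuous-time model can use the irregular gaps directly, whereas a discrete-step model only sees a shorter sequence.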

122. 【2605.27465】AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

Link: https://arxiv.org/abs/2605.27465

Authors: Semi Lee, Hyejin Go, Hyesong Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision Transformers, constitutes a fundamental, practical deployment, motivating a vibrant, quadratic cost

Comments: 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

Click to view abstract

Abstract:The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

123. 【2605.27464】Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

Link: https://arxiv.org/abs/2605.27464

Authors: Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Inertial Measurement Unit, head-mounted Inertial Measurement, Measurement Unit, Inertial Measurement, offer proactive assistance

Comments:

Click to view abstract

Abstract:AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at this https URL.

124. 【2605.27460】D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

Link: https://arxiv.org/abs/2605.27460

Authors: Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin, Xun Liu, Peng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Single-frame atmospheric turbulence, inherently ill-posed due, spatially varying blur, varying blur coupled, Single-frame atmospheric

Comments: 14 pages, 7 figures

Click to view abstract

Abstract:Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at this https URL.

125. 【2605.27458】Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Link: https://arxiv.org/abs/2605.27458

Authors: Yongjin Cui, Xiaohui Fan, Huajun Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: heterogenous attention structures, attention structures, heterogenous attention, propelled the development, development of artificial

Comments:

Click to view abstract

Abstract:The Transformer has significantly propelled the development of artificial intelligence, including the development of agents. We categorize the attention structures of the Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. They are the foundation that allows Transformer models to achieve more complex functions and integrate information from more modalities. Whether for research purposes or policy requirements, interpreting Transformer models with heterogenous attention structures is an important task, and the fusion of information from different sources brings new challenges. Our work has two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models and conduct semantic and logical interpretation.

126. 【2605.27452】Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

Link: https://arxiv.org/abs/2605.27452

Authors: Takato Yasuno

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Japan requires mandatory, consistent infrastructure management, requires mandatory visual, mandatory visual assessments, Japan requires

Comments: 23 pages, 11 figures, 13 tables

Click to view abstract

Abstract:Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability -- a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining this http URL() and batch processing (batch_size=8) achieves 10.06 seconds per image -- a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.

ACM classes: I.5.4; I.2.7; J.2

127. 【2605.27451】From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop Competition

Link: https://arxiv.org/abs/2605.27451

Authors: Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Eric Granger, Marco Pedersoli, Simon Bacon, Jens Madsen, Soufiane Belharbi, Muhammad Haseeb Aslam, Chunchang Shao, Guanyu Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: held at CVPR, unconstrained environments, Affective Behavior Analysis, advance research, ABAW Competition introduces

Comments: accepted at CVPR 2026

Click to view abstract

Abstract:The 10th Affective Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on the modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion and behavior estimation, affect modelling, multimodal learning, benchmarks, datasets and evaluation protocols, fairness, robustness and deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

128. 【2605.27436】RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

Link: https://arxiv.org/abs/2605.27436

Authors: Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal alignment, bridging the semantic, semantic gap, gap in information, Multimodal

Comments:

Click to view abstract

Abstract:Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at this https URL.
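
To make the geometric objective concrete, here is a minimal sketch of the area of the triangle whose vertices are three unit-normalized modality embeddings. The abstract does not give the exact loss, so this is an assumption about the quantity being minimized. The helper names `normalize` and `triangle_area` are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def triangle_area(a, b, c):
    """Euclidean area of the triangle with vertices a, b, c (any dimension).

    Uses the Gram determinant of the two edge vectors u = b - a, v = c - a:
    area = 0.5 * sqrt(|u|^2 |v|^2 - (u . v)^2).
    """
    u = [bi - ai for ai, bi in zip(a, b)]
    v = [ci - ai for ai, ci in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    return 0.5 * math.sqrt(max(uu * vv - uv * uv, 0.0))


# Three mutually orthogonal embeddings (e.g. text, video, audio) are far apart.
text, video, audio = normalize([2, 0, 0]), normalize([0, 3, 0]), normalize([0, 0, 1])
area = triangle_area(text, video, audio)  # sqrt(3) / 2, about 0.866
```

The area shrinks to zero only when all three embeddings are collinear, which on the unit sphere generally requires them to coincide. A single scalar of this kind therefore constrains the video-audio pair as well as the two text-anchored pairs.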

129. 【2605.27378】OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

Link: https://arxiv.org/abs/2605.27378

Authors: Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: supporting accurate diagnosis, image analysis plays, plays a pivotal, pivotal role, role in supporting

Comments: 14 pages, 7 figures, 6 tables

Click to view abstract

Abstract:Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at this https URL.

130. 【2605.28697】Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Link: https://arxiv.org/abs/2605.28697

Authors: Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén, Sigve Karlsen, Thor Edvardsen, Harald Brunvand, Md Abulkalam Azad, Havard Dalen, Bjørnar Grenne, Gabriel Kiss, Pierre-Yves Courand, Lasse Lovstakken, Pierre-Marc Jodoin, Olivier Bernard

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Speckle tracking echocardiography, tracking echocardiography, standard for myocardial, myocardial strain estimation, STE

Comments: 10 pages

Click to view abstract

Abstract:Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.

131. 【2605.27796】Benchmarking Ultrasound Foundation Models for Fetal Plane Classification

Link: https://arxiv.org/abs/2605.27796

Authors: Leya Barrientos, Yuexi Du, Nicha C. Dvornek

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)

Keywords: obstetric care due, real-time imaging, obstetric care, care due, Ultrasound

Comments:

Click to view abstract

Abstract:Ultrasound is widely used in obstetric care due to its safety, accessibility, and real-time imaging. However, interpretation remains operator-dependent and susceptible to noise and artifacts. Deep learning models have shown strong performance in addressing these problems, but they typically require large annotated datasets that are difficult to obtain in clinical ultrasound. Foundation models (FMs) offer an alternative, using a large number of ultrasound images to learn transferable representations that can generalize with limited labeled data. This work presents a comprehensive benchmark of ultrasound-specific FMs for fetal plane classification. We evaluated four ultrasound FMs (USFM, MOFO, UltraSAM, FetalCLIP) against two CNN baselines (ResNet50, EfficientNet-V2) and a ViT (DINOv3) pretrained on natural images. We trained all models under two complementary settings: full fine-tuning and linear probing with a frozen encoder. All models were trained using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset and tested on both in-domain data and an external African cohort to assess cross-population generalization. We found that FetalCLIP achieved the best results in the linear probing setting (F1 = 0.9261 for in-domain, F1 = 0.9731 for out-of-domain), while USFM performed best in the full fine-tuning setting (F1 = 0.9476 for in-domain, F1 = 0.9515 for out-of-domain). MOFO and UltraSAM degraded most in both settings, underperforming natural image pretrained models in some cases. These findings highlight how the choice of pretrained model strongly affects fetal plane classification performance, since different pretraining objectives lead to different levels of transferability.

132. 【2605.27679】On the Equivariant Learning of the $Q$-tensor Order Parameter

Link: https://arxiv.org/abs/2605.27679

Authors: Julia Navarro, Mark Wilkinson

Categories: Soft Condensed Matter (cond-mat.soft); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: generated microscopic textures, evaluate group-equivariant neural, group-equivariant neural networks, nematic liquid crystals, synthetically generated microscopic

Comments: 15 pages (excluding 7-page appendix); 6 figures

Click to view abstract

Abstract:We construct and evaluate group-equivariant neural networks for the prediction of the two-dimensional $Q$-tensor order parameter of nematic liquid crystals from synthetically generated microscopic textures. Seven architectures, equivariant to cyclic groups $C_k$ of order $k$ for $k=4,\,8,\,16,\,32,\,64,\,128,\, 256$, are built using a combination of weight-sharing constraints, equivariant activations and regularization techniques. To do this, we construct rotation-like permutation matrix groups with elements $\varrho_{C_k}(g)$ that act on row-wise vectorized images, thereby approximating a $\frac{2\pi}{k}$ rotation of the circular subdomain on square images. We show that all seven equivariant models satisfy the $Q$-tensor equivariance constraint to within single-precision floating point accuracy. Comparing against approximate parameter-matched non-equivariant benchmarks, with and without data augmentation, we find that the equivariant models consistently achieve lower errors and generalize more robustly to unseen defect configurations. Performance increases with group order, suggesting that the incorporation of finer rotational symmetry leads to lower errors.
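
For readers unfamiliar with the $Q$-tensor, the equivariance constraint mentioned above can be written out as follows. This uses the standard two-dimensional convention with scalar order parameter $S$ and unit director $n$; the paper's own normalization may differ. The symbols $f$ (the network), $x$ (the vectorized image) and $R_g$ (the $2\times 2$ rotation matrix associated with the group element $g$) are notation assumed here, not taken from the abstract.

$$
Q = S\left(n \otimes n - \tfrac{1}{2} I\right), \qquad n \mapsto R\,n \;\Longrightarrow\; Q \mapsto R\,Q\,R^{\top},
$$

$$
f\big(\varrho_{C_k}(g)\,x\big) = R_g\, f(x)\, R_g^{\top} \quad \text{for all } g \in C_k .
$$

The first line says that rotating the director rotates $Q$ by conjugation. The second is the matching requirement on a network that predicts $Q$ from an image: permuting the pixels with $\varrho_{C_k}(g)$ must conjugate the predicted tensor by the corresponding rotation.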

133. 【2605.27454】NL-MambaXCT: Self-Supervised Nested-Learning Mamba for Nomex Honeycomb X-ray CT Defect Classification

Link: https://arxiv.org/abs/2605.27454

Authors: Ghaleb Aldoboni, Lobna Nassar, Fakhri Karray, Reem Alshamsi

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: X-ray computed tomography, limited labeled data, X-ray computed, supervised models trained, computed tomography

Comments:

Click to view abstract

Abstract:X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11--10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.
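
The "slow exponential-moving-average traces alongside fast weights" can be pictured with a generic EMA update. This is only a sketch of the general technique: the abstract does not give the decay value or say how the traces feed back into the model, and the name `ema_update` and the decay of 0.99 are assumptions.

```python
def ema_update(slow, fast, decay):
    """One EMA step: the slow trace moves a fraction (1 - decay) toward fast."""
    return [decay * s + (1.0 - decay) * f for s, f in zip(slow, fast)]


# Fast weights jump around between optimizer steps; the slow trace lags behind.
slow_trace = [0.0, 0.0]
for fast_weights in ([1.0, -1.0], [1.2, -0.8], [0.9, -1.1]):
    slow_trace = ema_update(slow_trace, fast_weights, decay=0.99)
```

A decay close to 1 makes the trace change on a much longer timescale than the weights updated at every step, which is the two-timescale behaviour the abstract describes.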