本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新711篇论文,其中:
- 自然语言处理138篇
- 信息检索18篇
- 计算机视觉141篇
自然语言处理
1. 【2605.27366】MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
链接:https://arxiv.org/abs/2605.27366
作者:Huawei Lin,Peng Li,Jie Song,Fuxin Jiang,Tieying Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Large language model, Large language, solve complex tasks, language model, rely on reusable
备注: 30 pages, 8 figures, 13 tables, working in progress
点击查看摘要
Abstract:Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.
2. 【2605.27358】MobileMoE: Scaling On-Device Mixture of Experts
链接:https://arxiv.org/abs/2605.27358
作者:Yanbei Chen,Hanxian Huang,Ernie Chang,Jacob Szwejbka,Digant Desai,Zechun Liu,Vikas Chandra,Raghuraman Krishnamoorthi
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:remain largely unexplored, deployment remain largely, largely unexplored, language models, remain largely
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.
3. 【2605.27355】Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
链接:https://arxiv.org/abs/2605.27355
作者:Dongyoon Hahm,Dylan Hadfield-Menell,Kimin Lee
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:align Large Language, Large Language Models, Large Language, Human Feedback, align Large
备注: Accepted at ICML 2026, Source code: [this https URL](https://alignment-tampering.github.io/)
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: this https URL
4. 【2605.27354】Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
链接:https://arxiv.org/abs/2605.27354
作者:Yi Jing,Zao Dai,Jinwu Hu,Zijun Yao,Lei Hou,Juanzi Li,Xiaozhi Wang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:encode rich information, internals encode rich, ignores rich intrinsic, engineering largely relies, large language model
备注:
点击查看摘要
Abstract:Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
5. 【2605.27345】MATCHA: Matching Text via Contrastive Semantic Alignment
链接:https://arxiv.org/abs/2605.27345
作者:Siran Li,Ece Sena Etoglu,Carsten Eickhoff,Seyed Ali Bahrainian
类目:Computation and Language (cs.CL)
关键词:Reliable evaluation, understanding large language, today go-to metrics, evaluation is essential, essential for understanding
备注:
点击查看摘要
Abstract:Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, where no embedding-based metrics could locally train on), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (this https URL).
6. 【2605.27338】2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
链接:https://arxiv.org/abs/2605.27338
作者:Andrea Cuteri,Giuseppe Mazzotta,Francesco Ricca
类目:Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
关键词:Answer Set Programming, extends Answer Set, Set Programming, ASP, extends Answer
备注:
点击查看摘要
Abstract:ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we provide a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and the analysis of nontrivial cases that have not been addressed in previous works. On the practical side, we introduce novel strategies for computing (optimal) quantified answer sets in the Casper system, that rely on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from different application domains shows that the proposed techniques are effective in practice.
7. 【2605.27333】FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents
链接:https://arxiv.org/abs/2605.27333
作者:Haoxuan Jia,Yang Liu,Bin Chong,Yingguang Yang,Yancheng Chen,Jiayu Liang,Qian Li,Hanning Lu,Kefu Xu,Hao Zheng,Chongyang Zhang,Hao Peng,Philip S. Yu
类目:Computation and Language (cs.CL)
关键词:multi-step business workflows, simultaneously block prompt-induced, block prompt-induced unauthorized, prompt-induced unauthorized actions, legitimate multi-step business
备注:
点击查看摘要
Abstract:Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination -- too late for intervention and at a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end-to-end with three components: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per-step risk and adaptively routes verification between a lightweight and an advanced-tier LLM judge. Fired risk factors are re-injected into the agent input as ex-ante evidence, enabling the agent to refuse, re-plan, or approve on its own. On FinVault, routed FinHarness cuts ASR from 38.3% to 15.0% while largely preserving benign approval ($41.1\% \to 39.3\%$), and uses $4.7\times$ fewer advanced-judge calls than an always-advanced ablation.
8. 【2605.27322】Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech
链接:https://arxiv.org/abs/2605.27322
作者:Felix Ostrowicki,Hubert Plisiecki
类目:Computation and Language (cs.CL)
关键词:Supervised Semantic Differential, extension of Supervised, semantic meaning varies, introduce interaction SSD, Supervised Semantic
备注:
点击查看摘要
Abstract:We introduce interaction SSD, an extension of Supervised Semantic Differential that models how semantic meaning varies across moderators such as groups, traits, or conditions making this variation testable and interpretable. The method estimates a main semantic gradient, an interaction gradient, and conditional gradients, all interpretable through standard SSD tools. We illustrate it on the UC Berkeley Measuring Hate Speech corpus, testing whether annotator racial identity moderates hate-speech judgments of comments targeting people of color. The interaction model detects a significant moderation effect: the shared gradient contrasts dehumanizing hostility with counter-speech, while the interaction gradient reveals smaller group-linked differences in which semantic cues predict hate-speech ratings. Interaction SSD makes moderated meaning-outcome relationships statistically testable and interpretable.
9. 【2605.27315】Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery
链接:https://arxiv.org/abs/2605.27315
作者:Yifan Jiang,Ruoxi Ning,Sheng Yao,Freda Shi
类目:Computation and Language (cs.CL)
关键词:improve language understanding, assumed to improve, improve language, language understanding, understanding in multimodal
备注:
点击查看摘要
Abstract:Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.
10. 【2605.27313】When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection
链接:https://arxiv.org/abs/2605.27313
作者:Weibin Cai,Reza Zafarani
类目:Computation and Language (cs.CL)
关键词:hate speech detection, speech detection, benefit is inconsistent, perspectives in subjective, subjective tasks
备注:
点击查看摘要
Abstract:Demographic information is often used to model annotator perspectives in subjective tasks such as hate speech detection, but its benefit is inconsistent: it improves performance in some settings and behaves as noise in others. This paper asks when demographic features help. We analyze demographic gain as a function of both data split properties and modeling frameworks. For data splits, we measure annotator disagreement, namely how often annotators assign different labels to the same example, along with training size and train-test demographic coverage. We find that demographic gains concentrate in regimes with low training disagreement, high test disagreement, fine-grained ambiguity measurement, sufficient training data, and greater demographic overlap. Motivated by these regimes, we introduce a gated demographic residual model that treats demographics as a selective adjustment to text-only predictions. Experiments on MHS and POPQUORN show that this design is effective, especially on high disagreement or low confidence examples. Overall, our results suggest that demographics should not be assumed useful by default; their value depends jointly on the data regime and the modeling framework.
11. 【2605.27311】Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
链接:https://arxiv.org/abs/2605.27311
作者:Yifan Jiang,Dae Yon Hwang,Jesse C. Cresswell,Freda Shi
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:visual reasoning, benchmarks aim, background knowledge, aim to pose, pose questions
备注:
点击查看摘要
Abstract:Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
12. 【2605.27298】Self-Ensembling Vision-Language Models for Chart Data Extraction
链接:https://arxiv.org/abs/2605.27298
作者:Thomas Berkane,Qianyi Wang,Maimuna S. Majumder
类目:Computation and Language (cs.CL)
关键词:convey quantitative information, effectively convey quantitative, Charts effectively convey, quantitative information, effectively convey
备注:
点击查看摘要
Abstract:Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic variation. We propose a VLM self-ensembling method that repeatedly samples multiple tabular outputs from the same VLM for a fixed chart image and aggregates them at the level of individual table cells. We align candidate tables and take per-cell medians over numerical values to produce a more accurate consensus table. Our method also includes convergence detection to stop sampling once the aggregated table stabilizes, and uncertainty estimation based on dispersion across samples to help users assess extraction reliability. Because existing chart extraction benchmarks contain relatively simple plots with limited room for improvement, we introduce WB-ChartExtract, a new benchmark built from World Bank data with more complex and stylistically diverse charts; on average, its charts contain 7 times more datapoints than those in the ChartQA benchmark. Across both ChartQA and WB-ChartExtract, our approach improves extraction accuracy over single-pass VLM outputs, yielding up to 23% relative improvement on WB-ChartExtract after ensembling. More broadly, our method helps unlock tabular data previously siloed in chart images, enabling downstream analysis and reuse.
13. 【2605.27296】Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics
链接:https://arxiv.org/abs/2605.27296
作者:Jiashuo Wang,Fenggang Yu,Jian Wang,Chak Tou Leong,Xiaoyu Shen,Chunpu Xu,Jiawen Duan,Wenjie Li,Johan F. Hoorn
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, diverse cultural contexts, evoke cultural resonance
备注: IJCAI 2026 Human-Centred AI track
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4STYLI, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs via the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure.
14. 【2605.27294】Separating Semantic Competition from Context Length in RAG Reading
链接:https://arxiv.org/abs/2605.27294
作者:Vyzantinos Repantis,Ameya Gawde,Harshvardhan Singh,Rohit Alekar,Cien Zhang,Svetlana Karslioglu,Akash Vishwakarma
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Retrieval-augmented generation, systems can respond, respond incorrectly, Retrieval-augmented, passages
备注: 4 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.
15. 【2605.27288】It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty
链接:https://arxiv.org/abs/2605.27288
作者:Kevin H. Guo,Chao Yan,Avinash Baidya,Katherine Brown,Xiang Gao,Juming Xiong,Zhijun Yin,Bradley A. Malin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, Large language, stance to conform, model epistemic uncertainty, conformity
备注:
点击查看摘要
Abstract:Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.
16. 【2605.27276】SIA: Self Improving AI with Harness Weight Updates
链接:https://arxiv.org/abs/2605.27276
作者:Prannay Hebbar,Yogendra Manawat,Samuel Verboomen,Alesia Ivanova,Selvam Palanimalai,Kunal Bhatia,Vignesh Baskaran
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:building and improving, Humans, task-specific agent, model, held fixed
备注:
点击查看摘要
Abstract:Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
17. 【2605.27268】Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)
链接:https://arxiv.org/abs/2605.27268
作者:Samer Awad,Javier Conde,Carlos Arriaga,Tairan Fu,Javier Coronado-Blázquez,Pedro Reviriego
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Modern Large Language, vast latent vocabularies, Modern Large, possessing vast latent, Large Language Models
备注: 15 pages, 6 figures
点击查看摘要
Abstract:Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
18. 【2605.27255】Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs
链接:https://arxiv.org/abs/2605.27255
作者:Wenhui Tan,Minghao Li,Xiaoqian Ma,Siqi Fan,Xiusheng Huang,Liujie Zhang,Ruihua Song,Weihang Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, modern large language, made autoregressive decoding, dominant inference cost, reasoning has made
备注: Project Page: [this http URL](http://GitHub.com/AlbertTan404/PIPO)
点击查看摘要
Abstract:Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing methods target either the input side (latent compression) or the output side (speculative decoding and multi-token prediction, MTP), but the two lines of work have been pursued independently. Moreover, output-side methods must incur an expensive verifier pass to validate the unreliable draft tokens predicted by MTP. To address these issues, we propose \textbf{Pair-In, Pair-Out (PIPO)}, which unifies both sides by viewing a latent compressor and an MTP head as mirror-image operations: the compressor folds two input tokens into one latent representation, while the MTP head unfolds one hidden state into one additional output token. To remove the verifier cost without sacrificing reliability, PIPO trains a lightweight confidence head that decides whether draft tokens should be accepted. We observe that On-Policy Distillation (OPD) naturally matches the rejection-sampling criterion of speculative decoding, so the confidence head can be trained alongside OPD with negligible extra cost. Experiments on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 with Qwen3.5-4B and 9B backbones show that PIPO improves pass@4 over regular decoding by up to $+7.15$ points, while delivering up to $2.64\times$ first-token-latency and $2.07\times$ per-token-latency speedups.
19. 【2605.27249】Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
链接:https://arxiv.org/abs/2605.27249
作者:Hunter McNichols,Alexander Scarlatos,Mihai Dascalu,Danielle McNamara,Andrew Lan
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:effective method, method of teaching, teaching across disciplines, Large Language Models, high-quality work
备注: preprint
点击查看摘要
Abstract:An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $\beta$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.
20. 【2605.27240】ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents
链接:https://arxiv.org/abs/2605.27240
作者:Xing Fu,Yulin Hu,Mengtong Ji,Haozhen Li,Yixin Sun,Weixiang Zhao,Yanyan Zhao,Bing Qin
类目:Computation and Language (cs.CL)
关键词:users' latent emotional, Memory-augmented language agents, Emotional Need-aware Proactive, increasingly deployed, deployed in affective
备注:
点击查看摘要
Abstract:Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users' emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users' latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow's hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.
21. 【2605.27239】mporal Simultaneity Predicts Annotation Quality in Sentiment Corpora
链接:https://arxiv.org/abs/2605.27239
作者:Idris Abdulmumin,Mokgadi Penelope Matloga,Tadesse Destaw Belay,Botshelo Kondowe,Letlhogonolo Mohleleng,Hareaipha Nkopo Letsoalo,Shamsuddeen Hassan Muhammad,Vukosi Marivate
类目:Computation and Language (cs.CL)
关键词:campaigns span weeks, small annotator pools, difficult to sustain, sustain when campaigns, campaigns span
备注:
点击查看摘要
Abstract:Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph's free-marginal Kappa of $\kappa = 0.76$, "excellent," per-batch $\kappa$ falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $\kappa$ is temporal simultaneity: tweets labeled within one minute achieve $\kappa = 0.98$, while those labeled more than a day apart reach only $\kappa = 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $\kappa$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.
22. 【2605.27220】he Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
链接:https://arxiv.org/abs/2605.27220
作者:Zafar Hussain,Kristoffer Nielbo
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:modern RAG pipelines, RAG pipelines, substantial LLM inference, LLM inference costs, modern RAG
备注:
点击查看摘要
Abstract:In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.
23. 【2605.27204】GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing
链接:https://arxiv.org/abs/2605.27204
作者:Pujun Zheng,Wanying Ren,Jiacheng Yao,Guoxiu He,Star X. Zhao
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Scientific paper evaluation, Scientific paper, assessing a manuscript, Scientific, paper evaluation
备注:
点击查看摘要
Abstract:Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $\rho$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at this https URL.
24. 【2605.27195】EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization
链接:https://arxiv.org/abs/2605.27195
作者:Thomas Berkane,Maimuna S. Majumder
类目:Computation and Language (cs.CL)
关键词:show diminishing headroom, treat extracted points, small alignment shifts, penalizing small alignment, unordered key-value pairs
备注:
点击查看摘要
Abstract:Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods--three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems--we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity; across all four, ECS correlates 1.5--3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application--unlocking decades of outbreak data trapped in published figures--but the benchmark and metric apply directly to any structured time-series chart-extraction setting.
25. 【2605.27194】Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation
链接:https://arxiv.org/abs/2605.27194
作者:Ning Wu,Rui Liu,Xinkun Lin,Weixing Chen,Jinxi Xiang,Tao Wei,Lina Yao,Mingjie Li
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Distilling demonstration effects, hidden-space interventions offers, Distilling demonstration, full finetuning, demonstration effects
备注: Preprint. 20 pages, 6 figures
点击查看摘要
Abstract:Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset--backbone settings, while remaining competitive on coarse label-level CheXbert F1.
26. 【2605.27190】Learning When to Think While Listening in Large Audio-Language Models
链接:https://arxiv.org/abs/2605.27190
作者:Zhiyuan Song,Weici Zhao,Yang Xiao,Suhao Yu,Cheng Zhu,Jiatao Gu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
关键词:Large Audio-Language Models, interaction increasingly practical, Recent advances, advances in Large, Large Audio-Language
备注: 19 pages, 4 figures, 6 tables
点击查看摘要
Abstract:Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.
27. 【2605.27189】Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy
链接:https://arxiv.org/abs/2605.27189
作者:Serli Kopar,Roshan Prakash Rane,Christian Mychajliw,Lydia Federmann,Gerhard Eschweiler,Daniela Berg,Sam Gijsen,Paula Andrea Perez-Toro,Kerstin Ritter
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
关键词:mild cognitive impairment, study examines, examines the relationship, German neuropsychological assessment, cognitive impairment
备注:
点击查看摘要
Abstract:This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting ``specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting ``generalist'' representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.
28. 【2605.27186】MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation
链接:https://arxiv.org/abs/2605.27186
作者:Haoyu Zheng,Yun Zhu,Shu Yuan,Shangming Chen,Qing Wang,Wenqiao Zhang,Jun Xiao,Yueting Zhuang
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, solve tasks, fully specified prompt, prompt but degrade
备注:
点击查看摘要
Abstract:Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.
29. 【2605.27168】Grounding Text Embeddings in Stakeholder Associations
链接:https://arxiv.org/abs/2605.27168
作者:Jonathan Rystrøm,Sofie Burgos-Thorsen,Zihao Fu,Johan Irving Søltoft,Kenneth C. Enevoldsen,Chris Russell
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:analyse large corpora, Stakeholder Grounding Exercise, large corpora, corpora of complex, Stakeholder Grounding
备注:
点击查看摘要
Abstract:Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $\rho=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.
30. 【2605.27161】Formalization of Malagasy conjugation
链接:https://arxiv.org/abs/2605.27161
作者:Joro Ny Aina Ranaivoarison,Eric Laporte,Baholisoa Simone Ralalaoherivony
类目:Computation and Language (cs.CL)
关键词:Malagasy simple verbs, core linguistic work, linguistic work performed, Malagasy simple, paper reports
备注:
点击查看摘要
Abstract:This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. It uses the Unitex platform and comprised the contruction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.
31. 【2605.27156】LitSeg: Narrative-Aware Document Segmentation for Literary RAG
链接:https://arxiv.org/abs/2605.27156
作者:Ruikang Zhang,Zhanni Chen,Yiqiao Cai,Qi Su
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:enhances Large Language, Large Language Models, incorporating external knowledge, Large Language, enhances Large
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.
32. 【2605.27110】BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning
链接:https://arxiv.org/abs/2605.27110
作者:Xuan Luo,Yue Wang,Geng Tu,Jing Li,Ruifeng Xu
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:Boundary-Aware Iterative Trap, Iterative Trap, approaches malicious goals, Boundary-Aware Iterative, three-step jailbreak framework
备注:
点击查看摘要
Abstract:In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly advancing conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge request; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.
33. 【2605.27101】Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models
链接:https://arxiv.org/abs/2605.27101
作者:Oscar Chew,Serhii Honcharenko,Qian-Hui Chen,Patricia Lu,Dishant Zaveri,Khoa D. Doan,Kuan-Hao Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Video Large Language, Large Language Models, Large Language, reliably linking subjects, Video Large
备注:
点击查看摘要
Abstract:A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.
34. 【2605.27091】MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition
链接:https://arxiv.org/abs/2605.27091
作者:Anqi Hu,Zhiyuan Wang,Zijun Jia,Bo Fu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Reliable set-valued prediction, approaches typically rely, Reliable set-valued, open-ended question answering, existing conformal approaches
备注:
点击查看摘要
Abstract:Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.
35. 【2605.27088】LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring
链接:https://arxiv.org/abs/2605.27088
作者:Unggi Lee,Minchul Shin,Yeil Jeong,Sookbun Lee,Jeongsu Moon,Kyungtae Joo,Eunjoo Lee,Hoilym Kwon
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:math tutoring typically, tutoring typically requires, typically requires RL-based, requires RL-based training, Aligning LLMs
备注: 17 pages, 5 figures
点击查看摘要
Abstract:Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization-evolving only the system prompt via API calls-can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.
36. 【2605.27083】On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning
链接:https://arxiv.org/abs/2605.27083
作者:Xiaotian Ye,Xiaohan Wang,Mengqi Zhang,Shu Wu
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:Large Language Model, generate alternative fictitious, Large Language, Language Model, alternative fictitious knowledge
备注:
点击查看摘要
Abstract:Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. Our work further discusses the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.
37. 【2605.27072】E3: Issue-Level Backtesting for Automated Research Critique
链接:https://arxiv.org/abs/2605.27072
作者:Yashwardhan Chaudhuri,Sanyam Jain,Paridhi Mundra
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:identifying decision-relevant technical, decision-relevant technical concerns, automated review assistant, assistant that augments, engineering teams
备注:
点击查看摘要
Abstract:We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.
38. 【2605.27068】QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
链接:https://arxiv.org/abs/2605.27068
作者:Ye Yuan,Rui Song,Weien Li,Zeyu Li,Haochen Liu,Xiangyu Kong,Changjiang Han,Yonghan Yang,Zichen Zhao,Zixuan Dong,Fuyuan Lyu,Bowei He,Haolun Wu,Jikun Kang,Xue Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
关键词:Large Language Model, modeling in Large, Social deduction games, Language Model, Large Language
备注:
点击查看摘要
Abstract:Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at this https URL.
39. 【2605.27066】Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search
链接:https://arxiv.org/abs/2605.27066
作者:Mingyue Wang,Xingyu Xie,Hang Yang,Li Gao,Lixin Su,Ge Chen,Dawei Yin,Daiting Shi
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:engines handling queries, search engines handling, Baidu Search, engines handling, handling queries
备注: Accepted at KDD 2026
点击查看摘要
Abstract:Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks-temporal ordering, causal judgment, and timeline completion-that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters-demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.
40. 【2605.27062】FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions
链接:https://arxiv.org/abs/2605.27062
作者:Francisco Teixeira,Carlos Carvalho,Mariana Julião,Catarina Botelho,Rubén Solera-Ureña,Sérgio Paulo,Thomas Rolland,Ben Peters,Isabel Trancoso,Alberto Abad
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Automatic Speech Recognition, large-scale labeled corpora, Speech Recognition, Automatic Speech, largely depends
备注: Published in LREC2026
点击查看摘要
Abstract:State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.
41. 【2605.27050】BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation
链接:https://arxiv.org/abs/2605.27050
作者:Param Thakkar,Anushka Yadav,Michael Tiemann,Abhi Mehta,Akshita Bhasin,Shrinivas Khedkar
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:addressing persistent data, persistent data limitations, linguistically enriched English, enriched English, neural machine translation
备注:
点击查看摘要
Abstract:We present BhashaSetu, a linguistically enriched English--Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
42. 【2605.27045】ExTax: Explainable Disinformation Detection via Persuasion, Emotion, and Narrative Role Taxonomies
链接:https://arxiv.org/abs/2605.27045
作者:Shang Luo,Yingguang Yang,Zhenchen Sun,Yang Liu,Bin Chong,Jingru Chen,Yancheng Chen,Jiayu Liang,Kefu Xu,Hao Peng,Philip S. Yu
类目:Computation and Language (cs.CL)
关键词:making traditional syntax-semantic, verification increasingly insufficient, traditional syntax-semantic verification, syntax-semantic verification increasingly, highly fluent disinformation
备注:
点击查看摘要
Abstract:The democratization of LLMs has accelerated the generation and circulation of highly fluent disinformation, making traditional syntax-semantic verification increasingly insufficient. Such deception rarely relies solely on surface-level falsity; instead, it often combines persuasive rhetoric, emotional manipulation, and narrative role construction to influence readers' interpretations through multiple cognitive pathways. However, existing detectors typically emphasize isolated signals -- such as syntax, external knowledge, persuasion, or affective cues -- and therefore struggle to capture the multi-faceted manipulative intents underlying disinformation or provide human-auditable explanations. To address this gap, we present \textbf{ExTax}, a taxonomy-aligned framework for explainable disinformation detection. ExTax unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space, covering 6 persuasive-rhetoric strategies, 5 emotional-manipulation methods, and 6 narrative-role categories. It elicits attributes from multiple frontier LLMs, reconciles their disagreements through Entropy-driven Dynamic Label Smoothing, and fuses the resulting taxonomic representations with contextual encodings via Heterogeneous Multi-Head Attention, grounding each prediction in an interpretable manipulation profile. Across five cross-domain and cross-genre benchmarks, ExTax achieves an overall Macro $F_1$ of $0.8456$, outperforming state-of-the-art deep learning and LLM-based baselines. It also remains robust under severe genre imbalance, where the strongest deep baseline degrades from $0.9454$ to $0.6194$.
43. 【2605.27033】racing Computation Density in LLMs
链接:https://arxiv.org/abs/2605.27033
作者:Corentin Kervadec,Iuliia Lysova,Iuri Macocco,Marco Baroni,Gemma Boleda
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Transformer-based large language, wide computational graphs, Transformer-based large, large language models, computational graphs
备注:
点击查看摘要
Abstract:Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.
44. 【2605.27030】Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling
链接:https://arxiv.org/abs/2605.27030
作者:Xinglin Wang,Hao Lin,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Jiayi Shi,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
类目:Computation and Language (cs.CL)
关键词:Test-Time Scaling, allocating additional inference, additional inference compute, large language models, enhances the reasoning
备注: Preprint
点击查看摘要
Abstract:Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and cannot guide other branches in time. This information isolation causes substantial redundant exploration, as branches repeatedly rediscover information already found elsewhere and require more search steps to collect complete decision information needed to reach correct answers. To bridge this gap, we propose \textbf{Collaborative Parallel Thinking (CPT)}, a training-free inference framework that enables search-time information sharing across parallel branches. CPT extracts compact intermediate information from ongoing branches, maintains a deduplicated query-level information pool, and broadcasts pool entries through the input context, allowing each branch in subsequent search steps to reuse discoveries made by other branches rather than rediscover the same information. Empirically, experiments on HMMT and AIME benchmarks show that CPT establishes a stronger accuracy--latency Pareto frontier than strong baselines across rollout budgets and model scales, highlighting search-time collaboration as an effective direction for efficient parallel TTS.
45. 【2605.27025】Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations
链接:https://arxiv.org/abs/2605.27025
作者:Mohammad Amine Jradi,Faeze Ghorbanpour,Alexander Fraser
类目:Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:dataset construction challenging, making large-scale dataset, large-scale dataset construction, Hate speech, annotator disagreement
备注:
点击查看摘要
Abstract:Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving $R^2$ of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.
46. 【2605.27016】Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination
链接:https://arxiv.org/abs/2605.27016
作者:Yedidia Agnimo,Anna Korba,Annabelle Blangero,Nicolas Chesneau,Karteek Alahari
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
关键词:Large language models, hindering reliable deployment, Large language, hindering reliable, reliable deployment
备注: 35 pages, 7 figures, 9 tables
点击查看摘要
Abstract:Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
47. 【2605.27015】PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions
链接:https://arxiv.org/abs/2605.27015
作者:Ruhallah Niazi,Faeze Ghorbanpour,Alexander Fraser
类目:Computation and Language (cs.CL)
关键词:impressive multilingual capabilities, remain poorly evaluated, large language models, multilingual capabilities, remain poorly
备注:
点击查看摘要
Abstract:Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.
48. 【2605.27000】Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning
链接:https://arxiv.org/abs/2605.27000
作者:Yilong Li,Suman Banerjee,Tong Che
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:allocate test-time compute, canonical metric, Repeated sampling, allocate test-time, test-time compute
备注: Code reasoning; pass@K optimization; coordinated planning; verifiable rewards; strategy diversity
点击查看摘要
Abstract:Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, $R_{\mathrm{plan}} = J_\psi \cdot R_{\mathrm{out}}$, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@$K$ success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@$4$ over direct sampling, planning baselines, planner-only SFT, and pass@$K$-oriented RL under the same $K{=}4$ solver-attempt budget, with statistically significant gains on six of nine model--benchmark cells. The largest single gain is $+0.16$ on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ($0.588 \rightarrow 0.748$; paired bootstrap, $p 0.05$).
49. 【2605.26999】Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals
链接:https://arxiv.org/abs/2605.26999
作者:Akindoyin Akinrele,Shreyank N Gowda
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:reflect real-world operating, existing detection approaches, Prompt injection poses, real-world operating constraints, large language models
备注:
点击查看摘要
Abstract:Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.
50. 【2605.26978】PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech
链接:https://arxiv.org/abs/2605.26978
作者:Hanif Rahman
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:single ASR round-trip, ASR round-trip word, round-trip word error, word error rate, evaluation for low-resource
备注:
点击查看摘要
Abstract:Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
51. 【2605.26969】Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling
链接:https://arxiv.org/abs/2605.26969
作者:Alan Zhu,Mihran Miroyan,Carolyn Wang,Andrew Zhou,Lisa Dunlap,Narges Norouzi,Joseph E. Gonzalez
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:past context-action pairs, conversation turns, human-AI collaboration, context-action pairs, enabling the simulation
备注:
点击查看摘要
Abstract:User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.
52. 【2605.26959】MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving
链接:https://arxiv.org/abs/2605.26959
作者:Jinzheng Li,Zeru Zhu,Yuanjie Ren
类目:Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
关键词:prover that replaces, replaces sorry declarations, declarations with kernel-checkable, kernel-checkable proofs, theorem prover
备注:
点击查看摘要
Abstract:MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.
53. 【2605.26958】ournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation
链接:https://arxiv.org/abs/2605.26958
作者:Zixuan Yang,Yiqun Chen,Wei Yang,Erhan Zhang,Zihan Shen,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:reliable reference answers, challenging because reliable, reliable reference, reference answers, answers and automatic
备注:
点击查看摘要
Abstract:Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness--efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
54. 【2605.26956】LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation
链接:https://arxiv.org/abs/2605.26956
作者:Samy Haffoudhi(IP Paris, LTCI, DIG),Nikola Dobričić(IP Paris),Fabian Suchanek(IP Paris, LTCI),Nils Holzenberger
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:real world application, specific target knowledge, target knowledge bases, downstream NLP systems, downstream NLP
备注:
点击查看摘要
Abstract:Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to the specific target knowledge bases and domains, limiting their real world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER) -thereby providing a complete end-toend pipeline for entity-linking in real-world usage. We provide experimental results validating LELA's performance and robustness across diverse entity linking settings. In our demo, users can play with the system on their own input texts.
55. 【2605.26955】JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
链接:https://arxiv.org/abs/2605.26955
作者:Jiho Jin,Junho Myung,Juhyun Oh,Junyeong Park,Rifki Afina Putri,Sunipa Dev,Vinodkumar Prabhakaran,Alice Oh
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:brainstorming creative ideas, drafting personal communications, diverse cultural contexts, large language models, creative ideas
备注:
点击查看摘要
Abstract:As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether they can capture such thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and their countries' main languages. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.
56. 【2605.26954】AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian
链接:https://arxiv.org/abs/2605.26954
作者:Wajdi Zaghouani,Kholoud K. Aldous,Isra Fejzullaj
类目:Computation and Language (cs.CL)
关键词:Large Language Models, languages critically underserved, Language Models, leaving low-resource languages, Large Language
备注: Accepted at SIGUL2026 Workshop co-located with LREC2026
点击查看摘要
Abstract:Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, violence, racist content, child exploitation, and radicalization, with an average of 268 prompts per category. Each prompt is provided in Albanian with an English reference translation and a detailed category label. This resource addresses a significant gap in safety evaluation infrastruc-ture for low-resource languages and provides an essential benchmark for developing safer, more inclusive LLMs. The dataset will be provided upon request to support safety evaluation, fine-tuning, red-teaming, and guardrail development for Albanian-speaking communities.
57. 【2605.26952】Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement
链接:https://arxiv.org/abs/2605.26952
作者:Dingwei Chen,Zefang Zong,Zhipeng Ma,Leo Luo,Yang Li,Chengming Li,Peng Chen,Jie Jiang
类目:Computation and Language (cs.CL)
关键词:Agentic reinforcement learning, training LLM-based agents, external tool-use capabilities, knowledge boundary, intrinsic knowledge boundary
备注:
点击查看摘要
Abstract:Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at this https URL.
58. 【2605.26947】KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models
链接:https://arxiv.org/abs/2605.26947
作者:Wajdi Zaghouani,Shimaa Amer Ibrahim,Aruzhan Muratbek,Olzhasbek Zhakenov,Adiya Akhmetzhanova
类目:Computation and Language (cs.CL)
关键词:large language models, language models, underrepresented in resources, resources for evaluating, behavior of large
备注: Accepted at the SIGUL2026 Workshop co-located with LREC2026
点击查看摘要
Abstract:Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, child exploitation, sexual content, racist content, radicalization, and regulated goods or illegal activities. The dataset contains 5,717 prompts written natively in Kazakh (Cyrillic), organized by category, with English translations for cross-lingual analysis. Prompts resemble realistic user queries, often in a teen or child style, and are phrased as intent prompts without procedural instructions. We document the writing protocol, labeling procedures (including borderline-case decision rules), and quality-control steps (schema standardization, completeness checks, and deduplication). We also align the categories with widely used safety taxonomies to support integration with existing evaluation pipelines. Baseline results with GPT-4o show an overall refusal rate of 28.2%, varying from 5.5% to 53.8% across categories, indicating that Kazakh prompts expose category-specific safety gaps not captured by English-only evaluation.
59. 【2605.26940】Accountable Human-AI Deliberation with LLMs: Scaling Collective Intelligence through Symbiotic Scaffolding
链接:https://arxiv.org/abs/2605.26940
作者:Wajdi Zaghouani
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, support democratic deliberation, scales previously constrained, language models
备注: Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology
点击查看摘要
Abstract:Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.
60. 【2605.26937】Beyond Questions: Evaluating What Large Language Models (Actually) Know
链接:https://arxiv.org/abs/2605.26937
作者:Luca Giordano,Simon Razniewski
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:remains poorly understood, Parametric knowledge, poorly understood, remains poorly, Parametric
备注:
点击查看摘要
Abstract:Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2605.26937 [cs.CL]
(or
arXiv:2605.26937v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.26937
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
61. 【2605.26935】DunbaaBERT: From Sacrifice to Semantics
链接:https://arxiv.org/abs/2605.26935
作者:Iffat Maab,Waleed Jamil,Raphael Schmitt
类目:Computation and Language (cs.CL)
关键词:fragmented evaluation settings, Large language models, Large language, comparatively underexplored due, Urdu NLP benchmarks
备注:
点击查看摘要
Abstract:Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.
62. 【2605.26934】Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks
链接:https://arxiv.org/abs/2605.26934
作者:Yihua Zhu,Qianying Liu,Fei Cheng,Jiaxin Wang,Akiko Aizawa,Sadao Kurohashi,Hidetoshi Shimodaira
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Reinforcement learning, verifiable rewards, learning with verifiable, key limitation, limitation of existing
备注: Pre-print
点击查看摘要
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. Difficulty. Beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. Rewarded reasoning form. We consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.
63. 【2605.26924】Learning to Adapt SFT Data for Better Reasoning Generalization
链接:https://arxiv.org/abs/2605.26924
作者:Lisong Sun,Li Wang,Chen Zhang,Jinyang Wu,Kui Zhang,Tianhao Peng,Wenjun Wu
类目:Computation and Language (cs.CL)
关键词:Large language models, achieved remarkable progress, Large language, remarkable progress, achieved remarkable
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at this https URL.
64. 【2605.26918】Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation
链接:https://arxiv.org/abs/2605.26918
作者:Unggi Lee,Hoyoung Ahn,Yoon Choi,Seonmin Eun,Jahyun Jeong,Seonmin Jin,Harmony Jung,Hye Jin Kim,Chaerin Lee,Hyunji Lee,Jeongjin Lee,Soohwan Lee,Young-Seok Oh,Jaehyeon Park,Sun-ok Ryu,Sunyoung Shin,Yoorim Son,Haeun Park,Yeil Jeong
类目:Computation and Language (cs.CL)
关键词:Video generation models, existing benchmarks evaluate, rapidly entering classrooms, intrinsic faithfulness, generation models
备注:
点击查看摘要
Abstract:Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.
65. 【2605.26893】GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought
链接:https://arxiv.org/abs/2605.26893
作者:Weijiang Lv,Wentong Zhao,Jiayu Wang,Yuhao Wu,Jiaheng Wei,Xiaobo Xia
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, pervasive post-hoc rationalization, advanced large language, outcome-based supervision leads, language models
备注:
点击查看摘要
Abstract:Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.
66. 【2605.26891】nor Nordics Customer Service self-help corpus
链接:https://arxiv.org/abs/2605.26891
作者:Mike Riess
类目:Computation and Language (cs.CL)
关键词:manually validated documents, self-help corpus comprising, multilingual customer service, manually validated, million tokens
备注: 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: [this https URL](https://zenodo.org/records/19493152)
点击查看摘要
Abstract:This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at this https URL, intended to support reproducible research in Nordic NLP and information retrieval.
67. 【2605.26872】he Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
链接:https://arxiv.org/abs/2605.26872
作者:Zhengyu Hu,Zheyuan Xiao,Linxin Song,Fengqing Jiang,Yutai Li,Zhengyu Chen,Zhihan Xiong,Yue Liu,Junhao Lin,Yao Su,Lijie Hu,Kaize Ding,Xiao Teng,Radha Poovendran
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:LLM training increasingly, training increasingly relies, LLM training, tool-use demonstrations, increasingly relies
备注:
点击查看摘要
Abstract:LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 8 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.
68. 【2605.26849】Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning
链接:https://arxiv.org/abs/2605.26849
作者:Manh Nguyen,Sunil Gupta,Hung Le
类目:Computation and Language (cs.CL)
关键词:multiple responses improves, responses improves language, uniform compute allocation, Sampling multiple responses, questions remain under-explored
备注:
点击查看摘要
Abstract:Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at this https URL.
69. 【2605.26842】MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
链接:https://arxiv.org/abs/2605.26842
作者:Jiacheng Li,Jianchao Tan,Hongtao Xu,Jiaqi Zhang,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:produce geometry-aware updates, leveraging matrix orthogonalization, language model training, leveraging matrix, geometry-aware updates
备注:
点击查看摘要
Abstract:The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.
70. 【2605.26840】Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics
链接:https://arxiv.org/abs/2605.26840
作者:Yuxuan Ye,Raul Santos-Rodriguez,Edwin Simpson
类目:Computation and Language (cs.CL)
关键词:enhance specific capabilities, Reinforcement learning, learning with evaluation, enhance specific, specific capabilities
备注: EMNLP 2025 Findings
点击查看摘要
Abstract:Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model this http URL individual factuality metrics are unreliable, their combination can more effectively capture diverse factual errors. We leverage this insight to introduce an automated training pipeline that improves factual consistency in summaries by aggregating scores from different weak metrics. Our approach avoids the need for complex reward shaping by mapping scores to preferences and filtering out cases with high disagreement between metrics. For each source document, we generate lexically similar summary pairs by varying decoding strategies, enabling the model to learn from factual differences caused by subtle lexical differences. This approach constructs a high-quality preference dataset using only source this http URL demonstrate consistent factuality gains across models, ranging from early encoder-decoder architectures to modern large language models, with smaller models reaching comparable factuality to larger ones.
71. 【2605.26827】ContextGuard: Structured Self-Auditing for Context Learning in Language Models
链接:https://arxiv.org/abs/2605.26827
作者:Hongbo Jin,Chi Wang,Haoran Tang,Zhongjing Du,Xu Jiang,Jingqi Tian,Qiaoman Zhang,Jiayu Ding
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Recent benchmarks reveal, complex contextual knowledge, faithfully apply complex, apply complex contextual, large language models
备注:
点击查看摘要
Abstract:Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements.
72. 【2605.26823】Generating Logically Consistent Synthetic Supply Chain Data with LLM-Driven Knowledge Graph Reasoning
链接:https://arxiv.org/abs/2605.26823
作者:Yunbo Long,Ge Zheng,Liming Xu,Alexandra Brintrup
类目:Computation and Language (cs.CL)
关键词:supply chain analytics, supply chain, synthetic supply chain, Synthetic data offers, offers a promising
备注:
点击查看摘要
Abstract:Synthetic data offers a promising solution to two persistent barriers in supply chain analytics: data scarcity and data privacy. However, for synthetic data to support operational simulation and decision-making, it must do more than reproduce the statistical distributions of real records, and also preserve the \emph{operational logic} that governs supply chain processes, including the temporal orderings, mathematical dependencies, hierarchical taxonomies, and conditional rules that make a record operationally plausible. We consider this logic as the ``physics'' of supply chain data. Existing tabular generative models are primarily optimized for distributional fidelity and downstream predictive utility, and therefore often generate records that appear statistically realistic but violate fundamental operational constraints. This paper introduces \textbf{\textit{TabKG}}, a knowledge-graph-guided framework for logically consistent synthetic supply chain tabular data generation. TabKG constructs a \textbf{\textit{Column Relationship Knowledge Graph (CR-KG)}} to represent data operational dependencies. It uses a multi-LLM ensemble with majority voting to propose candidate relationships from column metadata, validates these relationships against real data to remove hallucinated or unsupported edges, and then uses the validated CR-KG to guide generation. Specifically, TabKG compresses the original table into independent columns, generates these columns using a latent diffusion model, and deterministically reconstructs dependent columns according to the validated relationships, enforcing logical consistency by construction with respect to the discovered operational rules.
73. 【2605.26801】Psychological Constructs in Shared Semantic Space
链接:https://arxiv.org/abs/2605.26801
作者:Hubert Plisiecki
类目:Computation and Language (cs.CL)
关键词:direct comparison difficult, makes direct comparison, separate instruments, research traditions, measured in separate
备注:
点击查看摘要
Abstract:Psychological constructs are often measured in separate instruments, datasets, and research traditions, which makes direct comparison difficult. This paper proposes a framework for making such constructs semantically commensurate by representing and comparing them as directions in a shared word-embedding space. Using Supervised Semantic Differential, we estimate construct-specific semantic gradients from text-outcome associations and project them onto theoretically motivated reference axes. As an initial test case, we use Valence, Arousal, and Dominance (VAD) as an affective coordinate system. First, we recover interpretable VAD directions from English word-level affective norms. Second, we project semantic gradients for 27 GoEmotions categories into this space and recover the expected organization of emotions, especially along valence and arousal. Third, we apply the same procedure to Big Five personality domains and facets derived from IPIP-NEO-300 item-factor associations. Domain-level placements are broadly coherent, while facet-level results are more exploratory because they rely on sparse questionnaire text. The results suggest that embedding spaces can support construct-level comparison across otherwise incommensurable psychological measurements, provided that semantic placements are assessed for stability and interpretability.
74. 【2605.26797】Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior
链接:https://arxiv.org/abs/2605.26797
作者:Zeyi Huang,Xuehai He,LiLiang Ren,Yiping Wang,Baolin Peng,Hao Cheng,Shuohang Wang,Pengcheng He,Jianfeng Gao,Yong Jae Lee,Yelong Shen
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:high-level source-layer hidden, study Latent Recurrent, source-layer hidden state, Latent Recurrent Transformer, cross-layer recurrent latent
备注:
点击查看摘要
Abstract:We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly 2 times baseline compute. Across nanochat style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% parameters.
75. 【2605.26788】SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability
链接:https://arxiv.org/abs/2605.26788
作者:Ramakrishna Vamsi Setti,Jagadeesh Rachapudi,Sachin Chaudhary,Praful Hambarde,Amit Shukla
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, achieve impressive performance, achieve impressive, revealed incrementally
备注:
点击查看摘要
Abstract:Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure; the best case, the aptitude only falls 16%, while the unreliability more than doubles (+112%). We argue that the root cause is structural, a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT Sentence-transformer Decision-Transformer, a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary semantic, lexical, and positional signals and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark in three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
76. 【2605.26785】EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation
链接:https://arxiv.org/abs/2605.26785
作者:Yunbo Long,Haolang Zhao,Lukas Beckenbauer,Liming Xu,Alexandra Brintrup
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Post-trained LLMs, making them safe, human preferences, optimized to align, align responses
备注:
点击查看摘要
Abstract:Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce \textbf{EmoDistill}, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns \emph{which} emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns \emph{how} to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.
77. 【2605.26770】Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids
链接:https://arxiv.org/abs/2605.26770
作者:Fabian Lukassen,Jan Herrmann,Christoph Weisser,Alexander Silbersdorff,Benjamin Saefken,Thomas Kneib
类目:Computation and Language (cs.CL)
关键词:Natural Language Explanations, Large Language Models, Large Language, Natural Language, outputs into Natural
备注:
点击查看摘要
Abstract:Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.
78. 【2605.26755】From Snippets to Semantics: Rethinking Evidence Granularity for Multilingual Fact Verification
链接:https://arxiv.org/abs/2605.26755
作者:Babu Kumar,Gaurav Kumar,Ayush Garg,Aditya Kishore,Jasabanta Patro
类目:Computation and Language (cs.CL)
关键词:fact verification requires, verification requires evidence, relevant and sufficiently, sufficiently complete, reliable factuality prediction
备注:
点击查看摘要
Abstract:Multilingual fact verification requires evidence that is both relevant and sufficiently complete for reliable factuality prediction. However, existing systems often rely on search snippets, sentence-level evidence, or locally segmented passages, which can miss decisive context and produce fragmented evidence. To overcome these limitations, we propose SEEK, a Semantic Evidence Extraction with an adaptive chunKing framework that constructs coherent evidence chunks from full fact-checking articles by identifying semantic topic transitions and preserving local verification context. The constructed chunks are encoded using a multilingual encoder and then multilingual LLMs are finetuned using LoRA adapter for veracity prediction. Experiments on X-FACT and RU22Fact show that SEEK improves macro-f1 by up to 10% over semantic chunking, 19% over sentence chunking, and 20% over search-snippet baselines. Evidence completeness and significance analyses further show that SEEK preserves richer verification context and enables more reliable multilingual fact-checking.
79. 【2605.26738】KARMA: Karma-Aligned Reward Model Adaptation
链接:https://arxiv.org/abs/2605.26738
作者:Jared Scott,Jesse Roberts
类目:Computation and Language (cs.CL)
关键词:Human communication depends, Human communication, Reward Model Adaptation, Reward Model, shaped by tone
备注:
点击查看摘要
Abstract:Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.
80. 【2605.26735】Rethinking the Multilingual Reasoning Gap with Layer Swap
链接:https://arxiv.org/abs/2605.26735
作者:Maxence Lasbordes,Amélie Chatelain,Djamé Seddah
类目:Computation and Language (cs.CL)
关键词:Recent reasoning Large, Large Language Models, reasoning Large Language, Language Models produce, Recent reasoning
备注:
点击查看摘要
Abstract:Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (\emph{native reasoning}) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (\emph{English-pivoted reasoning}). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili); fine-tune specialists in both native and English-pivoted regimes on top of \texttt{Qwen/Qwen3-8B-Base}, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9--3.5\% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce a Layer Swap: transferring the English specialist's stronger reasoning mid-layers into each native specialist, closing most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
81. 【2605.26731】It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
链接:https://arxiv.org/abs/2605.26731
作者:Yong-eun Cho
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:universally improve reliability, Toggle, optimal harness complexity, agent deployment holds, LLM agent deployment
备注: 9 pages, 3 figures
点击查看摘要
Abstract:A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.
Comments:
9 pages, 3 figures
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2605.26731 [cs.AI]
(or
arXiv:2605.26731v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2605.26731
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Yong Eun Cho [view email] [v1]
Tue, 26 May 2026 09:08:41 UTC (55 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled It’s Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers, by Yong-eun ChoView PDFHTML (experimental)TeX Source
view license
Current browse context:
cs.AI
prev
|
next
new
|
recent
| 2026-05
Change to browse by:
cs
cs.CL
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked="checked"class=“labs-tab-input”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
82. 【2605.26730】PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers
链接:https://arxiv.org/abs/2605.26730
作者:Ngoc Phan Phuoc Loc,Toan Huynh La Viet,Thanh Tran Khanh,Duy A Nguyen,Tuan Anh Nguyen Pham,Thanh Nguyen,Nitesh V. Chawla,Wray Buntine,Kok-Seng Wong,Khoa D. Doan,Binh T. Nguyen
类目:Computation and Language (cs.CL)
关键词:machine learning venues, Identification Major Issues, LLM-based automated peer, Major Issues Prioritization, scientific peer-review system
备注:
点击查看摘要
Abstract:The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems are actually, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment,Flaw Identification Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots -- failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at this https URL.
83. 【2605.26711】he Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models
链接:https://arxiv.org/abs/2605.26711
作者:Francesco Corielli
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:random regime governed, binary mixed-regime process, deterministic textual regime, unobserved latent state, mixed-regime process
备注:
点击查看摘要
Abstract:We construct a binary mixed-regime process with one deterministic textual regime and one random regime governed by an unobserved latent state. Even an ideal infinite-capacity sequence predictor that exactly recovers the text-only marginal law can become overconfident when the observed prefix is compatible with the wrong latent regime. The resulting entropy difference is not an ordinary optimization error; it is a sufficiency gap caused by marginalization over an unobserved state. We then formalize retrieval, tool use, and external grounding through an auxiliary binary signal with fidelity $\gamma \in [1/2,1]$. The resulting Bayesian update yields a contextual dominance threshold: a corrective signal reverses the posterior odds induced by the textual history exactly when its fidelity exceeds the text-only posterior weight assigned to the misleading regime. This threshold reduces, but does not generally eliminate, the sufficiency gap; complete closure requires perfect revelation of the relevant latent state or an equivalent verification mechanism. The analysis clarifies why temperature scaling cannot restore missing context, why grounding mechanisms must be both informative and learnably usable by the model, and why autonomous sequence models require structurally decoupled observers or verifiers in high-stakes domains.
84. 【2605.26689】PinPoint: Prompting with Informative Interior Points
链接:https://arxiv.org/abs/2605.26689
作者:Pouya Sadeghi,Shawn He,Pedro Pablo Guerrero Vela,C. Thomas,Alex Wong,Sirisha Rambhatla
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Modern referring image, referring image segmentation, image segmentation pipelines, segmentation pipelines couple, Modern referring
备注:
点击查看摘要
Abstract:Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.
85. 【2605.26683】An In-Vitro Study on Cross-Lingual Generalization in Language Models
链接:https://arxiv.org/abs/2605.26683
作者:Adrian Cosma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:data imbalance, models is difficult, difficult to study, study in natural, natural corpora
备注: 16 Figures, 1 Table
点击查看摘要
Abstract:Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.
86. 【2605.26678】NestedKV: Nested Memory Routing for Long-Context KV Cache Compression
链接:https://arxiv.org/abs/2605.26678
作者:Hong Chen,Xiang Liu,Yubo Gao,Yuxuan Fan,Bo Wang,Yuanlin Chu,Yuanguo Lin,Xuming Hu
类目:Computation and Language (cs.CL)
关键词:Long-context language models, Long-context language, Continuum Memory System, memory footprint, Long-context
备注:
点击查看摘要
Abstract:Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, or key distinctiveness -- which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k--32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at $r=0.75$; at $r=0.95$, it retains 37.32 on LongBench versus 17.55 for KeyDiff.
87. 【2605.26670】he Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models
链接:https://arxiv.org/abs/2605.26670
作者:Zheng Wang,Kaixuan Zhang,Wanfang Chen,Jingwen Zhang,Xiaonan Lu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:necessity remains unclear, large language models, targeted factual updates, remains unclear, large language
备注: Accepted for publication at ICML 2026
点击查看摘要
Abstract:Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one-time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null-space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne's thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at this https URL.
88. 【2605.26663】Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification
链接:https://arxiv.org/abs/2605.26663
作者:Jingxi Qiu,Zeyu Han,Cheng Huang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)
关键词:fact verification benchmarks, observationally similar, benchmarks can make, make them observationally, NEI
备注: Preprint. Under review. 20 pages, 2 figures
点击查看摘要
Abstract:Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.
89. 【2605.26662】AI evaluation may bias perceptions: The importance of context in interpreting academic writing
链接:https://arxiv.org/abs/2605.26662
作者:Shang Wu,Randol Yao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
关键词:ignore contextual differences, methods ignore contextual, paper examines, examines how estimates, scientific writing
备注:
点击查看摘要
Abstract:This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.
90. 【2605.26655】Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis
链接:https://arxiv.org/abs/2605.26655
作者:Shuzhi Gong,Hechuan Wen
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
关键词:large language model, Automated prompt optimization, tasks remains underperformed, Automated prompt, LLM backbones
备注: 17 pages, 4 figures, 8 tables
点击查看摘要
Abstract:Automated prompt optimization methods (e.g., DSpy, TextGrad) can substantially improve the performance of large language model (LLM), however, their generalization ability across different tasks remains underperformed. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To achieve the goal, we build upon the propensity-adjusted associational analysis together with multiple complementary representations of prompt edits, where the consistent task-conditioned edits patterns are identified. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and can generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.
91. 【2605.26646】UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
链接:https://arxiv.org/abs/2605.26646
作者:Yiqun Chen,Wei Yang,Erhan Zhang,Shijie Wang,Qi Liu,Zechun Niu,Bin Zhang,Haitao Li,Rui Li,Lingyong Yan,Jinyuan Feng,Biqing Qi,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:reinforcement learning interface, unified reinforcement learning, systems decompose complex, LLM-based multi-agent systems, decompose complex tasks
备注:
点击查看摘要
Abstract:LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Cite as:
arXiv:2605.26646 [cs.AI]
(or
arXiv:2605.26646v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2605.26646
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Yiqun Chen [view email] [v1]
Tue, 26 May 2026 07:30:03 UTC (1,181 KB)
92. 【2605.26645】Bounded Path Context: A Controlled Study of Visible Path History in LLM-Based Knowledge Graph Question Answering
链接:https://arxiv.org/abs/2605.26645
作者:Xihang Shan,Ye Luo
类目:Computation and Language (cs.CL)
关键词:knowledge-graph question answering, local relation-selection decisions, relation-selection decisions repeated, delegates graph traversal, language models
备注: 13 pages, 1 figure, submitted to EMNLP 2026
点击查看摘要
Abstract:LLM-based knowledge-graph question answering (KGQA) delegates graph traversal to language models, turning each question into a sequence of local relation-selection decisions repeated across beams and hops. A common but untested default is to serialize the complete partial path into every routing prompt, even though the controller already maintains this path as exact symbolic state. Bounded Path Context (BPC) decouples these two roles: the controller retains full paths in symbolic memory for answer extraction and audit, while the relation-selection prompt exposes only the question, the current entity, outgoing relation candidates, and at most the last K hops. A controlled sweep over K -- fixing graph neighborhoods, beam budget, depth, decoding, and answer-extraction format -- shows that bounded histories match or exceed full-history prompting on complete WebQSP and CWQ test sets with Qwen3.5-9B-AWQ: K=1 achieves 0.487 answer-set F1 on WebQSP versus 0.472 for full history, and K=0 reaches 0.287 on CWQ versus 0.274, with 9.7% and 12.1% fewer input tokens respectively. At the 4B scale, K=1 remains the strongest setting on both benchmarks. Per-example analysis reveals that 71-84% of examples are unaffected by history length, while the affected cases expose when prior hops disambiguate versus distract. These results suggest that path serialization length is better treated as a tunable interface variable than as a default assumption in LLM-based graph controllers.
93. 【2605.26620】Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering
链接:https://arxiv.org/abs/2605.26620
作者:Lukas Ellinger,Alexander Fichtl,Miriam Anschütz,Georg Groh
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Natural language conveys, language conveys information, Natural language, broad descriptions, language conveys
备注:
点击查看摘要
Abstract:Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.
94. 【2605.26612】LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation
链接:https://arxiv.org/abs/2605.26612
作者:Jinze Li,Xiaoyan Yang,Shuo Yang,Jinfeng Xu,Yue Shen,Jian Wang,Jinjie Gu,Edith Cheuk-Han Ngai
类目:Computation and Language (cs.CL)
关键词:large language models, language models requires, Personalized generation, frozen large language, compact and current
备注: Under review
点击查看摘要
Abstract:Personalized generation with frozen large language models requires a conditioning signal that is both compact and current. Existing personalization methods typically retrieve or summarize user histories in text, or compress them into static latent profiles and soft prompts. These approaches are efficient, but they treat a user's past behavior as an aggregate profile and therefore mix stable identity, recent drift, and item content in the same representation. We propose LAtent Trajectory Tracking and Extrapolation (LATTE), a framework that represents personalization as forecasting a peer anchored relative preference state. For each historical session, LATTE subtracts a time masked baseline formed from comparable users who responded to the same item, producing a state that measures how the target user differs from peers under a shared item context. A lightweight sequence predictor then forecasts the next state in this trajectory, and a State to Token Bridge injects the forecast into a frozen instruction tuned LLM through a single anchored soft token. We provide a latent factor analysis showing when peer anchoring cancels shared item variation and why temporal forecasting trades off stale averages against noisy recent states. Experiments on Amazon Reviews 2023 and MemoryCD show that LATTE consistently outperforms retrieval, summary memory, static latent profiles, difference aware latent profiles, and soft prompt compression baselines. On Amazon Reviews 2023, LATTE improves average ROUGE-L from 0.219 for a static latent profile and 0.245 for the strongest added latent compression baseline to 0.259. Additional pairwise comparisons and diagnostic analyses suggest that the improvement is mainly due to forecasting user-specific trajectory information, rather than merely adding a soft prompt interface.
95. 【2605.26575】Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models
链接:https://arxiv.org/abs/2605.26575
作者:Adib Sakhawat,Fardeen Sadab,Atik Shahriar
类目:Computation and Language (cs.CL)
关键词:Multilingual embedding models, query in language, translation in language, models are deployed, assumption that cross-lingual
备注: 17 pages, 5 figures
点击查看摘要
Abstract:Multilingual embedding models are deployed under the assumption that cross-lingual retrieval is symmetric: if a query in language A retrieves its translation in language B, the reverse should also hold. In practice it does not. Using a parallel corpus of 6,518 idiomatic and proverbial expressions in English, Bangla, Hindi, and Arabic, embedded by five production-grade encoders (Gemini, Mistral, OpenAI-L, OpenAI-S, Qwen), we formalise this failure as a deficit in mutual nearest-neighbour reciprocity and test a single mechanistic claim: among the geometric pathologies of multilingual spaces, hubness, not anisotropy, centroid drift, or magnitude, is the dominant causal driver. Across five pre-registered experiments with falsification conditions specified in advance, hub mass dominates a joint regression on reciprocity (49.5% dominance share, 1.68x the next predictor; partial R^2 = 0.302 versus 0.003 for anisotropy), while a hub-aware score correction (CSLS) closes 63.5% of the worst-to-best reciprocity gap and yields a mean within-model effect size 130x larger than surgical hub-vector ablation. The latter contrast pinpoints the mechanism: hubness is a pathology of the similarity metric, not of individual hub vectors. We resolve the well-known anisotropy-hubness paradox by showing the two are statistically dissociable, and we recommend replacing cosine similarity with CSLS as the default retrieval metric for multilingual embedding pipelines.
96. 【2605.26560】Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline
链接:https://arxiv.org/abs/2605.26560
作者:Michal Laufer,Yehudit Aperstein,Alexander Apartsin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Objective, MRI brain, carry follow-up instructions, follow-up instructions pairing, Abstract
备注: 17 pages, 5 figures
点击查看摘要
Abstract:Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.
97. 【2605.26537】Conceptual Steganography
链接:https://arxiv.org/abs/2605.26537
作者:Zhejian Zhou,Jonathan May
类目:Computation and Language (cs.CL)
关键词:Language Models, Language, reasoning, Abstract, LMs
备注:
点击查看摘要
Abstract:Language Models (LMs) emit Chains-of-Thought (CoTs) that drive much of their capability. However, the same sequence that carries useful reasoning can also covertly convey messages: a misaligned model may embed covert information in its CoT that slips through human supervision, a form of steganography known as encoded reasoning. Prior LM steganography schemes operate in the token or lexical space, and a content-preserving paraphraser is the canonical and effective defense in recent work. We introduce conceptual steganography, in which each step of a CoT carries information through patterns of high-level reasoning behavior, rather than through lexical choice. Across four model families and two reasoning domains, this backdoor communication channel is shown to be consistently more robust to a strong paraphrase defense than standard keyword approaches, and the encoding of information into CoTs does not affect their utility in the reasoning process. Having raised awareness of this new risk, we then demonstrate that a strategy-aware paraphraser can close much of the channel, highlighting new challenges and recommended defenses for ensuring faithful LLM reasoning in the wild.
98. 【2605.26533】A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
链接:https://arxiv.org/abs/2605.26533
作者:Malikussaid,Imad Gohar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Automated industrial inspection, linguistic interpretation left, industrial inspection requires, precise defect localization, Automated industrial
备注: 23 pages, 6 figures, 9 equations, and 6 tables
点击查看摘要
Abstract:Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes a YOLO26-x-obb oriented bounding-box detector localizes defects at dataset-native resolution. The Bridge a deterministic, parameter-free encoding module maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 0.41, HR=4%, and Expert Score = 8.6/10 compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
99. 【2605.26498】Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation
链接:https://arxiv.org/abs/2605.26498
作者:Zehua Pei,Hui-Ling Zhen,Yu Zhang,Sinno Jialin Pan,Mingxuan Yuan,Bei Yu
类目:Computation and Language (cs.CL)
关键词:Large language models, improved Verilog generation, Large language, Verilog generation, treat generation
备注:
点击查看摘要
Abstract:Large language models (LLMs) have improved Verilog generation from natural-language specifications, but most pipelines still treat generation as isolated sampling followed by functional checking. This is insufficient for practical RTL design, where useful Verilog must be correct, synthesizable, timing-conscious, and friendly to downstream hardware objectives. We present Verilog-Evolve, a feedback-driven framework for versioned Verilog refinement and cross-session skill evolution. For each task, Verilog-Evolve generates diverse minor candidates, evaluates them with executable feedback from functional simulation, Yosys synthesis, ABC timing proxy, and optional GEMM metrics, then promotes the best candidate into a major version under configurable scoring. To improve across tasks, the system maintains modular skill guidance, retrieves skills according to task and feedback context, and evolves candidate skills from logged histories through create/improve/skip decisions and verifier reports. Experiments on VerilogEval and mixed-precision GEMM tasks show that Verilog-Evolve improves final functional success and promotion stability while producing more downstream-friendly RTL under open-source synthesis, timing-proxy, and netlist-level GEMM objectives. Validation-gated skill evolution further improves GEMM downstream quality and achieves the best downstream score and GEMM held-out pass rate among the evaluated skill modes.
100. 【2605.26494】he MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
链接:https://arxiv.org/abs/2605.26494
作者:MiniMax:Aili Chen,Aonian Li,Baichuan Zhou,Bangwei Gong,Binyang Jiang,Boji Dan,Changqing Yu,Chao Wang,Cheng Ma,Cheng Zhong,Cheng Zhu,Chengjun Xiao,Chengyi Yang,Chengyu Du,Chenyang Zhang,Chi Zhang,Chuangyi Huang,Chunhao Zhang,Chunhui Du,Chunyu Zhao,Congchao Guo,Da Chen,Deming Ding,Dianjun Sun,Dongyu Zhang,Enhui Yang,Fei Yu,Guang Zheng,Guodong Zheng,Guohong Li,Haichao Zhu,Haigang Zhou,Haimo Zhang,Han Ding,Hao Zhang,Haohai Sun,Haolin Lyu,Haonan Lu,Haoyu Wang,Huajie Shi,Huiyang Li,Jiacheng Chen,Jian Zhang,Jiaqi Zhuang,Jiaren Cai,Jiaxin Pan,Jiayao Li,Jiayuan Song,Jichuan Zhang,Jie Wang,Jihao Gu,Jin Zhu,Jingwei Dong,Jingyang Li,Jingyu Zhang,Jingze Zhuang,Jinhao Tian,Jinli Liu,Jinyi Hu,Jun Tao,Jun Zhang,Junbin Ruan,Junhao Xu,Junjie Yan,Junteng Liu,Junxian He,Kang Xu,Ke Ji,Ke Yang,Kecheng Xiao,Keyu Duan,Keyu Li,Le Han,Letian Ruan,Li Yuan,Lianfei Yu,Liheng Feng,Lijie Mo,Lin Li,Lingye Bao,Lingyu Yang,Lingyuan Zhou,Loki,Lu Chen,Lunbin Ceng,Ming Li,Ming Zhong,Mingliang Tao,Mingyuan Chi,Mujie Lin,Nan Hu,Ningxin Chen,Peiyin Zhu,Peng Gao,Pengcheng Gao,Pengfei Li,Penglin Li,Pengyu Zhao,Qibin Ren
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:language models built, maximum real-world intelligence, unleash maximum real-world, language models, real-world intelligence
备注: Technical Report. 35 pages, 10 figures, 4 tables
点击查看摘要
Abstract:We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.
101. 【2605.26492】Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
链接:https://arxiv.org/abs/2605.26492
作者:Sil Hamilton,David Mimno
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:popular use case, low variability, show very low, LLM-generated stories, Abstract
备注:
点击查看摘要
Abstract:LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
102. 【2605.26485】OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants
链接:https://arxiv.org/abs/2605.26485
作者:Xudong Lu,Xueying Li,Annan Wang,Yang Bo,Jinpeng Chen,Zengliang Li,Nianzu Yang,Rui Liu,Xue Yang,Jingwen Hou,Hongsheng Li
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:omnimodal large language, native online inference, large language models, language models evaluated, real-time omnimodal large
备注:
点击查看摘要
Abstract:We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at this https URL.
103. 【2605.26476】FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
链接:https://arxiv.org/abs/2605.26476
作者:Jingbin Qian,Congwen Yi,Min Xia,Wen Wu,Jun Zhu,Jian Guan(a href="http://FutureFab.AI" rel="external noopener nofollow" class="link-external link-http"this http URL/a)
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:vertical domains remains, domains remains difficult, remains difficult due, Retrieval-Augmented Generation, diverse context scales
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.
104. 【2605.26463】owards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records
链接:https://arxiv.org/abs/2605.26463
作者:Yeonsu Kwon,Jiho Kim,Junseong Choi,Paloma Rabaey,Minseo Kim,Sujeong Im,Jeewon Yang,Jun-Min Lee,Sangji Lee,Jiwon Kim,Hangyul Yoon,Hyunwook Kwon,Edward Choi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Electronic Health Records, Health Records, Electronic Health, Data consistency, essential for patient
备注:
点击查看摘要
Abstract:Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.
105. 【2605.26457】Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
链接:https://arxiv.org/abs/2605.26457
作者:Anmol Agarwal,Natalie Neamtu,Pranjal Aggarwal,Seungone Kim,Jannis Limperg,Cedric Flamant,Kanna Shimizu,Bryan Parno,Sean Welleck
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
关键词:write real-world software, real-world software, Codeforces, Formal, coding agents
备注: Preprint
点击查看摘要
Abstract:AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, (b) testing them against official Codeforces tests adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, logs can be found at this https URL
106. 【2605.26454】Model Unlearning Objectives Vary for Distinct Language Functions
链接:https://arxiv.org/abs/2605.26454
作者:Berk Atil,Vipul Gupta,Rebecca J. Passonneau
类目:Computation and Language (cs.CL)
关键词:learn undesirable properties, toxic text generation, Large language models, Large language, learn undesirable
备注:
点击查看摘要
Abstract:Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals, dangerous-knowledge unlearning and toxicity unlearning. For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.
107. 【2605.26445】Curation and Extraction of Drug-Related Entities from Reddit Platform
链接:https://arxiv.org/abs/2605.26445
作者:Zewei Wang,Zihan Xu,Yishu Wei,Michael Chary,Yifan Peng
类目:Computation and Language (cs.CL)
关键词:Physicians learn primarily, clinical overdose cases, Physicians learn, overdose cases, limiting their understanding
备注: Accepted by IEEE International Conference on Healthcare Informatics (ICHI 2026)
点击查看摘要
Abstract:Physicians learn primarily about illicit drugs from clinical overdose cases, limiting their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.
108. 【2605.26444】MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies
链接:https://arxiv.org/abs/2605.26444
作者:Zhiyang Chen,Daliang Xu,Yinyuan Zhang,Chenghua Wang,Mengwei Xu,Yun Ma
类目:Computation and Language (cs.CL)
关键词:typically employ vocabularies, major computational bottleneck, final linear projection, linear projection layer, Large language models
备注:
点击查看摘要
Abstract:Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12-1.32x relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.
109. 【2605.26442】Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines
链接:https://arxiv.org/abs/2605.26442
作者:Hwanjun Song
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:alignment tuning literature, treated implicitly, literature is organized, alignment, alignment data
备注: Accepted at the Findings of ACL 2026
点击查看摘要
Abstract:Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment tuning as a pipeline design problem. We decompose alignment data construction into three interacting stages, response synthesis, preference evaluation, and preference instantiation, and use this framework to organize existing alignment methods into a unified taxonomy. Through this lens, we identify recurring design trade-offs and failure modes observed across prior alignment methods, and distill a set of high level principles that clarify how pipeline design choices influence the resulting optimization signal. Finally, we outline open challenges for alignment data pipelines, including prompt-level alignment, agentic settings, and alignment under evolving objectives.
110. 【2605.26440】Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks
链接:https://arxiv.org/abs/2605.26440
作者:Victor M. dos Santos,Andre C. Castro,Samuel L. de S. Toledo,Bruno M. L. Calura,Lisandra C. de M. Menezes,Raul C. R. Mata,Telma W. de L. Soares,Bryan L. M. de Oliveira
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:Large Language Models, labor-intensive expert curation, Language Models, remain heavily dependent, Large Language
备注:
点击查看摘要
Abstract:The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user-assistant dialogues into structured, verifiable requirement checklists. By leveraging the "instructional evolution" found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to $\rho$ = 1.000 with significantly lower computational overhead. Validation of the LLM-as-a-judge framework further confirms its reliability, with the primary evaluator achieving substantial agreement with human-verified ground truth ($\kappa$ = 0.705). Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation. Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.
111. 【2605.26438】LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
链接:https://arxiv.org/abs/2605.26438
作者:Igor Ivanov,David Demitri Africa
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, language models, models can recognize, behave differently, undermines the validity
备注:
点击查看摘要
Abstract:Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.
112. 【2605.26436】argeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models
链接:https://arxiv.org/abs/2605.26436
作者:Lin Yao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Discrete masked diffusion, LLaDA generate text, Discrete masked, masked diffusion language, iterative denoising
备注:
点击查看摘要
Abstract:Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies -- probability-based, trigger-mirrored, and temporal-difference-based -- and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model's native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption -- where correct reasoning produces a corrupted final answer -- and that T2M repairs 59.4% of such cases.
113. 【2605.26433】Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization
链接:https://arxiv.org/abs/2605.26433
作者:Weixin Liu,Bowen Qu,Juming Xiong,Congning Ni,Bradley A. Malin,Zhijun Yin
类目:Computation and Language (cs.CL)
关键词:Large language model, Large language, pass compact vector, language model, analytic workflows
备注: 30 pages, 2 figures; preprint
点击查看摘要
Abstract:Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.
114. 【2605.26431】Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent
链接:https://arxiv.org/abs/2605.26431
作者:Yuanhao Chen,Peter Chin
类目:Computation and Language (cs.CL); Applications (stat.AP)
关键词:Universal Dependencies, train on Universal, Structural probes train, Structural probes, Dependencies
备注:
点击查看摘要
Abstract:Structural probes train on Universal Dependencies (UD), which does not encode formal-syntactic abstractions such as phase boundaries or phase-internal cohesion. Whether large language models (LLMs) encode these remains an open question that UD-based probing cannot answer by construction. We evaluate structural probes on wh-movement stimuli where UD distances are invariant across conditions by design -- any non-zero effect therefore reflects structure beyond UD. The three conditions -- bare small clause, infinitival, and finite -- are ordered by the number of Minimalist Program (MP) phase boundaries the wh-element crosses. Across 13 LLMs from four families, we find a phase-count gradient on a cross-clause pair (12/13 models) and a 13/13 sign asymmetry on a within-clause pair whose UD distance is identical across conditions -- the latter specifically predicted by phase-internal cohesion, an MP abstraction invisible to UD by construction. Activation patching confirms the representations are causally active in 12/13 models. These findings suggest that distributional pretraining can induce representations aligned with formal-syntactic abstractions beyond the reach of annotation-based probing; UD-grounded probes provide a lower bound on syntactic encoding, not an upper bound.
Subjects:
Computation and Language (cs.CL); Applications (stat.AP)
Cite as:
arXiv:2605.26431 [cs.CL]
(or
arXiv:2605.26431v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.26431
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
115. 【2605.26428】Slide Deck QA Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation
链接:https://arxiv.org/abs/2605.26428
作者:Jim Salsman
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Generating high-quality, important instructional content, visual elements, difficult because important, important instructional
备注: 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices
点击查看摘要
Abstract:Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q\A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content. The working system is at this https URL The software repository is at this https URL
Comments:
15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices
Subjects:
Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
MSC classes:
68T50
ACMclasses:
K.3.1; D.2.2
Cite as:
arXiv:2605.26428 [cs.CL]
(or
arXiv:2605.26428v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.26428
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
116. 【2605.26414】Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
链接:https://arxiv.org/abs/2605.26414
作者:Matthew Kutakh
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, achieve impressive accuracy, mathematical reasoning benchmarks, Language Models
备注: 6 pages, 4 figures, 2 tables
点击查看摘要
Abstract:Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
117. 【2605.26405】owards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM
链接:https://arxiv.org/abs/2605.26405
作者:Younghun Lee,Amir Bralin,Nobel Sanjay Rebello,Dan Goldwasser
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Educational interventions, interventions are effective, effective tools, tools for enhancing
备注: 8 pages, Accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
点击查看摘要
Abstract:Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale university course (N 1000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework's pedagogical utility by analyzing the learning trajectories; we demonstrate how iterative conversations with LLM facilitate shifting one's misconception to correct understanding.
118. 【2605.26397】Annotator Positionality as Signal: Psychometric Weighting for Anti-Autistic Ableism Detection
链接:https://arxiv.org/abs/2605.26397
作者:Naba Rizvi,Harper Strickland,Saleha Ahmedi,Nedjma Ousidhoum
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:high-stakes settings affecting, affecting autistic communities, settings affecting autistic, Large language models, raising concerns
备注: main paper: 8 pages; total: 18 pages; 2 figures
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used in decision-making tasks where they can amplify or suppress perspectives, raising concerns in high-stakes settings affecting autistic communities. While previous research has identified disability-related biases in LLMs, it remains unclear how they conceptualize ableism or detect it in text. We introduce a bias-aware evaluation framework targeting anti-autistic ableist language with a psychometrically-weighted, community-proximate ground truth anchored in annotator positionality. This framework constitutes a stricter standard than conventional majority-vote aggregation which significantly and consistently underweights autistic and autism-accepting perspectives. We find that LLMs frequently produce harmful outputs, mislabel community-reclaimed language as ableist, and express more negative attitudes toward autistic people when assessment instruments are masked. Our error analysis reveals that models rely on surface-level keyword matching rather than contextual factors such as speaker identity, and whether the language fosters in-group solidarity or inflicts out-group harm.
119. 【2605.26396】Advancing Creative Physical Intelligence in Large Multimodal Models
链接:https://arxiv.org/abs/2605.26396
作者:Cheng Qian,Hyeonjeong Ha,Jiayu Liu,Jeonghwan Kim,Emre Can Acikgoz,Bingxuan Li,Kunlun Zhu,Jiateng Liu,Aditi Tiwari,Zhenhailong Wang,Xiusi Chen,Mahdi Namazifar,Heng Ji
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large multimodal models, Large multimodal, pattern recognition, rapidly advanced, advanced in perception
备注: 51 Pages, 9 Figures, 7 Tables, Previous Work CreativityBench: [arXiv:2605.02910](https://arxiv.org/abs/2605.02910)
点击查看摘要
Abstract:Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
120. 【2605.26394】Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study
链接:https://arxiv.org/abs/2605.26394
作者:Ravi Kumar Tummalapenta,Suman Addanki
类目:Computation and Language (cs.CL)
关键词:remains predominantly evaluated, Claude Sonnet, single-turn settings, analytics yet remains, remains predominantly
备注: 18 pages, 4 figures, 14 tables; includes appendices with verbatim prompts, example session, and full ablation tables; prepared by the LLM Suite Engineering Team, JP Morgan Chase Co
点击查看摘要
Abstract:Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in single-turn settings. We introduce EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark of 300 sessions and 1,400 turns built programmatically from three enterprise domains (BIRD financial, SEC EDGAR, Northwind), with deterministic ground truth and per-turn memory-critical annotation. We evaluate five frontier models -- GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 -- across five memory conditions enabling a three-way ablation isolating working-memory window size, episodic retrieval, and semantic augmentation as independent effects. All Claude models are evaluated with extended thinking enabled to maintain parity with GPT reasoning models. We introduce the Memory Benefit Score (MBS) as a per-turn diagnostic metric. Four findings emerge: (1) stateless multi-turn Text-to-SQL collapses to zero execution accuracy by Turn 3 across all five models, even under reasoning; (2) memory-architecture complexity does not monotonically improve accuracy -- working memory dominates, and additional components produce model- and dataset-dependent effects from +14 to -16 percentage points; (3) Claude Sonnet 4.6 underperforms Sonnet 4.5 by 17-33pp on SEC EDGAR across conditions, a generational regression persisting under reasoning; (4) under reasoning, Claude error distributions become mono-modal -- every non-correct turn is a wrong-result error. We release the benchmark, agent, and evaluation code.
121. 【2605.26365】Cultural Value Alignment Via Latent Activation Steering in Large Language Models
链接:https://arxiv.org/abs/2605.26365
作者:Trung Duc Anh Dang,Sarah Masud
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, homogenized cultural perspectives, exhibit homogenized cultural
备注: ACL 2026 Student Research Workshop (Non-Archival Track)
点击查看摘要
Abstract:Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model's latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs cultural value. We further introduce activation steering to shift these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, where interventions along one cultural dimension induce shifts along another. These results suggest that cultural values are encoded as coupled structures, limiting precise alignment. This work establishes a computationally efficient framework for cultural steering, highlighting the structural complexities when navigating global value with LLMs.
122. 【2605.26362】Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations
链接:https://arxiv.org/abs/2605.26362
作者:Shanghao Li,Jinda Han,Yibo Wang,Yuanjie Zhu,Zihe Song,Langzhou He,Kenan Kamel A Alghythee,Philip S. Yu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, sequential token representations, reasoning tasks, large language, typically linearized
备注: To appear in Proceedings of ACL 2026
点击查看摘要
Abstract:In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention disproportionately concentrates toward shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.
123. 【2605.26356】In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective
链接:https://arxiv.org/abs/2605.26356
作者:Mingchen Li,Jiatan Huang,Chuxu Zhang,Liang Zhao,Hong Yu
类目:Computation and Language (cs.CL)
关键词:implicit gradient descent, learning has recently, recently been linked, linked to implicit, induce a forward-pass
备注:
点击查看摘要
Abstract:In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.
124. 【2605.26355】Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention
链接:https://arxiv.org/abs/2605.26355
作者:Athanasios Zeris
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
关键词:pairwise token similarity, computes pairwise token, transformer attention computes, attention computes pairwise, equally local
备注: 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: [this https URL](https://github.com/AthanasiosZeris/energy-gated-attention)
点击查看摘要
Abstract:Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.
125. 【2605.26352】RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents
链接:https://arxiv.org/abs/2605.26352
作者:Mingchen Li,Hansi Zeng,Zhuo Qian,Jiatan Huang,Hamed Zamani,Hong Yu
类目:Computation and Language (cs.CL)
关键词:iteratively inspect evidence, language agents iteratively, agents iteratively inspect, inspect evidence, increasingly moving
备注:
点击查看摘要
Abstract:Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.
126. 【2605.26346】he Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology
链接:https://arxiv.org/abs/2605.26346
作者:Jason Holmes,Federico Mastroleo,Mariana Borras-Osorio,Srinivas Seetamsetty,Satomi Shiraishi,Mirek Fatyga,Judy C. Boughey,Cornelius A. Thiels,William G.Breen,Daniel J. Ma,Daniel K. Ebner,David M. Routman,Brady S. Laughlin,Carlos E. Vargas,Samir H. Patel,Sujay A. Vora,Nadia N. Laack,Andrew Y.K. Foong,Wei Liu,Mark R. Waddle
类目:Computation and Language (cs.CL)
关键词:Daily Dose, early clinical evaluation, clinical-trial identification system, identification system integrated, automated clinical summarization
备注: 28 pages, 4 figures, 1 table
点击查看摘要
Abstract:Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice. Design: Mixed-methods evaluation using a cross-sectional, anonymous clinician survey administered after 1 month of system deployment. Exposure: Daily automated delivery of physician-specific email summaries generated using RadOnc-GPT, including patient schedules, concise EHR-derived clinical-status summaries, and automated identification of potentially relevant clinical trials for new or consult visits. Main Outcomes and Measures: Primary outcomes included self-reported usability, satisfaction, perceived usefulness, perceived impact on workflow, time savings, and intention for continued use. Internal consistency reliability was assessed using Cronbach's $\alpha$. Results: Among 55 respondents, 52 (94.5\%) worked in radiation oncology, and 38 (69.1\%) were attending physicians. Most participants (83.6\%) reported using TDD daily or several times per week. Mean (SD) scores were 3.89 (1.04) for usability and satisfaction, 3.43 (1.24) for perceived usefulness, and 3.80 (1.17) for impact and future use (5-point Likert scale). Overall satisfaction was positively associated with perceived time savings ($p .001$). Participants reported variable time savings, with 27\% estimating $\geq 10$ minutes saved per day. The questionnaire demonstrated excellent internal consistency (overall Cronbach's $\alpha$ = 0.97).
127. 【2605.26340】ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
链接:https://arxiv.org/abs/2605.26340
作者:Rui Meng,Bhavana Dalvi Mishra,Jiefeng Chen,Chun-Liang Li,Palash Goyal,Mihir Parmar,Yiwen Song,Yale Song,Rajarishi Sinha,Parthasarathy Ranganathan,Burak Gokturk,Jinsung Yoon,Tomas Pfister
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:agents produce competitive, produce competitive solutions, research agents produce, Autonomous research agents, fabricated citations
备注: Project website: [this https URL](https://scientist-one.github.io/)
点击查看摘要
Abstract:Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
128. 【2605.26339】QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling
链接:https://arxiv.org/abs/2605.26339
作者:Preetam Sharma,Kacper Dobek
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Scalar post-training quantizers, post-training quantizers discard, quantizers discard pairwise, Scalar post-training, discard pairwise coordinate
备注:
点击查看摘要
Abstract:Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B--13B parameters) and eight quantized configurations, the activation-aware variant at $\approx 5.5$ bpw stays within $\pm 0.4\%$ of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at $32\%$ fewer weight bits. Joint 2D coding outperforms polar (amplitude $\times$ phase) coding by 2--15~pp $\Delta$PPL at equal bitrate, and paired KL against BF16 tracks $\Delta$PPL\% at Spearman $\rho = 0.99$ across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5~bpw variant is competitive on quantization-tolerant architectures. At strict 4~bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5--6~bpw band.
129. 【2605.26320】MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding
链接:https://arxiv.org/abs/2605.26320
作者:Sai Munikoti,Ian Stewart,Chengping Chai,Lisa Linville,Scott Vasquez,Sameera Horawalavithana,Karl Pazdernik
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:remains limited due, multiple data modalities, scientific domains remains, domains remains limited, text and images
备注:
点击查看摘要
Abstract:The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of timeseries waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large scale structured multimodal seismic dataset, comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event data integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set on top of raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross modal analysis. We leverage MISCE to finetune an existing multimodal model (Unified IO 2) enhanced with a specialized timeseries encoder, which yields SeisModal, the first domain specific multimodal model for comprehensive seismic analysis. Evaluation of state of the art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general purpose models, while demonstrating SeisModal's superior performance on seismic multimodal reasoning tasks. These results prove that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain specific architectural adaptations.
130. 【2605.26302】Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
链接:https://arxiv.org/abs/2605.26302
作者:Jianing Zhu,Yeonju Ro,John Robertson,Kevin Wang,Junbo Li,Haris Vikalo,Aditya Akella,Zhangyang Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:persistent operational systems, freshly initialized models, persistent operational, evaluated like freshly, freshly initialized
备注:
点击查看摘要
Abstract:Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 - 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
131. 【2605.26293】CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
链接:https://arxiv.org/abs/2605.26293
作者:Mike Zhang,Ali Basirat,Desmond Elliott
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Prior work establishes, Prior work, work establishes, establishes that controlled, controlled contrastiveness
备注:
点击查看摘要
Abstract:Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
132. 【2605.26292】Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning
链接:https://arxiv.org/abs/2605.26292
作者:Taha Koleilat,Hassan Rivaz,Yiming Xiao
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:ambiguous image-text alignment, precise multimodal understanding, image-text alignment, crucial for precise, precise multimodal
备注: MICCAI 2026 Early Accept; Project Page: [this https URL](https://tahakoleilat.github.io/Evi-Steer)
点击查看摘要
Abstract:Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at this https URL.
133. 【2605.26275】SPEAR: Code-Augmented Agentic Prompt Optimization
链接:https://arxiv.org/abs/2605.26275
作者:Mengyin Lu,Cong Feng,Huimin Han,Guangming Lu,Yu Sun,Xiaonan Ding,Shihui Long,Fengyi Li,Tanvi Motwani
类目:Computation and Language (cs.CL)
关键词:Automatic prompt engineering, existing APE loops, APE loops treat, Sandboxed Prompt Engineer, Automatic prompt
备注: 19 pages, 3 figures, EMNLP 2026 submission
点击查看摘要
Abstract:Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools -- evaluate, python, set_prompt, finish -- that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per group metrics) the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($\kappa$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $\kappa$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7 SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ($\Delta \approx +0.79\kappa$ on the 5-class tool-selection judge, $\Delta \approx +0.35\kappa$ on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.
134. 【2605.26186】SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
链接:https://arxiv.org/abs/2605.26186
作者:Zihang Zhou,Ziqian Ren,Yukai Wu,Yingjie Xiong,Wei Zhou,Chao Peng,Dong Zhang,Bingheng Yan,Xuanhe Zhou,Fan Wu
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:repository documented features, Functionality-correct repository setup, build scripts, Functionality-correct repository, documented features
备注: 21 pages, 6 figures
点击查看摘要
Abstract:Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully-crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at this https URL.
135. 【2605.26174】A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
链接:https://arxiv.org/abs/2605.26174
作者:Hiroki Fukui
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:Production language-model systems, Production language-model, language-model systems answer, answer a request, request by partitioning
备注: 24 pages, 2 figures. Data and code: doi: [https://doi.org/10.5281/zenodo.20372696](https://doi.org/10.5281/zenodo.20372696)
点击查看摘要
Abstract:Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one integrated report. We ask what this does to a class of defect no single worker can see: a contradiction in the relation between two distant sections of a document. Holding the documents, defects, mechanism, scoring, and seed fixed, we vary only the model -- ten systems across five generations from one developer and five providers from distinct alignment paradigms. Two layers separate. First, a universal detection cliff: every model that finds these cross-section defects under a single agent loses that ability under orchestration, detection falling two-thirds or more across every paradigm tested. The cliff is mechanism-derived and not closed by scale or extended reasoning. Second, how models behave once fallen. A signal-detection decomposition shows that, among the six models discriminating above chance, only one developer's generations move along the reporting-criterion axis: as alignment is strengthened, the model misses fewer defects yet raises more false alarms on clean documents -- two faces of one criterion shift, scaling with generation within that developer (p 0.001) and near-absent elsewhere. At the floor the missed defect is often not out of view: the model's private record reconstructs the structural fault accurately, while the integrated report signs off on its soundness, its concern spent on the artifact and an absent collaborator. This resists quantification -- an automated judge is unstable (precision 17-50%) and keywords cannot separate it from ordinary agreement -- a resistance we report as a finding. We release all runs, probes, defect keys, scorer prompts, and scripts. An integrated report's confidence is uninformative about partition-spanning defects, the most aligned systems are not the safest, and the cliff is structural.
Comments:
24 pages, 2 figures. Data and code: doi:https://doi.org/10.5281/zenodo.20372696
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Cite as:
arXiv:2605.26174 [cs.SE]
(or
arXiv:2605.26174v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2605.26174
Focus to learn more
arXiv-issued DOI via DataCite</p>
136. 【2605.26165】ool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets
链接:https://arxiv.org/abs/2605.26165
作者:Furkan Sakizli
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:critical resource conflict, Agentic RAG systems, equip language models, resource conflict, retrieval-augmented generation
备注: 12 pages (8 main + 4 appendix), 7 tables, 2 figures. Code and data: [this https URL](https://github.com/SKZL-AI/tscg)
点击查看摘要
Abstract:Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta = 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.
137. 【2605.26133】Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
链接:https://arxiv.org/abs/2605.26133
作者:Ziyi Tong,Feifei Sun,Le Minh Nguyen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, paradigm in NLP, Large Language, predominant paradigm, LLM pretraining corpus
备注: accepted by NLDB 2025
点击查看摘要
Abstract:Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.
Comments:
accepted by NLDB 2025
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2605.26133 [cs.CL]
(or
arXiv:2605.26133v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.26133
Focus to learn more
arXiv-issued DOI via DataCite
Related DOI:
https://doi.org/10.1007/978-3-031-97144-0_14
Focus to learn more
DOI(s) linking to related resources</p>
138. 【2605.26132】Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline
链接:https://arxiv.org/abs/2605.26132
作者:Tony Lee,Percy Liang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:post-trained large language, large language models, feedback from tools, post-trained large, large language
备注:
点击查看摘要
Abstract:Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark's use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.
信息检索
1. 【2605.27294】Separating Semantic Competition from Context Length in RAG Reading
链接:https://arxiv.org/abs/2605.27294
作者:Vyzantinos Repantis,Ameya Gawde,Harshvardhan Singh,Rohit Alekar,Cien Zhang,Svetlana Karslioglu,Akash Vishwakarma
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Retrieval-augmented generation, systems can respond, respond incorrectly, Retrieval-augmented, passages
备注: 4 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.
2. 【2605.27220】he Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
链接:https://arxiv.org/abs/2605.27220
作者:Zafar Hussain,Kristoffer Nielbo
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:modern RAG pipelines, RAG pipelines, substantial LLM inference, LLM inference costs, modern RAG
备注:
点击查看摘要
Abstract:In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.
3. 【2605.27204】GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing
链接:https://arxiv.org/abs/2605.27204
作者:Pujun Zheng,Wanying Ren,Jiacheng Yao,Guoxiu He,Star X. Zhao
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Scientific paper evaluation, Scientific paper, assessing a manuscript, Scientific, paper evaluation
备注:
点击查看摘要
Abstract:Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $\rho$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at this https URL.
4. 【2605.27123】Rethinking Agentic RAG: Toward LLM-Driven Logical Retrieval Beyond Embeddings
链接:https://arxiv.org/abs/2605.27123
作者:Yuqi Zeng,Qixiang Deng,Yulei Wan,Ruiquan Jiang,Xiaoqing Zheng,Xuanjing Huang
类目:Information Retrieval (cs.IR)
关键词:Recent advances, iteratively refine queries, refine queries based, intermediate results, multiple turns
备注:
点击查看摘要
Abstract:Recent advances in RAG have shifted toward an agentic paradigm, where LLMs interact with retrieval systems over multiple turns and iteratively refine queries based on intermediate results. At the same time, LLMs have demonstrated a strong ability to construct structured queries that precisely express their information needs. However, contemporary RAG systems remain heavily focused on engineering complex retrieval backends, including dense, hybrid, and graph-based retrieval architectures. In this study, we argue that agentic RAG should delegate greater control to the LLM to steer the retrieval process, while relying on a lightweight retrieval interface that provides fine-grained control and faithfully executes the LLM's structured intent. Guided by this principle, we propose an agentic RAG framework that enables LLMs to formulate retrieval intents using logical expressions while simplifying the retrieval backend to an inverted-index-based system. Extensive experiments show that our framework matches a strong agentic hybrid baseline, while substantially reducing construction and serving cost. Moreover, we show that anchoring the retrieval process in logical queries substantially reduces hallucinations in generated responses.
5. 【2605.27105】Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG
链接:https://arxiv.org/abs/2605.27105
作者:Jorge Gabín,Anxo Perez,Javier Parapar
类目:Information Retrieval (cs.IR)
关键词:Retrieval-Augmented Generation, controversial design choices, systems rely, rely on retrieved, critical yet controversial
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems rely on retrieved documents being concatenated into a model's input context, making both document ordering and context size critical yet controversial design choices. Prior work reports position-based effects such as lost in the middle and related long-context phenomena. However, empirical findings remain inconsistent and hard to reproduce across models, datasets, and evaluation protocols. In this paper, we present a systematic reproducibility study that revisits these claims and examines how they evolve with contemporary LLMs under a controlled evaluation framework. We first show that topic sampling is a major source of variance: small topic sets can mask or exaggerate ordering effects. Based on repeated subset sampling across multiple topic budgets, we provide a practical calibration procedure that identifies topic counts yielding stable trends at feasible cost. Using these fixed topic sets, we then reproduce and extend results on position sensitivity, re-evaluating lost in the middle and positional biases in modern LLMs. Then, we also study a more realistic RAG scenario in which relevance is mediated by a retriever rather than oracle access to ground-truth documents. In this setting, we re-examine a recent industry study and identify discrepancies to evaluation choices such as limited topic coverage and reliance on LLM-based judges. Finally, we conduct an analysis of how retrieval order and context size affect downstream LLM performance under imperfect retrieval. Our results demonstrate that both factors interact strongly with retrieval quality and model choice, and that conclusions drawn from idealised setups do not always transfer to real-world RAG pipelines. We release all code and configurations to support reproducibility and future work on robust RAG evaluation.
6. 【2605.27103】MuChator: Enabling Active Music Discovery via Conversational Music LLMs in Douyin Music
链接:https://arxiv.org/abs/2605.27103
作者:Jiahao Liang,Linzhi Huang,Xuannan Liu,Xukai Wang,Xuanpu Luo,Yongchun Zhu,Jingwu Chen,Feng Zhang,Xiao Yang
类目:Information Retrieval (cs.IR)
关键词:passively explore music, feed-based discovery paradigm, users passively explore, Music, adopts an immersive
备注:
点击查看摘要
Abstract:Douyin Music, a large-scale platform with millions of daily users, adopts an immersive, feed-based discovery paradigm, where users passively explore music through continuous recommendations. While effective for passive music discovery, this paradigm restricts users to recommendation results and provides limited support for explicitly specifying listening intents. Unlike conventional search, where users express well-defined intents through explicit queries such as specific songs or artists, real-world active music discovery is often situational and colloquial, involving vague or underspecified requests. While LLMs enable natural language interaction, their direct use in music discovery remains limited by insufficient music-domain knowledge, lack of music-query collaborative reasoning, and shallow understanding of personalized preferences. To address these challenges, we introduce MuChator, an interactive MusicLLM-based framework that enables users to actively express situational music intents in natural language. MuChator incorporates three key components: (1) Music Knowledge Pre-training, a three-stage scheme that incrementally injects objective music knowledge, subjective music knowledge, and personalized music preferences into LLMs; (2) Context-aware Instruction Tuning, which constructs high-quality user-query-music triplets through an automated synthesis pipeline to align LLMs with active and situational user intents; and (3) Preference Alignment with Hybrid RM, which jointly models intent relevance, personalized preferences, and basic constraints, and is optimized using GRPO-based reinforcement learning. Extensive evaluations on industrial music recommendation datasets demonstrate that MuChator outperforms leading proprietary models, such as Gemini-3-Pro. The model has been deployed on Douyin Music App within ByteDance, with 46.49\% improvement of user active days in online A/B test.
7. 【2605.27066】Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search
链接:https://arxiv.org/abs/2605.27066
作者:Mingyue Wang,Xingyu Xie,Hang Yang,Li Gao,Lixin Su,Ge Chen,Dawei Yin,Daiting Shi
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:engines handling queries, search engines handling, Baidu Search, engines handling, handling queries
备注: Accepted at KDD 2026
点击查看摘要
Abstract:Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks-temporal ordering, causal judgment, and timeline completion-that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters-demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.
8. 【2605.26941】he 2nd EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval
链接:https://arxiv.org/abs/2605.26941
作者:Junchen Fu,Xuri Ge,Xin Xin,Alexandros Karatzoglou,Ioannis Arapakis,Xi Wang,Qijiong Liu,Qian Li,Joemon M. Jose
类目:Information Retrieval (cs.IR); Multimedia (cs.MM)
关键词:attracted increasing attention, pretrained multimodal foundation, attracted increasing, increasing attention, pretrained multimodal
备注: Accepted as a workshop proposal at ACM Multimedia 2026
点击查看摘要
Abstract:Multimodal representation learning has attracted increasing attention in AI, driven by the strong performance of large, pretrained multimodal foundation models such as Qwen, LLaVA, and CLIP. These models deliver impressive performance on a range of multimodal information retrieval (MIR) tasks, including web search, cross-modal retrieval, and recommender systems. Yet their massive parameter counts create major efficiency bottlenecks when adapting their representations for IR tasks during training, deployment, and inference. These limitations hinder the practical use of foundation models for representation learning in information retrieval. To address these issues, we propose organizing the EReL@MIR workshop at MM 2026, bringing together researchers from academia and industry to discuss emerging solutions, open challenges, and new efficiency metrics and benchmarks for multimodal IR representation learning in the foundation-model era. The workshop's official website is available at this https URL.
9. 【2605.26902】ICICLE: Expanding Retrieval with In-Context Documents
链接:https://arxiv.org/abs/2605.26902
作者:Yu-Chen Den,Yung-Yu Shih,Zhi Rui Tam,Kuan-Yu Chen,Pu-Jen Cheng,Yun-Nung Chen,Eugene Yang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:maps queries directly, corpus expansion costly, design makes corpus, makes corpus expansion, requires updating model
备注:
点击查看摘要
Abstract:Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge, However, this design makes corpus expansion costly: adding new documents requires updating model parameters to encode new document-docid associations incurs repeated training and catastrophic forgetting of previously indexed documents. In this work, we revisit incremental GR as an in-context retrieval problem, where newly added documents are supplied as inference-time document-docid evidence. We propose ICICLE, an in-context indexing framework that performs source-aware docid generation over both parametric memory and context-provided document-docid pairs. ICICLE combines a `[COPY]`-based routing mechanism, preference-based calibration, and large context adaptation to distinguish context-grounded retrieval from parametric retrieval. Experiments on MS MARCO and NQ320K show that ICICLE improves retrieval of newly introduced documents while preserving seen-document retention without corpus-specific retraining. Our analysis further shows that high-shot degradation is mainly caused by routing failure, highlighting source-selection calibration as a key bottleneck for scaling in-context generative retrieval.
10. 【2605.26819】RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender
链接:https://arxiv.org/abs/2605.26819
作者:Francesco Granata,Lorenzo Lamazzi,Misael Mongiovì,Francesco Poggi,Valeria Secchini
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:neurosymbolic recommender system, Graph-Enhanced Academic Recommender, Retrieval-Augmented Graph-Enhanced Academic, neurosymbolic recommender, recommender system
备注:
点击查看摘要
Abstract:We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recommendation. RAGEAR combines dense retrieval over full lecture transcripts with a symbolic Knowledge Graph modelling courses, lessons, transcript chunks, credits, study plans, and curricular information. The Knowledge Graph supports symbolic filtering and contextualisation based on structured constraints, such as credits, academic disciplines, study plans, and prerequisites. Unlike metadata-based approaches, it exploits fine-grained instructional content by retrieving transcript chunks semantically aligned with a student's query. The main contribution is a graph-aware aggregation function that propagates chunk-level evidence to course-level recommendations. The score combines three factors: the share of retrieved similarity associated with a course, the rank-based strength of its relevant chunks, and the distribution of evidence across lessons. We evaluate RAGEAR on 152 student-like queries through a human evaluation sample and a large-scale LLM-based relevance assessment. Results show that lecture transcripts improve over metadata-only retrieval, and that RAGEAR further improves ranking quality over a transcript-based normalized SumP baseline, especially for top-ranked recommendations.
11. 【2605.26717】L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation
链接:https://arxiv.org/abs/2605.26717
作者:Pingjun Pan,Tingting Zhou,Peiyao Lu,Tingting Fei,Hongxiang Chen,Chuanjiang Luo
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Adapting large language, large language models, recommendation requires aligning, Adapting large, personalized recommendation requires
备注: Accepted at SIGIR 2026
点击查看摘要
Abstract:Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level (e.g., injecting behavioral embeddings into the token space) or the output level (e.g., contrastive alignment of separate encoders), suffering from distribution gaps or lack of end-to-end task supervision. In this work, we introduce L2Rec, which unifies behavioral and semantic understanding at the parameter level of LLMs. Our key insight is that the same set of Transformer parameters can serve as a shared medium for both views: by applying view-specific, personalized low-rank perturbations via a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism, L2Rec enables a single LLM backbone to produce complementary behavioral and semantic adaptations for each user with minimal representation-level misalignment. An adaptive cross-view fusion module further integrates the dual-view outputs into a unified user preference. Experiments on four datasets show that L2Rec consistently outperforms state-of-the-art baselines, and online A/B testing on a large-scale industrial platform validates significant improvements in key engagement metrics.
12. 【2605.26663】Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification
链接:https://arxiv.org/abs/2605.26663
作者:Jingxi Qiu,Zeyu Han,Cheng Huang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)
关键词:fact verification benchmarks, observationally similar, benchmarks can make, make them observationally, NEI
备注: Preprint. Under review. 20 pages, 2 figures
点击查看摘要
Abstract:Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.
13. 【2605.26578】Is Position Bias in Dense Retrievers Built In-or Learned from Data?
链接:https://arxiv.org/abs/2605.26578
作者:Daegon Yu,SeungYoon Han,Woomyoung Park
类目:Information Retrieval (cs.IR)
关键词:Dense retrievers exhibit, Dense retrievers, retrievers exhibit positional, exhibit positional bias, degrading retrieval performance
备注:
点击查看摘要
Abstract:Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects retrieval-level bias direction. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: skewed training distributions favor evidence at the corresponding positions. Position-balanced training reduces positional sensitivity by 57--87\% on position-aware benchmarks, with competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify training-position distribution as a major controllable factor in retrieval-level position bias and suggest balanced data curation as a practical mitigation strategy.
14. 【2605.26476】FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
链接:https://arxiv.org/abs/2605.26476
作者:Jingbin Qian,Congwen Yi,Min Xia,Wen Wu,Jun Zhu,Jian Guan(a href="http://FutureFab.AI" rel="external noopener nofollow" class="link-external link-http"this http URL/a)
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:vertical domains remains, domains remains difficult, remains difficult due, Retrieval-Augmented Generation, diverse context scales
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.
15. 【2605.26474】Generalized Range Filtering Approximate Nearest Neighbor Search: Containment and Overlap [Technical Report]
链接:https://arxiv.org/abs/2605.26474
作者:Yingfan Liu,Tong Wu,Jiadong Xie,Yang Zhao,Jeffrey Xu Yu,Jiangtao Cui
类目:Databases (cs.DB); Information Retrieval (cs.IR)
关键词:Approximate nearest neighbor, garnered significant attention, recently garnered significant, Approximate nearest, ANN search
备注: The paper has been accepted by KDD 2026
点击查看摘要
Abstract:Approximate nearest neighbor (ANN) search with range filters has recently garnered significant attention. This paper delves into a generalized form of this problem, i.e., ANN search with exact range-range (RR) predicates on a range-valued attribute, named RR filtering ANN (RRANN). Specifically, given $n$ vectors in $\mathbb{R}^d$, each vector $v_i$ is associated with a numeric range $[l_i, r_i]$, symbolizing aspects like a price range or time interval. An RRANN query $(v_q, l_q, r_q)$ aims at finding $k$ vectors closest to $v_q$ within the vectors satisfying an arbitrary RR predicate defined between the query range $[l_q, r_q]$ and the object range $[l_i, r_i]$. The RR predicate remains unspecified, enabling user-defined conditions. It may encompass containment ($[l_i, r_i] \subseteq [l_q, r_q]$ or $[l_q, r_q] \subseteq [l_i, r_i]$), overlap ($l_i \le l_q \le r_i \le r_q$ or $l_q \le l_i \le r_q \le r_i$), or a disjunction of them. RRANN has broad applications in queries related to price ranges or time intervals, and it generalizes existing variants of ANN search with range filters. However, existing dedicated approaches for these problems lack the capacity to support queries with arbitrary RR predicates. Hence, we introduce a new approach, labeled multi-segment tree graph. It efficiently handles arbitrary RR predicates by avoiding traversal through non-predicate-satisfied nodes, and keeps equivalent index size and construction time to state-of-the-art methods for RFANN. Extensive experiments on real-world data demonstrate the efficacy of our approach in RRANN queries, achieving up to 12.5x speedups with the same accuracy as the baselines. Moreover, our approach attains comparable RFANN search performance and notably superior IFANN and TSANN search performance compared to the respective state-of-the-art approaches. Our code is available at this https URL.
16. 【2605.26424】Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation
链接:https://arxiv.org/abs/2605.26424
作者:Ge Fan,Nan Zhao,Kai Meng,Cong Luo,Yang Fu,Huiping Chu,Jialin Liu,Yuning Jiang,Bo Zheng
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:internet services, rapid evolution, evolution of internet, traffic allocation, traffic
备注: accepted by SIGIR 2026
点击查看摘要
Abstract:With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan's contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed "Effective Completion Score" serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.
17. 【2605.26400】Plans for Evaluating Structured Generative Search Summaries
链接:https://arxiv.org/abs/2605.26400
作者:Tetsuya Sakai,Jina Lee,Hanpei Fang,Young-In Song
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:web search results, atop organic web, generative search summaries, organic web search, structured generative search
备注: 8 pages (including 2 pages for references)
点击查看摘要
Abstract:We propose a framework for evaluating structured generative search summaries that are placed atop organic web search results. A structured summary, generated by a large language model, typically consists of an overview, several sections with section titles, and a list of source documents that are cited within the summary. We then describe our plans for implementing and evaluating the framework.
18. 【2605.26385】Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking
链接:https://arxiv.org/abs/2605.26385
作者:Haruka Kiyohara,Mihaela Curmei,Ariel Evnine,Shankar Kalyanaraman,Israel Nir,Ana-Roxana Pop,Nitzan Razin,Sarah Dean,Thorsten Joachims,Udi Weinsberg
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
关键词:systems typically employ, Large-scale search, early-stage ranker, late-stage ranker, candidate set
备注: ICML2026
点击查看摘要
Abstract:Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
计算机视觉
1. 【2605.27372】G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing
链接:https://arxiv.org/abs/2605.27372
作者:Bharath Raj Nagoor Kani,Noah Snavely
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:VGGT predict pixel-aligned, Modern feed-forward, methods like VGGT, VGGT predict, Modern
备注: Project Page: [this https URL](https://g3t-paper.github.io/)
点击查看摘要
Abstract:Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
2. 【2605.27367】SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
链接:https://arxiv.org/abs/2605.27367
作者:Haosong Peng,Hao Li,Jiaqi Chen,Yuhao Pan,Runmao Yao,Yalun Dai,Fushuo Huo,Fangzhou Hong,Zhaoxi Chen,Haozhao Wang,Dingwen Zhang,Ziwei Liu,Wenchao Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:specific hardware constraints, varying input densities, demonstrated impressive performance, hardware constraints, critical question remains
备注: Project Page: [this https URL](https://ropedia.github.io/SpatialBench/)
点击查看摘要
Abstract:While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on specific domains for which they were specifically designed or trained. Such evaluations are intrinsically limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.
3. 【2605.27365】LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
链接:https://arxiv.org/abs/2605.27365
作者:Shihao Wang,Shilong Liu,Yuanguo Kuang,Xinyu Wei,Yangzhou Liu,Zhiqi Li,Yunze Man,Guo Chen,Andrew Tao,Guilin Liu,Jan Kautz,Lei Zhang,Zhiding Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:decoded largely independently, coordinate-token generation problem, Vision-language models, Parallel Box Decoding, commonly formulate visual
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
4. 【2605.27351】Feedforward 3D Editing Learns from Semantic-Part Transformation
链接:https://arxiv.org/abs/2605.27351
作者:Jiawei Weng,Saining Zhang,Zhenxin Diao,Peishuo Li,Henghaofan Zhang,Junhao Chen,Hao Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:content creation, editing, fundamental capability, edit, feedforward
备注: 30 pages, 22 figures. Project Page: [this https URL](https://dennis-jwweng.github.io/pxform/)
点击查看摘要
Abstract:3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D AI generation remains dominated by training-free editing pipelines. A central challenge of feedforward 3D editing lies in the lack of high-quality paired supervision. Editable 3D assets require simultaneous preservation of geometry, multi-view consistency, structural coherence, and localized edit controllability. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask during inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.
5. 【2605.27348】When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
链接:https://arxiv.org/abs/2605.27348
作者:Kim Jihyeon,Sohee Kim,Soosan Lee,Souhwan Jung,James Matthew Rehg,Hyesong Choi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent generative models, photometrically authentic content, Recent generative, Social Gaze Consistency, pixel fingerprints
备注: 23 pages, 2 figures, 17 tables
点击查看摘要
Abstract:Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 - 71.5) and +1.3 pp on the COCOAI Person subset (83.0 - 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
6. 【2605.27343】owards Controllable Image Generation through Representation-Conditioned Diffusion Models
链接:https://arxiv.org/abs/2605.27343
作者:Nithesh Chandher Karthikeyan,Jonas Unger,Gabriel Eilertsen
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:produce specific outputs, specific outputs remains, remains a challenge, emerged as powerful, powerful tools
备注:
点击查看摘要
Abstract:Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text prompts or semantic maps, which require extensively annotated datasets. In this preliminary work, we explore diffusion models conditioned on representations from a pre-trained self-supervised model. The self-conditioning mechanism not only improves the quality of unconditional image generation, but also provides a representation space that can be used to control the generation. We explore this conditioning space by identifying directions of variations, and demonstrate promising properties in terms of smoothness and disentanglement.
7. 【2605.27336】PARE: Pruning and Adaptive Routing for Efficient Video Generation
链接:https://arxiv.org/abs/2605.27336
作者:Yutong Wang,Yunke Wang,Tianfan Xue,Yu Qiao,Yaohui Wang,Xinyuan Chen,Chang Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video Diffusion Transformers, Diffusion Transformers, generate high-quality videos, demand substantial compute, substantial compute due
备注:
点击查看摘要
Abstract:Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.
8. 【2605.27332】EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
链接:https://arxiv.org/abs/2605.27332
作者:Zhifei Dou,Shabnam Hassani,Ou Wei
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:static images, remain embedded, embedded as static, Vision Language Models, percentage points
备注: 10 pages
点击查看摘要
Abstract:Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in the conversion of these flowcharts into machine-readable models for RE activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow that augments a VLM's original input with a deterministically extracted Canny edge map-acting as a structural prior-to improve flowchart-to-Mermaid conversion, without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.
Comments:
10 pages
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2605.27332 [cs.SE]
(or
arXiv:2605.27332v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2605.27332
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
9. 【2605.27318】Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning
链接:https://arxiv.org/abs/2605.27318
作者:Xianqiang Gao,Qizhi Chen,Delin Qu,Haoming Song,Zhigang Wang,Bin Zhao,Dong Wang,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requires accumulating viewpoint-dependent, Video spatial reasoning, reasoning requires accumulating, accumulating viewpoint-dependent evidence, spatial reasoning requires
备注:
点击查看摘要
Abstract:Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose \textbf{\ours}, a question-guided geometric memory framework for video spatial reasoning. \ours injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that \ours achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
10. 【2605.27311】Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
链接:https://arxiv.org/abs/2605.27311
作者:Yifan Jiang,Dae Yon Hwang,Jesse C. Cresswell,Freda Shi
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:visual reasoning, benchmarks aim, background knowledge, aim to pose, pose questions
备注:
点击查看摘要
Abstract:Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
11. 【2605.27310】How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning
链接:https://arxiv.org/abs/2605.27310
作者:Qian Yang,Ankur Sikarwar,Huy Le,Le Zhang,Zhuan Shi,Perouz Taslakian,Aishwarya Agrawal
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Cross-view spatial reasoning, spatial reasoning remains, fine-grained geometry needed, Cross-view spatial, visual thinking
备注: Preprint
点击查看摘要
Abstract:Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
12. 【2605.27304】PlayClass: Automated Play Behaviour Classification in Poultry
链接:https://arxiv.org/abs/2605.27304
作者:Prince Ravi Leow(1),Neil Scheidwasser(1 and 3),Rebecca Oscarsson(2),Per Jensen(2),Samir Bhatt(1 and 3),David Alejandro Duchêne(1) ((1) Section for Health Data Science amp; AI, University of Copenhagen, (2) AVIAN Behaviour Genomics and Physiology Group, Linköping University (3) Department of Infectious Disease Epidemiology, Imperial College London)
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:leaving positive welfare, targeted negative indicators, largely targeted negative, negative indicators, leaving positive
备注: Accepted at CV4Animals Workshop @ CVPR 2026
点击查看摘要
Abstract:Automated monitoring of animal welfare has largely targeted negative indicators, leaving positive welfare behaviours such as play underexplored. To address this gap, we present PlayClass, a pipeline for play-behaviour classification in poultry from top-down pen video. The pipeline leverages long-duration tracking with SAM 3 via YOLO-guided chunk boundaries to minimise identity errors in point-based prompting, and frozen embeddings from image and video foundation models for play action classification. Although handcrafted motion features from tracked masks alone achieved competitive accuracy, V-JEPA 2.1 consistently outperformed all other backbones across model scales, reaching 77.0 macro-averaged F$_1$ when combined with handcrafted features. Despite this result, the dataset remains challenging due to play sub-types sharing similar kinematic profiles with non-play and inter-bird occlusion. Overall, our work provides encouraging evidence towards automated frameworks for play behaviour classification in poultry.
13. 【2605.27295】Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
链接:https://arxiv.org/abs/2605.27295
作者:Madhuri Shanbhogue,Zhe Li,Shanfeng Zhang,Gustavo Hernández Ábrego,Shih-Cheng Huang,Aashi Jain,Daniel Salz,Sonam Goenka,Chaitra Hegde,Ji Ma,Feiyang Chen,Jiaxing Wu,Tanmaya Dabral,Babak Samari,Kevin Poulet,Daniel Cer,Kaifeng Chen,Paul Suganathan,Hui Hui,Jovan Andonov,Philippe Schlattner,Jay Han,Iftekhar Naim,Wing Lowe,Vladimir Pchelin,Albert Yang,Yi-Ting Chen,Zhongli Ding,Grace Zhang,Georg Heigold,Yichang Chen,Antoine Reveillon,Brendan Mccloskey,Wenlei Zhou,Dahun Kim,Rui Meng,Emma Wang,Jack Zheng,Halley Fede,Zhen Yang,Keegan Mosley,Brian Potetz,Sahil Dua,Henrique Schechter Vera,Shen Gao,Hesen Zhang,Andreas Hess,Hengxuan Ying,Alberto Montes,Karan Gill,Min Choi,Sebastian Russo,Anja Hauth,Jinhyuk Lee,Michael Boratko,Megan Barnes,Vikram Rao,Claudiu Musat,Cyril Allauzen,Ehsan Variani,Shankar Kumar,Tom Bagby,Junyi Jiao,Yang Gu,Tengxin Li,Ayush Agrawal,Roberto Santana,Dev Nath,Stephen Karukas,Shuoxuan Han,Lucia Loher,Alice Twu,Nidhi Vyas,Siddharth Bhai,Frank Palma Gomez,Wangyuan Zhang,Chaoren Liu,Jizheng Yang,Steve Qiu,Shijie Zhang,Sujay Kulkarni,Sascha Rothe,Sean Nakamoto,Raphael Hoffmann,Zach Gleicher,Yunhsuan Sung,Qin Yin,Tom Duerig,Mojtaba Seyedhosseini
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:introduce Gemini Embedding, native multimodal embedding, Gemini Embedding, introduce Gemini, make Gemini Embedding
备注:
点击查看摘要
Abstract:We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields - from astronomy and bioscience to fine arts and the culinary arts - establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
14. 【2605.27287】A Dynamic Programming Framework for Discovering Count and Values of Multilevel Image Thresholding
链接:https://arxiv.org/abs/2605.27287
作者:Eslam Hegazy,Mohamed Gabr
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multilevel Image thresholding, vision applications nowadays, computer vision applications, important preprocessing algorithm, Multilevel Image
备注:
点击查看摘要
Abstract:Multilevel Image thresholding is an important preprocessing algorithm in computer vision applications nowadays. Since most common thresholding methods take the desired count of thresholds as input by the user, thresholding methods that automatically determines a suitable count of thresholds from the input image itself are advantageous. In this article, a novel thresholding method based on a dynamic programming algorithm and a modification of Minimum Error Thresholding (MET) criterion is thoroughly presented. An empirical statistical study is performed to pinpoint why this proposed method is superior. Moreover, an extended comparison between this proposed method and other state-of-the-art methods is performed on a comprehensive set of natural, satellite and medical test images. The numerical results show that the proposed MET-DP method takes much less time than traditional dynamic programming thresholding methods when the number of thresholds is high. The proposed method can detect a suitable count of thresholds for most of tested images of different types. However, traditional methods that take the count of thresholds as input produce thresholded images of higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) values than MET-DP. Source code can be found on this https URL
15. 【2605.27243】Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
链接:https://arxiv.org/abs/2605.27243
作者:Aaron Branson Cigres Li,Zhaowei Wang,Yu Zhao,Yiming Du,Haobo Li,Xiyu Ren,Ginny Wong,Simon See,Lishu Luo,Haodong Duan,Pasquale Minervini,Yangqiu Song
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long-horizon agent trajectories, Large vision-language models, vision-language models increasingly, models increasingly rely, locate relevant evidence
备注: Work in Progress
点击查看摘要
Abstract:Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
16. 【2605.27235】MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale
链接:https://arxiv.org/abs/2605.27235
作者:Zhicong Tang,Zhao Zhang,Jingye Chen,Mohan Zhou,Yifan Pu,Yuchi Liu,Yalong Bai,Ethan Smith,Yuhui Yuan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generated visual content, Layered image generation, Layered image, visual content, analogous to word-level
备注: CVPR 2026
点击查看摘要
Abstract:Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100\times faster inference and reducing activation GPU memory consumption by 50-90\% during image-to-layer inference.
17. 【2605.27203】Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis
链接:https://arxiv.org/abs/2605.27203
作者:Mannat Khurana,Sanyam Jain,Rishav Agarwal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:plot Bézier points, manually select presets, configure timing properties, elevates digital documents, plot Bézier
备注: 5 pages, 6 figures
点击查看摘要
Abstract:Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers to manually select presets, plot Bézier points, and configure timing properties. We introduce Generative Animations, a system that transforms natural language prompts into production-ready animations. By chaining Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, our pipeline automatically generates motion paths that respect scene geometry, handle depth-based occlusions, and honor 3D perspective transforms. We demonstrate the system through three use cases: contour-following trajectories, orbital animations with z-order awareness, and perspective-aligned motion on transformed objects.
18. 【2605.27194】Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation
链接:https://arxiv.org/abs/2605.27194
作者:Ning Wu,Rui Liu,Xinkun Lin,Weixing Chen,Jinxi Xiang,Tao Wei,Lina Yao,Mingjie Li
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Distilling demonstration effects, hidden-space interventions offers, Distilling demonstration, full finetuning, demonstration effects
备注: Preprint. 20 pages, 6 figures
点击查看摘要
Abstract:Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset--backbone settings, while remaining competitive on coarse label-level CheXbert F1.
19. 【2605.27178】FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation
链接:https://arxiv.org/abs/2605.27178
作者:Zihui Zhang,Zhixuan Sun,Yafei Yang,Jinxi Li,Jiahao Chen,Bo Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:complex scene point, scene point clouds, scene-level human annotations, annotations during training, address the challenging
备注: ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: [this https URL](https://github.com/vLAR-group/FoundObj)
点击查看摘要
Abstract:We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.
20. 【2605.27158】Model discovery for dynamical systems with complex-valued product units
链接:https://arxiv.org/abs/2605.27158
作者:Martin Brückmann,Babette Dellen,Uwe Jaekel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:future states, deeper insight, structure than mere, Discovering the governing, Discovering
备注: 16 pages, 8 figures
点击查看摘要
Abstract:Discovering the governing equations of a dynamical system from observed trajectories provides deeper insight into its structure than mere prediction of future states. We present a data-driven approach to model discovery based on complex-valued product-unit networks, in which each unit represents a complex monomial and the network output is a sparse linear combination of such monomials. In contrast to established library-based methods such as SINDy, our approach does not require a predefined set of candidate functions: the relevant monomials, including those with fractional or negative exponents, are learned directly from data. Across four chaotic benchmark systems (Lorenz63, Lorenz84, the Four-Wing attractor, and a fractional variant of Lorenz63), we recover the exact governing equations in 90% of trials for the first three systems, and in 70-90% of trials for the fractional case, using at least 3000 training points. Applied to real-world human-gait accelerometer signals, the model produced stable trajectories with bounded prediction errors, corresponding to an RMSE of approximately 12-14% of the signal amplitude range over a test horizon three times longer than the training interval, demonstrating its potential for high-dimensional systems in which analytic equations are unavailable.
21. 【2605.27155】Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection
链接:https://arxiv.org/abs/2605.27155
作者:Nico Steckhan,Krutarth Prajapati,Weija Shao,Silvia Vock
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Testing object detectors, safety-critical domains requires, domains requires semantically, requires semantically meaningful, Testing object
备注:
点击查看摘要
Abstract:Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate \textsc{SemProbe} on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.
22. 【2605.27154】ouch-R1: Reinforcing Touch Reasoning in MLLMs
链接:https://arxiv.org/abs/2605.27154
作者:Yingxin Lai,Yafei Zhou,Fucai Zhu,Siyu Zhu,Weihao Yuan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains largely underexplored, rule-based reinforcement learning, recently catalyzed explicit, catalyzed explicit reasoning, reasoning remains largely
备注: Our code and data will be made public on the [this https URL](https://laiyingxin2.github.io/Projects)
点击查看摘要
Abstract:While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4\% and GPT-4o by 24.7\% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
23. 【2605.27146】Chaos-SSL: An Attention-Based Self-Supervised Learning Framework with Chaotic Transformation for Medical Image Classification
链接:https://arxiv.org/abs/2605.27146
作者:Joao Batista Florindo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical image analysis, reliance on large, powerful paradigm, paradigm to mitigate, mitigate the reliance
备注:
点击查看摘要
Abstract:Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create ``harder'' and more semantically-rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state-of-the-art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
24. 【2605.27144】Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification
链接:https://arxiv.org/abs/2605.27144
作者:Pedro Henrique da Costa Avelar,Anderson R. Tavares,Luís C. Lamb
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:irregular image representations, processing irregular image, traditionally leveraged graph, graph neural networks, neural networks
备注:
点击查看摘要
Abstract:Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth $16\times16$ superpixels.
25. 【2605.27136】Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation
链接:https://arxiv.org/abs/2605.27136
作者:Joseph Hoche,David Brellmann,Gianni Franchi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision Language, Large Vision, Vision Language Models, challenge in Large, Vision Language
备注:
点击查看摘要
Abstract:Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.
26. 【2605.27135】Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?
链接:https://arxiv.org/abs/2605.27135
作者:Enoal Gesny,Eva Giboulot
类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:identifying AI-generated images, generative models, diffusion models, extremely low false-alarm, rapid proliferation
备注:
点击查看摘要
Abstract:With the rapid proliferation of generative models, such as diffusion models, digital watermarking has emerged as a crucial solution for identifying AI-generated images. Modern post-hoc watermarking schemes use neural networks to achieve an extremely low false-alarm rate while remaining robust to common image transformations. However, there is a lack of comparison between these modern methods and classic ones, particularly in real-world scenarios where robustness and security take precedence over achieving an extremely low false-alarm probability. In this paper, we propose a fair comparison of robustness and security between modern and classic post-hoc watermarking across various types of classic augmentations and recent sophisticated attacks. Our experiments show that, in a realistic scenario, classic watermarking outperforms modern techniques in terms of security while maintaining robustness.
27. 【2605.27132】Image Thresholding: Understanding Bias of Evaluation Metrics towards Specific Evaluation Functions
链接:https://arxiv.org/abs/2605.27132
作者:Eslam Hegazy,Mohamed Gabr
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Structural Similarity Index, Multilevel image thresholding, Multilevel image, remote sensing, applications ranging
备注: Submitted to ICPR 2026 ( [this https URL](https://icpr2026.org) )
点击查看摘要
Abstract:Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu's between-class variance and Kapur's entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu's criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur's entropy demonstrates weaker and more variable correlation. Otsu outperforms Kapur in correlation with PSNR for all images and with SSIM for over 91%. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. Source code of this paper can be found at this https URL
28. 【2605.27129】YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting
链接:https://arxiv.org/abs/2605.27129
作者:Rajmeet Singh,Manveen Kaur,Shahpour Alirezaee,Irfan Hussain
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:automated harvesting requires, harvesting requires accurate, requires accurate detection, greenhouse tomato production, precise picking-point localization
备注:
点击查看摘要
Abstract:In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) with the highest precision (95.2%) among all evaluated architectures using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@50), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@50:95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm superior accuracy-efficiency: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters and integrated center-point localization for robotic end-effector guidance.
29. 【2605.27128】PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance
链接:https://arxiv.org/abs/2605.27128
作者:Yujing Zhou,Prashant Shekhar,Thomas Yang,Yongxin Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:offer an excellent, excellent balance, balance between accuracy, segmentation models offer, Real-time semantic segmentation
备注:
点击查看摘要
Abstract:Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. In this regard, the standard fine-tuning methods in deep learning often fail due to catastrophic forgetting, where the model learns new information but forgets previously trained and learned classes. Contributing to this crucial domain, the current paper proposes a novel continual learning framework tailored for PIDNet, which is a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT(Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.
30. 【2605.27116】COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection
链接:https://arxiv.org/abs/2605.27116
作者:Yupeng Zhang,Ruize Han,Yuzhong Feng,Zixin Ren,Yuntong Tian,Liang Wan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:made significant progress, Open-vocabulary object detection, object detection, significant progress, enabling detectors
备注:
点击查看摘要
Abstract:Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional this http URL-114 and the code will be released.
31. 【2605.27102】JLT: Clean-Latent Prediction in Latent Diffusion Transformers
链接:https://arxiv.org/abs/2605.27102
作者:Funing Fu,Tenghui Wang,Junyong Cen,Qichao Zhu,Guanyu Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:ambient noised quantity, exploit low-dimensional structure, Flow matching, noised quantity, matching with clean-data
备注:
点击查看摘要
Abstract:Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
32. 【2605.27101】Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models
链接:https://arxiv.org/abs/2605.27101
作者:Oscar Chew,Serhii Honcharenko,Qian-Hui Chen,Patricia Lu,Dishant Zaveri,Khoa D. Doan,Kuan-Hao Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Video Large Language, Large Language Models, Large Language, reliably linking subjects, Video Large
备注:
点击查看摘要
Abstract:A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.
33. 【2605.27080】Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning
链接:https://arxiv.org/abs/2605.27080
作者:Qida Tan,Hongyu Yang,Wenchao Du
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:insufficient dataset diversity, Appearance-based gaze estimation, poor generalization due, dataset diversity, Appearance-based gaze
备注: ICML2026
点击查看摘要
Abstract:Appearance-based gaze estimation always suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate the domain shifts. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotations. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20\%, 10\%, and even 5\% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at \href{this https URL}{this https URL}.
34. 【2605.27075】SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration
链接:https://arxiv.org/abs/2605.27075
作者:Yuhang Zhang,Junxiang Qiu,Huixia Ben,Zhenhua Tang,Shuo Wang,Yanbin Hao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:costly Transformer evaluations, achieve strong visual, strong visual quality, iterative denoising process, Diffusion Transformers
备注:
点击查看摘要
Abstract:Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose \textbf{SoftCap}, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.
35. 【2605.27074】IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams
链接:https://arxiv.org/abs/2605.27074
作者:Jinzhao Li,Yinuo Chen,Wenxuan Song,Yijia Lei,Yichi Zhang,Honglei Yan,Panwang Pan,Miao Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent multimodal large, large language models, achieve strong performance, continuous visual inputs, multimodal large language
备注:
点击查看摘要
Abstract:Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.
36. 【2605.27067】BEAT: Rhythm-Elastic Alignment for Agentic Music-guided Movie Trailer Generation
链接:https://arxiv.org/abs/2605.27067
作者:Yutong Wang,Yunke Wang,Xinyuan Chen,Chang Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Automatic movie trailer, Automatic movie, movie trailer generation, generation must select, full-length film
备注:
点击查看摘要
Abstract:Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.
37. 【2605.27032】SCKAN: Structural Consensus-based KAN Prototype Learning for Semi-Supervised Pancreas Segmentation
链接:https://arxiv.org/abs/2605.27032
作者:Yuqi Liu,Yufei Chen,Wei Fu,Xiaodong Yue,Shuo Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:early cancer diagnosis, annotation scarcity necessitates, scarcity necessitates Semi-Supervised, necessitates Semi-Supervised Learning, cancer diagnosis
备注: 10.5 pages, 5 figures, Medical Image Computing and Computer Assisted Intervention 2026
点击查看摘要
Abstract:Accurate pancreas segmentation is critical for early cancer diagnosis, where annotation scarcity necessitates Semi-Supervised Learning (SSL). However, due to significant inter-sample morphological variability, existing SSL methods face severe generalizability limitations under sparse supervision, leading to the Supervision Bias problem. To address this, we propose Structural Consensus-based KAN Prototype Learning (SCKAN), which constructs the first cross-sample structural consensus learning with Kolmogorov-Arnold Networks (KANs), to achieve more generalizable and accurate segmentation. Specifically, SCKAN contains two key designs: Structure-constrained Prototype Consistency Learning (SPCL), which prompts unbiased structural representation by enforcing cross-sample consistency via prototype-level contrastive optimization, and Consensus-based Kolmogorov-Arnold Fusion (CKaF), which reduces morphology-specific bias by aggregating stable consensus and filtering sample-wise noise via KAN's adaptive B-spline nonlinearity. Extensive experiments on two public pancreas datasets demonstrate the effectiveness of SCKAN. Code is at this https URL.
38. 【2605.27024】NeR-SC: Adapting Neural Video Representation to Screen Content
链接:https://arxiv.org/abs/2605.27024
作者:Ruohan Shi,Jiaoyan Zhao,Haogang Feng
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:achieving competitive performance, Implicit neural representations, recent methods achieving, methods achieving competitive, Implicit neural
备注: Submitted to PRMVAI 2026
点击查看摘要
Abstract:Implicit neural representations have emerged as a promising paradigm for video compression, with recent methods achieving competitive performance on natural video. However, screen content video -- common in remote desktop, online education, and cloud gaming -- exhibits distinct statistics: sharp edges, limited color palettes, and strong temporal redundancy. Existing neural representation methods, designed for natural scenes, lack mechanisms to exploit these properties, leaving substantial room for improvement. In this paper, we propose NeR-SC, a neural representation framework tailored for screen content video. Building on the SNeRV backbone, NeR-SC introduces three screen-content-specific modules: (i) a learnable color palette that models the discrete color structure of screen content by restricting the low-frequency sub-band to a learned color set; (ii) a multi-gate dense fusion module that replaces sequential feature fusion with dense, attention-gated cross-stage interaction; and (iii) an embedding-level frame skip strategy that bypasses redundant decoder invocations for static frames, with zero training overhead. Experiments on DSCVC and VCD show that NeR-SC achieves 40.32~dB and 41.73~dB average PSNR, outperforming representative neural video representation methods and, at low bitrates, surpassing H.264 and H.265. The skip strategy enables real-time decoding with no loss in quality.
39. 【2605.27020】Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
链接:https://arxiv.org/abs/2605.27020
作者:Tao Qi,Huili Wang,Yuanhong Huang,Wendan Wang,Lianchao Zhao,Jinrui Wang,Zichen Qin,Shangguang Wang,Yongfeng Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:privacy infringements involving, infringements involving human-created, involving human-created data, rapid advancement, advancement of diffusion-based
备注: 13 pages, 9 figures; CVPR 2026 camera-ready
点击查看摘要
Abstract:The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.
40. 【2605.27003】mestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V
链接:https://arxiv.org/abs/2605.27003
作者:Junhao Wu,Dezhong Yao,Hai Jin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:diffusion Transformers offers, Transformers offers substantial, multi-step denoising trajectory, video diffusion Transformers, sparse large-magnitude activation
备注:
点击查看摘要
Abstract:W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3\% relative to the BF16 baseline while incurring only a 0.9\% drop in VBench average score and a 2.3\% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
41. 【2605.26994】ChartAct: A Benchmark for Dynamic Chart Understanding
链接:https://arxiv.org/abs/2605.26994
作者:Muye Huang,Wu Lin,Lingling Zhang,Hang Yan,Zhiyuan Wang,Yumeng Fu,Zesheng Yang,Jun Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:present complex data, Dynamic chart understanding, chart understanding, Dynamic chart, chart
备注:
点击查看摘要
Abstract:Charts are widely used to present complex data for analysis and decision making. Existing chart understanding benchmarks mainly focus on static charts, but real-world charts are often dynamic and interactive. Key information may only appear after actions such as hovering, clicking, zooming, or dragging. Dynamic chart understanding therefore requires models to identify visible content, choose proper interactions, and reason over changing chart states. To evaluate this ability, we propose ChartAct, an interactive benchmark for dynamic chart understanding. ChartAct collects and filters 673 dynamic charts from 8 real chart websites, covers 7 common chart types, and constructs 1,440 high-quality question-answer samples. Each sample is instantiated in two environments, Dynamic Chart and Dashboard Chart, to evaluate dynamic chart understanding under different contexts. Based on ChartAct, we systematically evaluate 11 advanced multimodal models and GUI agents. Experimental results show that existing models still have clear limitations in dynamic chart understanding. The strongest model, Claude-Opus-4.7, achieves an average success rate of 84.5\%, while most models remain below 60\%. We also conduct detailed failure attribution and case analysis. ChartAct provides a new benchmark for studying chart understanding in real interactive environments. Codes at this https URL
42. 【2605.26992】On the Robustness of Machine Unlearning for Vision-Language Models
链接:https://arxiv.org/abs/2605.26992
作者:Yujie Lin,Kaidi Jia,Jiayao Ma,Chengyi Yang,Jinsong Su
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:motivating growing interest, Vision-language models, memorize undesirable information, VLM unlearning, training data
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at this https URL.
43. 【2605.26967】CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning
链接:https://arxiv.org/abs/2605.26967
作者:Zihan Lin,Songhe Deng,Shuwei He,Danxiang Zhu,Dan Zhang,Yishu Lei,Xianlong Luo,Shikun Feng,Rui Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing video captioning, introduce heavy redundancy, Existing video, heavy redundancy, captioning methods struggle
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture temporally only localized actions, motions and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting keyframe-residual captioning a way for high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
44. 【2605.26949】DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models
链接:https://arxiv.org/abs/2605.26949
作者:Furkan Mert Algan,Eckehard Steinbach
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:noisy real-world observations, partial scans remains, scans remains challenging, inferring missing structure, real-world observations
备注:
点击查看摘要
Abstract:3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generative based completion methods while using fewer parameters, requiring lower memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.
45. 【2605.26944】Object Pose and Shape Estimation for Grasping: Does it Work?
链接:https://arxiv.org/abs/2605.26944
作者:Pavan Karke,Kushal Shah,Gaurav Singh,Md Faizal Karim,K Madhava Krishna,Rajat Talak
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:pose and shape, shape estimation methods, shape estimation, object pose, shape
备注: 9 pages, 8 figures
点击查看摘要
Abstract:The problem of object pose and shape estimation has seen key advancements lately. Encoder-decoder (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask the question: Are the object pose and shape estimation methods mature enough, such that when used with antipodal grasp sampling, can outperform the end-to-end grasp synthesis methods? We explore this question in detail by scoping our study to parallel jaw grippers, 7-DoF grasps, and single-view RGB(-D) image as input. We implement and compare a state-of-the-art, end-to-end grasp synthesis method and three modular methods, which first estimate the object pose and shape for all objects in the scene, and generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods are able to synthesize plenty of grasps, even for small objects, where the end-to-end methods fail. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and suffers partial degradation in cluttered scenes - a limitation of the existing pose and shape estimation methods. We also analyze the failure modes and run-times for the three modular methods, which use two different ways of object pose and shape estimation: one based on an encoder-decoder model, while another a diffusion model. Finally, we demonstrate that the single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just single-view RGB-D image as input. We notice comparable performance to the state-of-the-art LERF-TOGO baseline.
46. 【2605.26933】Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking
链接:https://arxiv.org/abs/2605.26933
作者:Zhengbo Zhang,Zhigang Tu,Junsong Yuan,De Wen Soh,Bo Du
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diffusion models, Unsupervised visual object, ground-truth annotations, prompt, requires following arbitrary
备注: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
点击查看摘要
Abstract:Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach the unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt the diffusion models, which are originally developed for image generation, to the tracking task, we reinterpret the models as a bridge between text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the models, they highlight the regions of the image that are semantically aligned with the text in the cross-attention maps. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method Diff-Tracking is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. We evaluate our approach on six challenging tracking datasets demonstrate the effectiveness of our approach.
47. 【2605.26921】Revealing the core dimensions underlying representations in brains, behavior and AI
链接:https://arxiv.org/abs/2605.26921
作者:Florian P. Mahner,Ka Chun Lam,Francisco Pereira,Martin N. Hebart
类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
关键词:including neuroscience, widespread across fields, artificial intelligence, Similarity-Based Representation Factorization, representations
备注:
点击查看摘要
Abstract:The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and leveraging the dimensions underlying representations.
48. 【2605.26914】I2PRef: Image-Driven Point Completion with Iterative Refinement
链接:https://arxiv.org/abs/2605.26914
作者:Azhar Hussian,Marina Ritthaler,André Kaup,Vasileios Belagiannis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:primary geometric source, image-conditioned point cloud, secondary guide, present an image-conditioned, approach that treats
备注:
点击查看摘要
Abstract:We present an image-conditioned point cloud completion approach that treats images as the primary geometric source rather than a secondary guide. To this end, we introduce an Image-to-Point (I2P) module that can reconstruct complete point clouds directly from a single RGB image, with no need for 3D inputs. Additionally, we introduce a transformer-based Point-to-Point (P2P) refinement module that uses self- and cross-attention between point tokens and image features to iteratively refine the coarse I2P output. The I2P module enables the image encoder to learn rich geometric representations, while the P2P module progressively recovers fine-grained details. Unlike existing multimodal methods that rely on auxiliary losses or fusion modules, our explicit I2P task provides a strong, geometry-aware prior based on images alone. Extensive experiments on ShapeNet-ViPC demonstrate state-of-the-art completion performance with a 12.3% relative Chamfer Distance improvement over prior methods. Code is available at: this https URL
49. 【2605.26894】SIMPC: Learning Self-Induced Mirror-Point Consistency for Unsupervised Point Cloud Denoising
链接:https://arxiv.org/abs/2605.26894
作者:Chengwei Zhang,Xueyi Zhang,Tao Jiang,Xinhao Xu,Wenjie Li,Fubo Zhang,Longyong Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:noise directly perturbs, directly perturbs point, perturbs point coordinates, location and geometry, directly perturbs
备注: Accepted by ICML 2026. 17 pages, 8 figures, 8 tables
点击查看摘要
Abstract:In point clouds, noise directly perturbs point coordinates that encode both spatial location and geometry, making one-to-one correspondence construction more challenging than in images. Existing methods impose statistical mappings across noisy variants via noise or optimal transport, but suffer from correspondence ambiguity. In this work, we propose Self-Induced Mirror-Point Consistency (SIMPC) to learn deterministic correspondences between points and the underlying surface in an unsupervised manner. For each noisy point, SIMPC generates a mirror-point on the opposite side of the underlying surface, guided by geometric priors during the denoising process. By encouraging consistency between the denoising targets of the original point and its mirror counterpart, SIMPC effectively localizes the position of underlying surface. Extensive experiments on synthetic and real-world datasets demonstrate that SIMPC significantly outperforms state-of-the-art unsupervised methods and surpasses several strong supervised counterparts.
50. 【2605.26884】Small Object Detection in Industrial Recycling: A New Dataset and YOLO Performance Evaluation
链接:https://arxiv.org/abs/2605.26884
作者:Oussama Messai,Abbass Zein-Eddine,Abdelouahid Bentamou,Mickael Picq,Nicolas Duquesne,Stéphane Puydarrieux,Yann Gavet
类目:Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
关键词:computer vision, address the problem, problem of detecting, major challenge, detecting small
备注:
点击查看摘要
Abstract:In this paper, we address the problem of detecting small, dense, and overlapping objects, a major challenge in computer vision. Our focus is on reviewing proposed methods based on deep learning supervised approaches. We provide a detailed comparison of these systems on a new dataset of more than 10k images and 120k instances, highlighting their performance, accuracy, and computational efficiency in the industrial recycling process use case. Through this comparative analysis, we identify the most reliable systems currently available and the specific challenges they are designed to tackle. Furthermore, we explore the benefits of data augmentation and synthetic images. Based on our analysis, we also propose potential future directions and innovative solutions that could enhance the effectiveness of small, dense and overlapped object detection systems. The scope of our investigations encompasses object detection, length measurement, and anomaly detection within the context of the recycling process. The anomaly detection strategy is robust against variations in image resolution and zoom levels, ensuring reliable performance in industrial applications. The repository of the proposed dataset, methods and evaluation codes can be found at: this https URL
51. 【2605.26879】Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
链接:https://arxiv.org/abs/2605.26879
作者:Dingkun Wei,Zehong Shen,Yan Xia,Georgios Pavlakos,Yujun Shen,Xiaowei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Human Motion Recovery, Human motion recovered, dynamically inconsistent, Human motion, overly smooth
备注: 13 pages, 6 figures. Accepted as an Oral presentation and Best Paper Candidate at CVPR 2026. Project page: [this https URL](https://zju3dv.github.io/htd-refine/)
点击查看摘要
Abstract:Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues -- velocity and acceleration -- which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing over-smoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
52. 【2605.26862】RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction
链接:https://arxiv.org/abs/2605.26862
作者:Chenxu Peng,Chenxu Wang,Yimian Dai,Yongxiang Liu,Ming-Ming Cheng,Xiang Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate road segmentation, Accurate road, geospatial applications, aerial imagery, imagery is fundamental
备注:
点击查看摘要
Abstract:Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7M parameters. The code are publicly available at: this https URL
53. 【2605.26861】REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
链接:https://arxiv.org/abs/2605.26861
作者:Yong Li,Furong Jia,Dacheng Yin,Kang Rong,Fengyun Rao,Jing Lyu,Fan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Image geo-localization aims, recognizing visible landmarks, Image geo-localization, visible landmarks, geo-localization aims
备注:
点击查看摘要
Abstract:Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at this https URL.
54. 【2605.26855】Receipt Replay OOD: A Small Benchmark for Screen Replay Detection Under Domain Shift
链接:https://arxiv.org/abs/2605.26855
作者:Alexander Vinogradov
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:presentation attack detection, Public datasets, significantly contributed, contributed to research, research on presentation
备注:
点击查看摘要
Abstract:Public datasets such as DLC-2021, SynID, and KID34K have significantly contributed to research on presentation attack detection for identity documents, including screen replay attacks. However, evaluation of out-of-domain (OOD) robustness remains insufficiently explored, especially under realistic domain shifts. In this work, we introduce Receipt Replay OOD, a small out-of-domain benchmark for screen replay detection. Receipts share several characteristics with identity documents, including planar geometry, curved corners, wear-and-tear artifacts, and text or logo patterns, while avoiding personally identifiable information constraints commonly associated with identity documents. We evaluate document replay detection models under cross-domain conditions and demonstrate the impact of domain shift on generalization performance. The dataset is publicly available.
55. 【2605.26831】OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes
链接:https://arxiv.org/abs/2605.26831
作者:Regina Kurkova,Maxim Popov,Sergey Kolyubin
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:manipulation-relevant corner cases, fixed benchmark datasets, Semantic mapping methods, downstream robotic reasoning, corner cases
备注: Code: [this https URL](https://github.com/be2rlab/OSMa-Bench-v2)
点击查看摘要
Abstract:Semantic mapping methods are increasingly used as intermediate scene representations for downstream robotic reasoning and manipulation, yet their evaluation is still largely tied to fixed benchmark datasets with limited coverage of manipulation-relevant corner cases. In this work, we extend OSMa-Bench toward controllable benchmarking with prompt-generated synthetic indoor scenes. Our pipeline automatically generates scene descriptions, synthesizes corresponding environments with SceneSmith, and adapts the resulting assets into an OSMa-Bench-compatible simulation format. This adaptation requires a nontrivial intermediate layer, including semantic normalization, material and texture repair, shader fallback policies, floor handling, navigation setup, and controlled lighting configuration. A key advantage of the proposed setup is that the original scene-generation prompt is known in advance and can therefore serve as an auxiliary semantic specification of the intended scene. We use this property to extend the VQA component of OSMa-Bench with a prompt-grounded question category. The resulting framework supports targeted stress-testing of semantic scene representations under conditions such as clutter, small objects, partial occlusions, and lighting variation, and makes benchmarking more extensible and better aligned with downstream manipulation requirements. Our code is available at this https URL.
56. 【2605.26830】he Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery
链接:https://arxiv.org/abs/2605.26830
作者:Vasileios Saketos,Ming Xiao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Kalman Filter, Optimized Kalman Filter, Gaussian noise, signal processing, linear dynamics
备注:
点击查看摘要
Abstract:State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman Filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12\% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
57. 【2605.26774】Cesarean Scar Defect Segmentation in Transvaginal Ultrasound Images: a Dataset and Benchmark
链接:https://arxiv.org/abs/2605.26774
作者:Yuan Tian,Yue Li,Wei Xia,Tianyu Xu,Jian Zhang,Liye Shi,Jing Liu,Yang Wang,Ming Liu,Qing Xu,Yixuan Zhang,Maggie M. He,Xiangjian He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Cesarean Scar Defect, Scar Defect, Cesarean Scar, cesarean delivery, prevalent complications
备注:
点击查看摘要
Abstract:Cesarean Scar Defect (CSD) is one of the most prevalent complications following cesarean delivery. Transvaginal ultrasonography is widely used for primary CSD screening. Accurate determination of CSD outline and dimensions is crucial for treatment. However, CSDs are frequently overlooked by sonographers due to small size and irregular morphology, suboptimal image quality, and limited clinical awareness in resource-constrained settings. Despite artificial intelligence advances in medical imaging, no public dataset exists for transvaginal ultrasound CSD segmentation. To address this gap, we present a comprehensive CSD dataset comprising 1,111 images and 16 videos, yielding 501 positive samples with confirmed CSD and precise pixel-level manual annotations. Annotations are performed following standardized clinical guidelines through collaboration between experienced sonographers and trained PhD students. This work provides high-quality benchmark resources for advancing medical image segmentation algorithms and promoting clinical innovation. Ultimately, improved CSD diagnosis and subsequent treatment strategies can enhance the quality of life in women of reproductive age, representing significant value for both medical research and clinical practice.
58. 【2605.26761】Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning
链接:https://arxiv.org/abs/2605.26761
作者:Mingkang Dong,Hongyi Cai,Xiwen Lei,Jie Li,Tao Zhang,Muxin Pu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:adapting vision language, vision language models, making data selection, data selection critical, highly redundant
备注: 15 pages, 6 figures. Mingkang Dong and Hongyi Cai contributed equally to this work. Muxin Pu is the corresponding author
点击查看摘要
Abstract:Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred selector surpasses full data training by 10.6%, confirming that the learned signal generalizes to datasets never seen during selector training. The same selected subsets benefit VLMs at both Qwen2.5-VL-3B and LLaVA-v1.5-7B without per model recomputation, decoupling selection from the target model. These results demonstrate that a single, transferable selector provides an effective and reusable solution for efficient multimodal instruction tuning.
59. 【2605.26744】Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy
链接:https://arxiv.org/abs/2605.26744
作者:Pascal Herrmann,Maarten Bieshaar,Dennis Mack,Robert Herzog,Juergen Gall
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:made tremendous progress, surpassing ground truth, ground truth data, approaches surpassing ground, leading evaluation benchmarks
备注: Accepted to BMVC 2025
点击查看摘要
Abstract:Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches generate motions frequently containing self-intersections, i.e., body parts interpenetrating, which are strong artifacts, severely limiting the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which allows us to calculate a self-intersection loss 98% faster and uses 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of the recent human motion generation methods human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at this https URL .
60. 【2605.26734】CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains
链接:https://arxiv.org/abs/2605.26734
作者:Tomohisa Takeda,Yu-Chieh Lin,Yuji Nozawa,Youyang Ng,Osamu Torii,Yusuke Matsui
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing Multi-Turn Composed, Composed Image Retrieval, Multi-Turn Composed Image, lack dialogue-history consistency, Composed Image
备注:
点击查看摘要
Abstract:Existing Multi-Turn Composed Image Retrieval (MTCIR) datasets lack dialogue-history consistency and are restricted to the fashion domain. To address these limitations, we construct CIRCLED by extending FashionIQ, CIRR, and CIRCO. In CIRCLED, the query at each turn progressively approaches the target image. Data are generated via a CIReVL-based retrieval pipeline and curated with multiple filters on retrieval success, turn length, consistency, and information redundancy to ensure quality. In total, we collect 22,608 multi-turn sessions across nine subsets, substantially exceeding Multi-turn FashionIQ (11,505 sessions) in both scale and generality. We further apply multiple baseline methods and quantitatively assess retrieval accuracy on CIRCLED. Our work provides a practical, high-quality benchmark to facilitate future research on multi-turn CIR. The dataset and code are publicly available at this https URL and this https URL.
61. 【2605.26729】Learning Reference-Guided Exposure Correction with Hybrid Illumination Characteristics
链接:https://arxiv.org/abs/2605.26729
作者:Hao Ren,Zetong Bi,Zhaoliang Wan,Hui Cheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:exposure correction framework, reference-guided exposure correction, correction framework, reference-guided exposure, exposure correction
备注: ICASSP2026
点击查看摘要
Abstract:We present HICNet, a reference-guided exposure correction framework. A lightweight, content-agnostic encoder distills each image into a compact illumination embedding capturing regional brightness, edge contrast, and higher-order luminance moments. The embedding difference between a source and its reference drives a multi-scale modulation network that combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained, illumination-aware spectral gating, producing exposure-matched outputs while faithfully preserving scene details. A cross-batch contrastive loss orders the illumination manifold, bolstering robustness to diverse lighting conditions. Trained without ground truth or intrinsic decomposition, HICNet attains better accuracy on public benchmarks and generalizes well to entirely unseen scenes.
62. 【2605.26725】Joint 2D-3D Segmentation and Association in Street-level Imaging
链接:https://arxiv.org/abs/2605.26725
作者:Amir Melnikov,Masayuki Tanaka,Yusuke Monno,Masatoshi Okutomi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Spatial Digital Twin, Digital Twin, Spatial Digital, Accurate interpretation, creation of Spatial
备注: 15 pages, 6 image figures, 1 in-body table, 1 in-body algorithm, 2 indexes with tables
点击查看摘要
Abstract:Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.
63. 【2605.26712】METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition
链接:https://arxiv.org/abs/2605.26712
作者:Mélodie Boillet,Solène Tarride,Christopher Kermorvant
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Automatic Text Recognition, accurately evaluating Automatic, evaluating Automatic Text, Text Recognition, Automatic Text
备注:
点击查看摘要
Abstract:Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.
64. 【2605.26702】Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling
链接:https://arxiv.org/abs/2605.26702
作者:Pengzhen Chen,Yanwei Liu,Xiaoyan Gu,Antonios Argyriou,Wu Liu,Weiping Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:watermarking of panoramic, panoramic imagery, imagery is fundamentally, fundamentally challenged, Reliable watermarking
备注: ICML 2026
点击查看摘要
Abstract:Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of $SO(3)$, rendering conventional planar representations and augmentation-based robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage $SO(3)$ representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order $SO(3)$ irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of $SO(3)$ invariance for it and demonstrate experimentally its near-perfect robustness to continuous rotations while maintaining high visual fidelity.
65. 【2605.26689】PinPoint: Prompting with Informative Interior Points
链接:https://arxiv.org/abs/2605.26689
作者:Pouya Sadeghi,Shawn He,Pedro Pablo Guerrero Vela,C. Thomas,Alex Wong,Sirisha Rambhatla
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Modern referring image, referring image segmentation, image segmentation pipelines, segmentation pipelines couple, Modern referring
备注:
点击查看摘要
Abstract:Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.
66. 【2605.26682】SteelDS: A High-Resolution Video Dataset of E40 Steel Scrap for Object Detection and Instance Segmentation
链接:https://arxiv.org/abs/2605.26682
作者:Melanie Neubauer,Christian Rauch,Gerald Koinig,Alexia Tischberger-Aldrian,Roland Pomberger,Elmar Rueckert
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:annotated video sequences, annotated video, sequences of shredded, conveyor belt, video sequences
备注:
点击查看摘要
Abstract:This dataset provides high-resolution, annotated video sequences of shredded E40-grade steel and copper scrap on a conveyor belt. Captured in a controlled laboratory environment, the data reflects the industrial post-magnetic sorting stage, where manual intervention is typically required to remove copper contaminants. The dataset comprises 24,297 labeled frames across five subsets, featuring 396 steel and 101 copper objects categorized by size. It supports the development of machine learning models for material classification, object detection, and instance segmentation. Variations in object spacing and density are included to simulate realistic industrial sorting conditions. Ground truth annotations include pixel-wise segmentation masks and material classes. This dataset serves as a benchmark for evaluating automated sorting algorithms aiming to identify copper impurities within complex, heterogeneous steel scrap streams.
67. 【2605.26680】DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
链接:https://arxiv.org/abs/2605.26680
作者:Peng Zhang,Guanghao Zhang,Wanggui He,Longxiang Zhang,Mushui Liu,Yan Xia,Zhenhao Peng,Weilong Dai,Jinlong Liu,Haobing Tang,Le Zhang,Hao Jiang,Pipei Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent video multimodal, video multimodal large, revisit relevant video, relevant video segments, multimodal large language
备注:
点击查看摘要
Abstract:Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at this https URL.
68. 【2605.26676】Memory-Distilled Selection for Noise-Robust Anomaly Detection
链接:https://arxiv.org/abs/2605.26676
作者:Sirojbek Safarov,Jaewoo Park,Yoon Gyo Jung,Kuan-Chuan Peng,Wonchul Kim,Seongdeok Bang,Octavia Camps
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unsupervised defect detection, deploying unsupervised defect, Anomaly detection, curating perfectly clean, sets is impractical
备注: Accepted by ICML2026. The code is available at [this https URL](https://github.com/SirojbekSafarov/MeDS)
点击查看摘要
Abstract:Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16\% image-level AUROC on MVTecAD at a 40\% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.
69. 【2605.26661】Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models
链接:https://arxiv.org/abs/2605.26661
作者:Yuanwei Hu,Bo Peng,Yadan Luo,Zhen Fang,Ling Chen,Jie Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:identifying unexpected inputs, machine learning models, zero-shot OOD detection, OOD detection, unknown classes
备注:
点击查看摘要
Abstract:Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision-language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge the widely adopted text-as-prototype paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically demonstrate that our method achieves a new state of the art across a variety of OOD detection setups.
70. 【2605.26656】DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding
链接:https://arxiv.org/abs/2605.26656
作者:Jianfei Zhao,Feng Zhang,Xin Sun,Chong Feng,Bing Wang,Zhixing Tan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:predict ground-truth answers, large language models, Multimodal large language, typically trained, ground-truth answers
备注: Under Review
点击查看摘要
Abstract:Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevitably rely on auxiliary components such as additional decoders or forward passes, because visual tokens lack readily interpretable labels. This limits their practical applicability. In this work, we propose \textbf{D}irect \textbf{V}ision \textbf{S}upervised \textbf{F}ine-\textbf{T}uning (DV-SFT), which constructs explicit, token-level supervision for visual tokens and trains them through the same next-token prediction objective used for text. Specifically, we exploit the direct vision--text correspondence in OCR-related scenarios and automatically label each visual token with the word in its corresponding image patch. DV-SFT treats the MLLM as a black box, requiring no architectural modifications or additional forward passes. Extensive experiments demonstrate the superiority of direct vision supervision. DV-SFT consistently outperforms standard SFT across three in-domain and four out-of-domain benchmarks. Further analyses show that vision supervision effectively enhances fine-grained visual understanding and achieves higher multimodal alignment efficiency.
71. 【2605.26642】Adaptation-Free Heterogeneous Collaborative Perception with Unseen Agent Configurations
链接:https://arxiv.org/abs/2605.26642
作者:Hyunchul Bae,Heejin Ahn
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:share complementary observations, existing methods assume, methods assume fixed, Collaborative perception improves, object detection
备注: 9 pages main paper, 23 pages including references and appendix, 7 figures
点击查看摘要
Abstract:Collaborative perception improves 3D object detection by enabling agents to share complementary observations, but most existing methods assume fixed or known collaborator encoder configurations, limiting deployment in practice. In this work, we consider an open-world setting in which auxiliary agents with unseen configurations may appear after deployment, such as different LiDAR beam counts or encoder architectures. To address this challenge, we propose ALF, a collaborative perception framework that enables zero-adaptation collaboration with unseen agent configurations by lifting lightweight box-level messages into ego-compatible auxiliary features. ALF converts auxiliary box-level messages into pseudo-BEV maps and synthesizes ego-compatible latent features by combining object-centric cues with scene context from the ego feature. On V2X-Real, under a zero-shot evaluation across 64 case studies, ALF outperforms the strongest prior baseline by 35.91% in relative mAP@0.7 while requiring only 120 bytes per agent per frame (approximately 9.6 Kbps bandwidth at 10 Hz).
72. 【2605.26641】OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation
链接:https://arxiv.org/abs/2605.26641
作者:Yunze Liu,Chi-Hao Wu,Enmin Zhou,Junxiao Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unified multimodal embedding, Unified multimodal, multimodal RAG, multimodal embedding spaces, interface for cross-modal
备注: [this https URL](https://yunzeliu.github.io/OmniRetriever/)
点击查看摘要
Abstract:Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint (T,V,A) embedding whenever all three modalities are available, but standard pairwise InfoNCE objectives leave this signal unused during training. We close this gap with fusion-as-teacher distillation, which treats a stop-gradient copy of the fused embedding as a teacher signal for the single-modal embeddings, paired with a Tuple-InfoNCE term that supervises the fused embedding directly. We instantiate this objective as OmniRetriever-7B. Across six zero-shot retrieval benchmarks, OmniRetriever-7B surpasses the closed-source Gemini Embedding 2 by 13.3-18.0 R@1 on Clotho and SoundDescs, and reaches the contemporary zero-shot specialist band of open video-text encoders on MSR-VTT and MSVD. To stress-test joint representations, we further release OmniRetriever-Bench, a 12-direction AVT retrieval benchmark totaling 3782 triples; on it OmniRetriever-7B attains AVG-all 34.84, improving over Gemini Embedding 2 by 1.72 and over the best prior open-source AVT method by 8.03.
73. 【2605.26636】JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
链接:https://arxiv.org/abs/2605.26636
作者:Dongyun Zou,Zhuoyang Zhang,Junyu Chen,Wenkun He,Qinhe Peng,Hanrong Ye,Yao Lu,Hongxu Yin,Yu Wang,Song Han,Han Cai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:hybrid-architecture Vision Transformer, Post-Training Attention Search, Vision Transformer, achieving substantially higher, substantially higher inference
备注: Accepted to CVPR 2026 Findings
点击查看摘要
Abstract:We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
74. 【2605.26630】Attenuation-Resilient Alternating Optimization for Laparoscopic Liver Landmark Detection
链接:https://arxiv.org/abs/2605.26630
作者:Lanqing Liu,Ruize Cui,Jialun Pei,Diandian Guo,Tiffany Y. So,Pheng-Ann Heng,Jing Qin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:laparoscopic liver surgery, Liver surface landmark, surface landmark detection, fundamental prerequisite, prerequisite for anatomical
备注: This paper has been accepted by MICCAI 2026
点击查看摘要
Abstract:Liver surface landmark detection is a fundamental prerequisite for anatomical guidance in laparoscopic liver surgery. However, it remains unreliable in practice due to two pervasive challenges: illumination attenuation in underexposed regions and the structural mismatch between pixel-wise localization and continuous curvilinear geometry. To address these limitations, we propose A2ONet, an attenuation-resilient alternating optimization network for robust liver landmark detection. To mitigate illumination attenuation, A2ONet embraces an illumination field compensation (IFC) block that adaptively enhances dark regions while preserving structural consistency. Meanwhile, we introduce a lightweight frequency-orientation selective filter (FOSF) to suppress repetitive texture interference and preserve salient curvilinear cues. Building upon these resilient representations, we design an alternating seg-curve optimization (ASCO) decoder that iteratively couples dense segmentation with explicit curve modeling, enabling mutual guidance to optimize both structural continuity and endpoint localization. Extensive evaluations on L3D-2K, L3D, and P2ILF demonstrate consistent improvements over competitive methods, establishing a more reliable foundation for intraoperative anatomy guidance. Our code will be available at this https URL.
75. 【2605.26629】DelowlightSplat: Feed-Forward Gaussian Splatting for Lowlight 3D Scene Reconstruction
链接:https://arxiv.org/abs/2605.26629
作者:Fuzhen Jiang,Zengtian Xie,Zhuoran Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:sparse posed images, sparse posed, posed images, images are central, central to robotics
备注:
点击查看摘要
Abstract:Novel-view synthesis and 3D reconstruction from sparse posed images are central to robotics and AR/VR. Yet, feed-forward 3D Gaussian reconstruction fails under lowlight due to noise, color shifts, and unreliable correspondence. We propose DelowlightSplat, a lowlight-aware feed-forward Gaussian splatting framework for clean novel-view rendering. We build a controllable multi-view lowlight benchmark by degrading only context views while keeping target views clean. We introduce a lightweight Lowlight Adapter for residual enhancement to improve matchability, and couple it with cost-volume-based multi-view inference to directly predict clean 3D Gaussians. Experiments show that DelowlightSplat significantly outperforms previous feed-forward method and two-stage pipeline under lowlight conditions.
76. 【2605.26624】MSCGC-KAN: Multi-scale Causal Graph Convolution and Kolmogorov-Arnold Feature Mapping for EEG Emotion Recognition
链接:https://arxiv.org/abs/2605.26624
作者:Haoliang Gong,Qingshan She,Jiale Xua,Yunyan Gao,Xugang Xi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:important affective computing, affective computing task, EEG emotion recognition, recent EEG foundation, based emotion recognition
备注:
点击查看摘要
Abstract:Electroencephalogram (EEG)-based emotion recognition is an important affective computing task, and recent EEG foundation models provide useful generic representations for downstream adaptation. However, under the fine-tuning setting, three limitations remain prominent: insufficient modeling of multi-scale emotional dynamics, inadequate exploitation of inter-channel functional connectivity, and the limited expressive power of simple linear classification heads. To address these issues, this paper proposes a new EEG emotion recognition method, termed MSCGC-KAN, which introduces a structured task head composed of multi-scale causal graph convolution and Kolmogorov--Arnold feature mapping. Built on a pre-trained CBraMod backbone, MSCGC-KAN enhances downstream adaptation by jointly strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping within a compact task-specific head. This design preserves the representation advantage of the foundation model while making the classifier more sensitive to emotion-related spatiotemporal patterns. Extensive experiments are conducted on the public FACED and SEED-VII datasets. The proposed method achieves a balanced accuracy of 60.66\%, a Cohen's Kappa of 0.5525, and a weighted F1-score of 60.40\% on FACED, and obtains 33.27\%, 0.2223, and 33.64\%, respectively, on SEED-VII. Compared with the CBraMod+Linear baseline, the balanced accuracy is improved by 5.91 and 2.03 percentage points on the two datasets, respectively. These results indicate that structured task-head design is an effective way to improve EEG emotion recognition when fine-tuning pre-trained EEG models.
77. 【2605.26621】MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation
链接:https://arxiv.org/abs/2605.26621
作者:Zichun Wang,Hairong Shi,Bingzheng Wei,Yan Xu,Zihua Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:free-form clinical query, Volumetric Reasoning Segmentation, aims to segment, medical scan, medical knowledge
备注:
点击查看摘要
Abstract:Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.
78. 【2605.26616】Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction
链接:https://arxiv.org/abs/2605.26616
作者:Zhenhua Du,Zhen Tan,Haoyu Zhang,Dewen Hu,Shuaifeng Zhi,Peidong Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable success, Splatting has achieved, Gaussian Splatting, achieved remarkable, remarkable success
备注: 27 pages, 14 figures
点击查看摘要
Abstract:While 3D Gaussian Splatting has achieved remarkable success in photorealistic novel view synthesis, its pursuit of fast and high-fidelity 3D reconstruction has long been constrained by a trade-off between geometric accuracy and optimization efficiency. Methods specialized in image rendering converge quickly at the cost of imperfect geometry caused by superfluous primitives overfitting training views, while methods integrating neural signed-distance field (SDF) for better geometry incur prohibitive training costs. In this paper, we attempt to strike a better trade-off by tethering scaffold-anchored Gaussians to a jointly optimized sparse voxel scaffold. This hybrid Gaussian-Voxel representation explicitly confines anchored Gaussians to a narrow band around surfaces defined by voxelized SDFs, which effectively improves representation efficiency and condenses floating Gaussians without sacrificing geometry quality. An implicit surface tethering loss further pulls individual Gaussian primitives closer to SDF-induced surfaces in a mutually regularized manner for improved reconstruction accuracy. Extensive experiments on diverse real-world indoor scenes from ScanNet++, ScanNetv2, and DeepBlending datasets demonstrate that our method achieves state-of-the-art surface reconstruction quality as well as superior novel view synthesis against leading baselines, while maintaining fast training convergence and real-time rendering. Code will be available at this https URL.
79. 【2605.26601】FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling
链接:https://arxiv.org/abs/2605.26601
作者:Guixian Xu,Yide Liang,Zeli Su,Xuexian Song,Ziyin Zhang,Yushuang Dong,Ting Zhang,Xu Han
类目:Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
关键词:severely underserved low-resource, underserved low-resource language, low-resource language due, Tibetan vision-language research, progressed rapidly
备注:
点击查看摘要
Abstract:Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
80. 【2605.26584】O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding
链接:https://arxiv.org/abs/2605.26584
作者:Peiran Wu,Yunze Liu,Chi-Hao Wu,Chen Chen,Junxiao Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Omnimodal large language, noisy user generated, Omnimodal large, user generated videos, enable unified audio
备注:
点击查看摘要
Abstract:Omnimodal large language models enable unified audio video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio visual association in noisy user generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training free plug in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6\% (1.53$\times$ speedup) and memory by 34.7\% compared with full token inference.
81. 【2605.26576】rackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting
链接:https://arxiv.org/abs/2605.26576
作者:Yuyang Tan,Renhe Zhang,Hang Zhang,Ao Li,Xin Tan
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:utilizes natural language, Gaussian Splatting, utilizes natural, natural language, crucial capability
备注:
点击查看摘要
Abstract:Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.
82. 【2605.26538】Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer
链接:https://arxiv.org/abs/2605.26538
作者:Amey Sunil Kulkarni
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pre-trained diffusion models, question remains underexplored, core question remains, diffusion models, advanced rapidly
备注: Accepted to CVPR NTIRE 2026
点击查看摘要
Abstract:Style transfer with pre-trained diffusion models has advanced rapidly, but a core question remains underexplored: where in the model should style injection be strongest? StyleID, the leading training-free method, uses a single global parameter (gamma) uniformly across all layers and timesteps, which forces a fixed tradeoff between style quality and content preservation. We show this tradeoff is unnecessarily rigid. We systematically explore four dimensions of control: varying style injection strength across decoder layers, across denoising timesteps, and scheduling ControlNet geometric conditioning along both axes. The pattern is consistent everywhere: decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, reliably outperform the reverse. Beyond direction, schedule shape matters: cosine and square-root timestep schedules outperform linear. Most importantly, we find that gamma scheduling and ControlNet conditioning are nearly independent. The resulting combined configurations expand the Pareto frontier, offering superior tradeoffs between style fidelity and content preservation compared to any single baseline setting. Our best balanced configuration achieves ArtFID of 27.036 versus StyleID's 28.801 - a 6.1% relative improvement, with consistent gains across the full style-content tradeoff frontier. Results are validated across 35 configurations totaling over 28,000 stylized images using four complementary metrics. These findings generalize across SD backbones with identical rank ordering. All modifications are training-free, parameter-free, and require only a few lines of scheduling code; code is available at this https URL.
83. 【2605.26535】Recursive Flow Matching
链接:https://arxiv.org/abs/2605.26535
作者:Jiahe Huang,Sihan Xu,Sharvaree Vadgama,Rose Yu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
关键词:solving physics systems, modeling complex spatiotemporal, complex spatiotemporal dynamics, models have emerged, powerful paradigm
备注: Project page: [this https URL](https://jhhuangchloe.github.io/RecFM/)
点击查看摘要
Abstract:Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20$\times$ speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.
84. 【2605.26533】A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
链接:https://arxiv.org/abs/2605.26533
作者:Malikussaid,Imad Gohar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Automated industrial inspection, linguistic interpretation left, industrial inspection requires, precise defect localization, Automated industrial
备注: 23 pages, 6 figures, 9 equations, and 6 tables
点击查看摘要
Abstract:Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes a YOLO26-x-obb oriented bounding-box detector localizes defects at dataset-native resolution. The Bridge a deterministic, parameter-free encoding module maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 0.41, HR=4%, and Expert Score = 8.6/10 compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
85. 【2605.26525】ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation
链接:https://arxiv.org/abs/2605.26525
作者:Akide Liu,Jinbo Xing,Chaojie Mao,Ye Li,Zeyu Zhang,Yefei He,Weijie Wang,Zihan Wang,Yu Liu,Gholamreza Haffari,Bohan Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Minute-scale cinematic video, Minute-scale cinematic, generative video models, Multi-Shot Video Extrapolation, generative video
备注: Project Page: [this https URL](https://reca.vmv.re) , Code: [this https URL](https://github.com/ali-vilab/ReCA)
点击查看摘要
Abstract:Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at this https URL.
86. 【2605.26524】CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence
链接:https://arxiv.org/abs/2605.26524
作者:Yuxu Lu,Dong Yang,Xiaoyu Li,Mengwei Bao,Congcong Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:intelligent transportation systems, Maritime intelligent transportation, ensuring navigation safety, busy waterways, intelligent transportation
备注:
点击查看摘要
Abstract:Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code resources for this work can be available at this https URL.
87. 【2605.26520】InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
链接:https://arxiv.org/abs/2605.26520
作者:Zhiwei Ning,Wenwen Tong,Xiangli Kong,Shengnan Ma,Ziyi Shang,Jingcheng Ni,Tao Hu,Yong Xien Chng,Jixuan Ying,Zehuan Wu,Hanming Deng,Jie Yang,Yuanjie Zheng,Wei Liu,Lewei Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:complex visual challenges, reasoning trajectories remain, text-centric paradigm, limiting their applicability, trajectories remain
备注:
点击查看摘要
Abstract:While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.
88. 【2605.26519】$R^3$: 3D Reconstruction via Relative Regression
链接:https://arxiv.org/abs/2605.26519
作者:Congrong Xu,Huachen Gao,Xingyu Chen,Yuliang Xiu,Jun Gao,Anpei Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent feed-forward geometry, single forward pass, feed-forward geometry foundation, demonstrated impressive generalization, geometry foundation models
备注:
点击查看摘要
Abstract:Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow unbounded over time. Our solution, which we call $R^3$, employs relative regression. We employ a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: weighting losses during training and guiding pose aggregation during inference. $R^3$ supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism. Project page: this https URL
89. 【2605.26514】CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies
链接:https://arxiv.org/abs/2605.26514
作者:Geonwoo Baek,Ikbeom Jang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Confirming Alzheimer disease, Confirming Alzheimer, positron emission tomography, Alzheimer disease, structural MRI-based prescreening
备注:
点击查看摘要
Abstract:Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.
90. 【2605.26513】Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression
链接:https://arxiv.org/abs/2605.26513
作者:Haojie Yin,Chengcheng Feng,Tianyi Liu,Tianqi Zhang,Kaizhu Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:assessing visual field, visual field loss, Optical Coherence Tomography, loss in ophthalmology, critical metric
备注:
点击查看摘要
Abstract:Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with another imaging of fundus photography (FP) could improve performance, as two ophthalmic medical imaging provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than unimodal model. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose the method of Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representation through adaptive margin based supervised contrastive learning. Then, our framework stabilizes the joint optimization with the sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show average 29\% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.
91. 【2605.26503】Uncertainty-Aware Gaussian Map for Vision-Language Navigation
链接:https://arxiv.org/abs/2605.26503
作者:Jianzhe Gao,Rui Liu,Yuxuan Xu,Tongtong Cao,Yingxue Zhang,Zhanguang Zhang,Sida Peng,Yi Yang,Wenguan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language instructions, language instructions, Vision-Language Navigation, natural language, uncertainty
备注:
点击查看摘要
Abstract:Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent's observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.
92. 【2605.26501】Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
链接:https://arxiv.org/abs/2605.26501
作者:Xiang Fang,Wanlong Fang,Changshuo Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, Large Vision-Language, transformed multi-modal understanding, visual question answering, question answering
备注: Publish in AAAI 2026
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy, a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack with prevalent LVLMs.
93. 【2605.26500】3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
链接:https://arxiv.org/abs/2605.26500
作者:Jianzhe Gao,Rui Liu,Wenguan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language instructions, Vision-language navigation, language instructions, Egocentric Scene Map, natural language
备注:
点击查看摘要
Abstract:Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.
94. 【2605.26491】Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models
链接:https://arxiv.org/abs/2605.26491
作者:Austin Wang,Jiaqi Han,Stefano Ermon,Yisong Yue
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:online reinforcement learning, human feedback, efficient alternative, alternative to online, online reinforcement
备注:
点击查看摘要
Abstract:Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
95. 【2605.26486】LongCat-Video-Avatar 1.5 Technical Report
链接:https://arxiv.org/abs/2605.26486
作者:Meituan LongCat Team:Xunliang Cai,Meng Cheng,Feng Gao,Zhe Kong,Jiamu Li,Le Li,Weiheng Li,Hongyu Liu,Shuai Tan,Xiaoming Wei,Tianyu Yang,Yong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:stability remains challenging, audio-driven video generation, remains challenging, commercial-grade stability remains, advances in audio-driven
备注: Homepage: [this https URL](https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/) Github: [this https URL](https://github.com/meituan-longcat/LongCat-Video)
点击查看摘要
Abstract:Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF Training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.
96. 【2605.26485】OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants
链接:https://arxiv.org/abs/2605.26485
作者:Xudong Lu,Xueying Li,Annan Wang,Yang Bo,Jinpeng Chen,Zengliang Li,Nianzu Yang,Rui Liu,Xue Yang,Jingwen Hou,Hongsheng Li
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:omnimodal large language, native online inference, large language models, language models evaluated, real-time omnimodal large
备注:
点击查看摘要
Abstract:We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at this https URL.
97. 【2605.26483】Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis
链接:https://arxiv.org/abs/2605.26483
作者:Jianzhe Gao,Churan Wang,Weiyi Zhang,Jianghua Li,Li-An Li,Wenguan Wang,Yixin Zhu,Yizhou Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Medical video diagnosis, diagnosis involves inferring, involves inferring clinical, inferring clinical decisions, dynamic tissue responses
备注:
点击查看摘要
Abstract:Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states via a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.
98. 【2605.26478】Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
链接:https://arxiv.org/abs/2605.26478
作者:Haoxiang You,Yilang Liu,Davis Zong,Qian Wang,Teeratham Vitchutripop,Qi Wang,Daniel Rakita,Ian Abraham
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
关键词:single NVIDIA RTX, NVIDIA RTX, visuomotor control policies, trains diverse visuomotor, diverse visuomotor control
备注:
点击查看摘要
Abstract:We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation, challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.
99. 【2605.26475】Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes
链接:https://arxiv.org/abs/2605.26475
作者:ZhiXin Sun
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Vision-based metric distance, unstable imaging conditions, large-scale outdoor environments, outdoor environments due, area measurement remains
备注:
点击查看摘要
Abstract:Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with birds-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
100. 【2605.26470】riadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules
链接:https://arxiv.org/abs/2605.26470
作者:Junseo Bang,Dong Ju Mun,Hoigi Seo,Seongmin Hong,Se Young Chun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generative posterior sampling, Generative posterior, solving inverse problems, diffusion models, models has emerged
备注: ICML 2026
点击查看摘要
Abstract:Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, which usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG) and stochasticity. While prior arts have focused on how to develop each or all components, less attention has given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflict with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
101. 【2605.26460】AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation
链接:https://arxiv.org/abs/2605.26460
作者:Jian Zhang,Zhijun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multi-Modal Diffusion Transformers, Diffusion Transformers, encode rich representations, produce overlapping activations, target responses spill
备注:
点击查看摘要
Abstract:Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
102. 【2605.26456】Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth
链接:https://arxiv.org/abs/2605.26456
作者:Kai Zheng,Qiang Feng,Xingjian Liu,Wenquan Tan,Yuan Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shown strong results, depth foundation models, Prior Depth, depth foundation, shown strong
备注: 6 pages, 3 figures, 2 tables
点击查看摘要
Abstract:Sparse-LiDAR-prompted depth foundation models (PromptDA, Prior Depth Anything, DMD3C) have shown strong results on indoor scenes or within KITTI's standard 80-meter evaluation cap. However, two limitations remain: (i) systematic distance-stratified evaluation in long-range driving regimes (50-150 m) is largely absent; (ii) prior approaches built on disparity-based foundations rely on pre-interpolated dense priors, leaving truly sparse LiDAR injection on point-map foundations (e.g., MoGe-2, NeurIPS 2025) unexplored. We present SLIM (Sparse-LiDAR Injected Monocular geometry), the first adaptation of MoGe-2 to accept truly sparse LiDAR input. SLIM integrates a partial-convolution sparse encoder with a multi-scale fusion neck that fuses LiDAR features into the point-map decoder at five scales. We adopt density-agnostic training (random injection ratio in [0.005, 0.30]) so a single model serves diverse input densities. On Virtual KITTI and CARLA, SLIM reduces the absolute relative error of the MoGe-2 baseline by approximately 39-51% at 100-150 m. Ablation across six injection ratios shows partial-convolution injection improves both AbsRel and RMSE on Virtual KITTI in all six settings; on CARLA, AbsRel improves in five of six settings (one near-tie at 0.015 differs by 0.0013), and RMSE is comparable across encoders, with partial-convolution improving in three settings (by up to 0.31 unit) and losing by at most 0.11 unit in the other three.
103. 【2605.26451】Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation
链接:https://arxiv.org/abs/2605.26451
作者:Zhiyao Cui,Chenxu Wang,Shuyue Hu,Yiqun Zhang,Wenqi Shao,Qiaosheng Zhang,Zhen Wang
类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
关键词:Producing presentation slides, strict spatial constraints, automatically entails coordinating, entails coordinating narrative, coordinating narrative structure
备注:
点击查看摘要
Abstract:Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at this https URL.
104. 【2605.26449】Cross-scale Aligned Supervision for Training GANs
链接:https://arxiv.org/abs/2605.26449
作者:Sangeek Hyun,MinKyu Lee,Jae-Pil Heo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:resulting multi-stage synthesis, Modern GANs, introduce adversarial supervision, interpret the resulting, resulting multi-stage
备注: Preprint
点击查看摘要
Abstract:Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
105. 【2605.26447】Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting
链接:https://arxiv.org/abs/2605.26447
作者:Jiangbei Hu,Weichao Song,Shibo Yu,Mohan Wang,Zihan Yi,Rui Wu,Mingkang Xiang,Na Lei,Shengfa Wang,Zhongxuan Luo,Ying He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains challenging due, Omnidirectional Gaussian Splatting, complex participating-media effects, Gaussian Splatting, absorption and scattering
备注:
点击查看摘要
Abstract:Underwater scene reconstruction is essential for immersive exploration of aquatic environments, yet remains challenging due to complex participating-media effects such as absorption and scattering, as well as the limited field of view (FoV) of conventional cameras. Although combining panoramic imaging with 3D Gaussian Splatting (3DGS) offers a promising direction for photorealistic underwater rendering, traditional 3DGS struggles with both spherical projection distortion and underwater medium degradation. In this paper, we propose \textbf{Underwater360}, a physics-informed omnidirectional 3DGS framework for underwater panoramic scene reconstruction. First, we introduce an Omnidirectional Gaussian Splatting module that performs ray casting directly in spherical camera space instead of relying on 2D projection approximations, thereby reducing geometric distortions under 360$^\circ$ FoV. Second, we design a physics-based appearance-medium modeling architecture with pose-conditioned appearance embeddings to explicitly decouple intrinsic scene radiance from depth-dependent backscatter and attenuation, enabling physically grounded scene appearance restoration. Finally, we establish a new panoramic underwater benchmark dataset containing both synthetic and real-world scenes. Extensive experiments demonstrate that Underwater360 achieves superior performance in underwater novel view synthesis and scene appearance restoration, delivering improved rendering quality and cross-view consistency in complex underwater environments. The code and datasets are released at this https URL
106. 【2605.26441】Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
链接:https://arxiv.org/abs/2605.26441
作者:Xiang Fang,Zeyu Xiong,Wanlong Fang,Xiaoye Qu,Chen Chen,Jianfeng Dong,Keke Tang,Pan Zhou,Yu Cheng,Daizong Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:weakly-supervised video temporal, video temporal grounding, addresses the challenging, Complex moment proposals, moment proposals
备注: Published in ECCV 2024
点击查看摘要
Abstract:This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance severely relies on the quality of proposals, which are also time-consuming and complicated for selection. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal this http URL, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondence between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment this http URL show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.
107. 【2605.26421】HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection
链接:https://arxiv.org/abs/2605.26421
作者:Senyuan Shi,Hao Tan,Zichang Tan,Shuhan Feng,Ajian Liu,Sergio Escalera,Jun Wan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Synthetic Image Detection, posing significant challenges, existing Synthetic Image, Image Detection, posing significant
备注: 8 pages, 6 figures
点击查看摘要
Abstract:The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language models (e.g., CLIP), recent attempts have leveraged learnable textual prompts to identify synthetic images. However, they still leverage static prompt as a fixed boundary for real and fake images, failing to adapt to the varying types of forgery that emerge during inference. To overcome this issue, we propose **HydraPrompt**, an asymmetric prompting framework that dynamically adjusts the category centers by aligning with fine-grained image cues. Specifically, we propose an Asymmetric Prompt Adapter (**APA**): (1) for authentic category, we introduce a single set of prompts to capture the consistent representative patterns, which serves as a unified anchor for real content. While (2) for fake category, we construct sample-adaptive prompts that specialize in capturing diverse cues from different samples, enabling adaptive modeling of forgery image variations. To increase pronounced discriminability within different synthetic images, we further introduce a Conditional Supervised Contrastive (**CSC**) objective, which compacts the authentic representations while capturing fine-grained forgery clues. Extensive experiments on popular SID benchmarks demonstrate the state-of-the-art performance of our framework.
108. 【2605.26415】he Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP
链接:https://arxiv.org/abs/2605.26415
作者:Kahyeon Nam,Hyesong Choi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deploying Vision-Language Models, quantized CNN classifiers, hardware typically requires, resource-constrained hardware typically, failure mode distinct
备注:
点击查看摘要
Abstract:Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% - 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
109. 【2605.26399】OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following
链接:https://arxiv.org/abs/2605.26399
作者:Qiaomu Miao,Haoyu Wu,Jingyi Xu,Minh Hoai,Dimitris Samaras
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Understanding human gaze, Understanding human, human gaze behavior, human-computer interaction, complex scene comprehension
备注:
点击查看摘要
Abstract:Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at this https URL.
110. 【2605.26391】Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing
链接:https://arxiv.org/abs/2605.26391
作者:Kiyohiro Nakayama,I-chao Shen,Ruofan Liu,Yiming Wang,Gordon Wetzstein,Takeo Igarashi
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Practical garment design, requires professional training, garment design spans, complex low-level editing, Practical garment
备注:
点击查看摘要
Abstract:Practical garment design spans two modes: intuitive creation from high-level intent, such as a reference image or text description, and complex low-level editing across 2D sewing patterns and 3D draped geometry, which requires professional training to navigate their complex interdependencies. Yet existing frameworks address only part of this challenge, offering either garment generation from casual inputs or direct editing on sewing patterns. To support both ends of the spectrum, we propose Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. This representation enables Garment Particles Flow (GPF), a rectified flow framework that supports intuitive generation from high-level inputs (text, images, sketches) and various editing operations on 2D sewing patterns and 3D geometries via diffusion posterior sampling. Finally, we introduce Particles-to-Pattern Flow that converts generated garment particles into curved-based patterns for simulation. We validate our model's generation ability on multiple datasets, achieving state-of-the-art garment generation results against competitive baselines. Our model also enables many garment editing scenarios, including garment interpolation, sewing pattern editing, point-cloud- and silhouette-conditioned garment generation. Our project website is at this https URL .
111. 【2605.26383】Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion
链接:https://arxiv.org/abs/2605.26383
作者:Dmytro Klepachevskyi,Alexander Wong,Sirisha Rambhatla,Yuhao Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:intra-class appearance variations, egocentric kitchen videos, large intra-class appearance, frequent occlusions, cluttered scenes
备注:
点击查看摘要
Abstract:Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs) - CLIP, DINOv2, DreamSim, I-JEPA, and SAM3 - and show that zero-shot methods fail, with the best baseline achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5% mAP to 52.8%.
112. 【2605.26382】Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation
链接:https://arxiv.org/abs/2605.26382
作者:Mengchen Fan,Baocheng Geng,Xi Xiao,Tianyang Wang,Siyuan Mei,Pulin Che,Xiaoqian Jiang,Qizhen Lan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical image segmenters, Deploying high-performing, medical image, image segmenters, inference latency
备注: Accepted by MICCAI 2026. 11 pages, 3 figures
点击查看摘要
Abstract:Deploying high-performing 3D medical image segmenters (e.g., nnU-Net) is often limited by memory footprint and inference latency. Compression is therefore necessary, but compact 3D encoders tend to lose fine structural cues (small lesions and sharp boundaries) as downsampling repeats across multi-resolution stages. We propose Detail Consistent Distillation (DCD), a stage-wise distillation framework that preserves structural detail across scales by aligning teacher-student features in a wavelet-decomposed representation. At each encoder stage, DCD distills directional detail components in the wavelet domain while leaving the coarse approximation comparatively unconstrained, avoiding over-regularization of global semantics. DCD is used only during training and introduces no inference-time overhead. Experiments on the BraTS 2024 and ISLES 2022 benchmarks demonstrate that our approach achieves superior performance in MRI segmentation using 3D multi-modal data. Code and implementation details for DCD are publicly available at this https URL.
113. 【2605.26381】Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery
链接:https://arxiv.org/abs/2605.26381
作者:Niels Sombekke,Rob G.J. Wijnhoven,Martin R. Oswald
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:spatial patch tokens, multi-modal classification framework, classification framework, framework that fuses, patch tokens
备注:
点击查看摘要
Abstract:We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
114. 【2605.26380】VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes
链接:https://arxiv.org/abs/2605.26380
作者:Jingru Chen,Yiming Liu,Mingtao Chen,Sijie Chen,Richeng Xuan,Liang Yang,Zhichao Hu,Fanyang Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Frontier multimodal large, multimodal large language, Frontier multimodal, large language models, multimodal large
备注:
点击查看摘要
Abstract:Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some ``think-with-images'' benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 promninent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20\%, and the best tool-enabled model reaches only 56.01\%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
115. 【2605.26376】BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma
链接:https://arxiv.org/abs/2605.26376
作者:Junlin Yang,Tian Yu,Nicha C. Dvornek,Yuexi Du,Peiyu Duan,Annabella Shewarega,Lawrence H. Staib,James S. Duncan,Julius Chapiro
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Hepatocellular carcinoma, underlying biological processes, hepatic functional reserve, similar survival outcomes, tumor-related oncologic factors
备注: Early accepted at MICCAI 2026
点击查看摘要
Abstract:Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On a HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p0.05), without supervision. The code is available at this https URL.
116. 【2605.26370】Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery
链接:https://arxiv.org/abs/2605.26370
作者:Luuk Versteeg,Rob G.J. Wijnhoven,Martin R. Oswald
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:jointly predicting instance-level, predicting instance-level roof, continuous geometric attributes, instance-level roof segment, roof segment masks
备注:
点击查看摘要
Abstract:We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes -- building height, roof slope, and roof azimuth -- from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
117. 【2605.26368】Unified Panoramic Geometry Estimation via Multi-View Foundation Models
链接:https://arxiv.org/abs/2605.26368
作者:Vukasin Bozic,Isidora Slavkovic,Dominik Narnhofer,Nando Metzger,Denis Rozumny,Konrad Schindler,Nikolai Kalischek
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Panoramic Geometry Reconstruction, greatly advanced, Geometry estimation, single panoramic image, single view
备注:
点击查看摘要
Abstract:Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes.
118. 【2605.26353】Personalized Generative Models for Contextual Debiasing
链接:https://arxiv.org/abs/2605.26353
作者:Xinran Liang,Esin Tureci,Prachi Sinha,Ye Zhu,Vikram V. Ramaswamy,Olga Russakovsky
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:images, Decoupling Contextual Patterns, Abstract, beach balls, beach ball
备注: CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at [this https URL](https://github.com/princetonvisualai/DecoupleGen)
点击查看摘要
Abstract:Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.
119. 【2605.26332】Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models
链接:https://arxiv.org/abs/2605.26332
作者:Arian Komaei Koma,Seyed Amir Kasaei,AmirMahdi Sadeghzadeh,Mohammad Hossein Rohban
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Machine unlearning aims, Machine unlearning, Toggle, Arian Komaei Koma, remove specific concepts
备注:
点击查看摘要
Abstract:Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2605.26332 [cs.CV]
(or
arXiv:2605.26332v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2605.26332
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Arian Komaei Koma [view email] [v1]
Mon, 25 May 2026 21:11:59 UTC (8,823 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models, by Arian Komaei Koma and 3 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context:
cs.CV
prev
|
next
new
|
recent
| 2026-05
Change to browse by:
cs
cs.AI
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked="checked"class=“labs-tab-input”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
120. 【2605.26328】RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields
链接:https://arxiv.org/abs/2605.26328
作者:Chuhan Chen,Tianshu Huang,Akarsh Prabhakara,Chaithanya Kumar Mummadi,Zhongxiao Cong,Anthony Rowe,Matthew O'Toole,Deva Ramanan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:provide metric depth, offering fine angular, radars provide metric, cameras offering fine, adverse weather
备注: Accepted to 3DV 2026. Project website: [this https URL](https://sally-chen.github.io/radar-sim/)
点击查看摘要
Abstract:Radars are an ideal complement to cameras: both are inexpensive, solid-state sensors, with cameras offering fine angular resolution, while radars provide metric depth and robustness under adverse weather. However, radar data is more difficult to interpret than camera images and varies significantly between sensors, necessitating increased reliance on simulation for prototyping sensors and processing pipelines. Recent work treating radar reconstruction as a novel view synthesis problem has shown great promise in reconstructing radar-relevant geometry and simulating low-level radar data. However, such methods are constrained by the low spatial resolution of the underlying radar. To address this, we propose a unified differentiable renderer, RadarSim, which leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field. Using a novel data set of calibrated radar camera recordings from a custom hand-held rig, we demonstrate that RadarSim produces sharper geometry and Doppler range frames than radar-only reconstructions.
121. 【2605.26316】E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
链接:https://arxiv.org/abs/2605.26316
作者:Qiao Gu,Lingni Ma,Adam W Harley,Richard Newcombe,Florian Shkurti,Julian Straub
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:others' actions manifest, physically grounded egocentric, grounded egocentric video, egocentric video generation, physically grounded
备注: Preprint. Project Page: [this https URL](https://e3c-videogen.github.io/)
点击查看摘要
Abstract:Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego exo human control over strong baselines, while also enabling intuitive scene editing.
122. 【2605.26295】Sleep-stage efficient classification using a lightweight self-supervised model
链接:https://arxiv.org/abs/2605.26295
作者:Eldiane Borges dos Santos Durães,João Batista Florindo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Linear SVM classifier, diagnosing sleep disorders, significantly enhance clinical, Linear SVM, sleep stage classification
备注:
点击查看摘要
Abstract:Accurate classification of sleep stages is crucial for diagnosing sleep disorders and automating this process can significantly enhance clinical assessments. This study aims to explore the use of a self-supervised model (more specifically, an adapted version of mulEEG) combined with a Linear SVM classifier to improve sleep stage classification. \textbf{Methods:} The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing ResNet-50 with 1D-convolutions used as time series encoder by a ResNet-18 backbone. Two other adaptations were conducted: the first one evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time series features, spectrogram features, and their concatenation as inputs to a Linear SVM classifier. \textbf{Results:} The results showed that reducing the volume of data offered a better cost-benefit ratio compared to simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. \textbf{Conclusions:} Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.
123. 【2605.26294】CNNs, Transformers, Hybrid, and Vision Language Models for Skin Cancer Detection
链接:https://arxiv.org/abs/2605.26294
作者:Durjoy Dey,Yuhong Yan,Hassan Hajjdiab
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:rising malignancy worldwide, fast rising malignancy, malignancy worldwide, rising malignancy, common and fast
备注: 13 pages, 3 figures, accepted at ICPRAI 2026, The Fifth International Conference on Pattern Recognition and Artificial Intelligence. To appear in Lecture Notes in Computer Science
点击查看摘要
Abstract:Skin cancer is a common and fast rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures. Experimental setups also vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution transformer backbones, and vision language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening oriented requirements. Our results show that well tuned CNNs already provide strong baselines, but transformer based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP based VLM achieve the best overall trade off between ranking performance and clinically relevant operating points, while CLIP based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.
124. 【2605.26292】Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning
链接:https://arxiv.org/abs/2605.26292
作者:Taha Koleilat,Hassan Rivaz,Yiming Xiao
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:ambiguous image-text alignment, precise multimodal understanding, image-text alignment, crucial for precise, precise multimodal
备注: MICCAI 2026 Early Accept; Project Page: [this https URL](https://tahakoleilat.github.io/Evi-Steer)
点击查看摘要
Abstract:Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at this https URL.
125. 【2605.26287】A multifractal-based masked auto-encoder: an application to medical images
链接:https://arxiv.org/abs/2605.26287
作者:Joao Batista Florindo,Viviane de Moura
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shown great promise, Multifractal-Optimized Masked Autoencoder, medical image classification, shown great, great promise
备注:
点击查看摘要
Abstract:Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other basiline and state-of-the-art models. The proposed method also adds minimum computational overhead as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model's ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
126. 【2605.26283】Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening
链接:https://arxiv.org/abs/2605.26283
作者:Durjoy Dey,Aymane Ajbar,Yuhong Yan
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Modern deep learning, deep learning offers, learning offers powerful, Multi-disease Image Dataset, Fundus Multi-disease Image
备注: 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings
点击查看摘要
Abstract:Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.
127. 【2605.26277】VesselSim: learning 3D blood vessel segmentation without expert annotations
链接:https://arxiv.org/abs/2605.26277
作者:Erin Rainville,Melissa Ananian,Tristan Mirolla,Hassan Rivax,Yiming Xiao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Blood vessel segmentation, deep learning techniques, Blood vessel, related deep learning, surgical planning
备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October
点击查看摘要
Abstract:Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challenges of providing expert vascular annotations pose a major obstacle for the progress of related deep learning techniques. To address this, we propose VesselSim, a two-stage framework for universal 3D blood vessel segmentation that eliminates the need for real annotated data during training. First, we introduce a stochastic, geometry-driven vascular simulation framework that models recursive branching, curvature-controlled growth, and collision-aware topology, followed by domain-randomized intensity synthesis to generate 16,500 anatomically plausible 3D angiographic volumes. Second, a 3D U-Net is trained solely on this synthetic data. To bridge the domain gap from synthetic to real images at inference time, we introduce a test-time adaptation strategy via a self-supervised mask reconstruction decoder, enabling adaptation to unseen clinical scans without prior domain knowledge. We evaluate VesselSim in a zero-shot setting on multiple real-world datasets spanning MR and CT across several anatomical regions, including the brain and kidneys. Despite being trained exclusively on synthetic data, VesselSim achieves performance competitive with state-of-the-art vascular segmentation foundation models. These findings suggest that learning vessel geometry from synthetic tubular structures is effective for robust cross-domain generalization, substantially reducing the reliance on acquired medical imaging data and more importantly, expert annotations.
128. 【2605.26273】Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation
链接:https://arxiv.org/abs/2605.26273
作者:İsmail Emre Canıtez,Özgür Erkent
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:adverse lighting conditions, provide insufficient information, urban driving scenes, driving scenes remains, scenes remains challenging
备注: 9 pages, 7 figures, To be Presented at Perception Beyond the Visible Spectrum workshop series (IEEE PBVS) at CVPR, 2026
点击查看摘要
Abstract:Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73\% and 86.24\% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at this https URL
129. 【2605.26266】Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion
链接:https://arxiv.org/abs/2605.26266
作者:Tuna Tuncer,Felix Becker,Thomas Pfeil
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
关键词:Chunk-wise autoregressive video, avoid redundant computation, diffusion models rely, videos grow longer, autoregressive video diffusion
备注: Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S
点击查看摘要
Abstract:Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
130. 【2605.26262】Dimensional Distribution Emotion State: Leveraging Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis
链接:https://arxiv.org/abs/2605.26262
作者:Émile Bergeron,Tadagbé Dhossou,Sébastien Tremblay,Jean-François Lalonde
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:important sites, dissemination of culture, exhibitions, Distribution Emotion State, emotion
备注:
点击查看摘要
Abstract:Museums are important sites for the dissemination of culture and art. They are institutions rooted in history and tradition; their exhibitions are often designed to highlight these aspects. Recently, a new approach is being explored in the field: emotion-based exhibitions. These exhibitions are designed specifically to elicit emotions in the visitors, in order to maximize engagement, and as a way to democratize access to art and attract a wider, more diverse audience. To do so, the emotional content of the artworks must first be extracted, however, manually annotating the artworks by experts is a prohibitively labor-intensive process, and risks introducing the personal bias of curators. To assist the museum curators in their design of these exhibitions, we wish to develop a tool that can predict the emotional response evoked by a work of art. In this article, we leverage a continuous bi-dimensional emotion space to enhance emotion representations and the training process of deep learning models. Drawing inspiration from existing categorical and dimensional emotion representations, we introduce a new representation, Dimensional Distribution Emotion State (DDES), along with a pipeline for multi-dataset training. We show that DDES provides multiple advantages compared to widely used representations while exhibiting similar baseline performance.
131. 【2605.26244】LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
链接:https://arxiv.org/abs/2605.26244
作者:Tengfei Liu,Yang Shi,Xuanyu Zhu,Jiafu Tang,Liu Yang,Qixun Wang,Zhuoran Zhang,Yuqi Tang,Fengxiang Wang,Yuhao Dong,Xinlong Chen,Bozhou Li,Bohan Zeng,Yue Ding,Xiaohan Zhang,Jialu Chen,Haotian Wang,Yuanxing Zhang,Pengfei Wan,Leye Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
关键词:protocols remain largely, remain largely confined, evaluation protocols remain, existing evaluation protocols, short-form settings
备注:
点击查看摘要
Abstract:Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5--10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
132. 【2605.26241】RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
链接:https://arxiv.org/abs/2605.26241
作者:Jiahao Zhang,Joseph Liu,Young-Yoon Lee,Seonghyeon Moon,Victor Zordan,Guy Tevet,Karen Liu,Stephen Gould,Oren Jacob,Haomiao Jiang,Mubbasir Kapadia,Yizhak Ben-Shabat
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Success in generative, building capable models, modeling across language, generative modeling, key driver
备注: Accepted to CVPR'26
点击查看摘要
Abstract:Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D Human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation, that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.
133. 【2605.26239】Sentinel: Embodied Cooperative Spatial Reasoning and Planning
链接:https://arxiv.org/abs/2605.26239
作者:Xiangye Lin,Hongxin Zhang,Ruxi Deng,Qinhong Zhou,Chuang Gan
类目:Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
关键词:city-scale outdoor domains, decentralized embodied agents, study Cooperative Spatial, Cooperative Spatial Intelligence, city-scale outdoor
备注: The first two authors contributed equally
点击查看摘要
Abstract:In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized embodied agents must communicate in natural language to agree on a mutually safe and convenient meeting point within large, city-scale outdoor environments. Each agent must then navigate safely while avoiding dynamic sentinels patrolling the area, using a tool that provides coarse spatial information. To address this, we propose CoSaR (Cooperative Spatial Reasoning and Planning), a framework that bridges the high-level communication and planning abilities of foundation models with the precision of classical spatial navigation algorithms. CoSaR enables agents to exchange situational updates, reason over evolving spatial constraints, and collaboratively replan trajectories. Evaluated across 14 city-level scenes with 3-5 agents, CoSaR consistently leads to faster gathering, shorter path lengths, and improved safety. Our results demonstrate that integrating dynamic communication with spatial reasoning is essential for robust multi-agent cooperation. By formalizing this new setting and providing a scalable benchmark, we aim to build a foundation for advancing cooperative spatial intelligence in embodied multi-agent systems. Code and challenge are available at this https URL.
134. 【2605.26236】DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
链接:https://arxiv.org/abs/2605.26236
作者:Ferdinand Paar,Lanmiao Liu,Aslı Özyürek,Serge Thill,Esam Ghaleb
类目:Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
关键词:gesture generation requires, Co-speech gesture generation, semantic, generation requires, Variational Information Bottleneck
备注:
点击查看摘要
Abstract:Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose \emph{DuoGesture}, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a \emph{Semantic Variational Information Bottleneck}, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by \emph{Motion-Grounded Semantic Conditioning}, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an \emph{Inertial Beat Prior}, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.
135. 【2605.26232】Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos
链接:https://arxiv.org/abs/2605.26232
作者:Bonan Ding,Umair Nawaz,Ufaq Khan,Abdelrahman M. Shaker,Muhammad Haris Khan,Jiale Cao,Jin Xie,Fahad Shahbaz Khan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Pre-trained video large, large language models, language models excel, Pre-trained video, video large language
备注: 19 pages, 8 figures, 7 tables, preprint
点击查看摘要
Abstract:Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth map, or dense temporal evidence. In such a scenario, uniform fusion induces modality interference, allowing irrelevant channels to distract the model. To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth map, or any other modality inputs via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance. Our UniMVU combines cross-modal self-attention with instruction-driven inner-modality gating module and a modality-level gating module with control token; for time-aligned streams we further adopt a fast-to-slow fusion scheme that reduces redundancy. Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D and MVBench), our UniMVU achieves consistent gains over static-fusion baselines achieving gains as high as 13.5 in terms of CIDEr metric. Further, our analysis shows that the gating mechanism aligns with the human-interpretable modality relevance, and ablations show the contributions of inner-modality and modality-level gating. Our UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.
136. 【2605.26230】Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction
链接:https://arxiv.org/abs/2605.26230
作者:Jin Hyeon Kim,Jaeeun Lee,Claire Kim,Kyoungjin Oh,Paul Hyunbin Cho,Jaewon Min,Yeji Choi,Jihye Park,Hyunhee Park,Minkyu Park,Seungryong Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, achieved remarkable, remarkable progress, reconstruction, reconstruction models
备注:
点击查看摘要
Abstract:Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.
137. 【2605.26149】AnySurf: Any Surface Generation with Directed Edge
链接:https://arxiv.org/abs/2605.26149
作者:Wenda Shi,Chenyuan Pan,Dengming Zhang,Yiren Song,Biao Zhang,Xingxing Zou
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:surface components prevail, Open surface components, content and support, support rendering, physical simulation
备注:
点击查看摘要
Abstract:Open surface components prevail in real industrial 3D content and support rendering, physical simulation and geometric editing. Garments serve as a typical open surface type, with numerous existing generation methods leveraging sewing patterns to generate 2D panels and stitch them into 3D shapes. Such domain-specific designs lack scalability and cannot generalize to shoes and accessories. Common field-based 3D generators prioritize watertight meshes and tend to create flawed double-layer structures on open surfaces. Though Trellis2 adopts field-free representation, its open surface results still contain normal and topology errors. We present AnySurf, a unified framework generating open, closed and hybrid 3D surfaces with accurate face orientation. Built on directed-edge enhanced Flexible Dual Grid (FDG-D), our representation retains normal direction information via oriented grid edges. We also propose ROS-FT post-training and a lightweight DE-Adapter with merely 1% extra parameters, facilitating directed edge learning while preserving original generation performance. We further construct Outfit3D dataset containing industrial garments and closed accessories. Our work transforms garment modeling into a universal 3D generation task. Experimental results demonstrate superior mesh quality and better practicality for downstream applications.
138. 【2605.26144】VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
链接:https://arxiv.org/abs/2605.26144
作者:JunJia Guo,Yuhang Yao,Jiawei(Joe)Zhou,Jingdi Chen
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:web-app generation capabilities, free stack choice, pruned Figma structure, stack choice, capabilities of LLM-based
备注:
点击查看摘要
Abstract:We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.
139. 【2605.26137】AssetGen: Deployable 3D Asset Generation at Interactive Speed
链接:https://arxiv.org/abs/2605.26137
作者:Dilin Wang,Xiaoyu Xiang,Kihyuk Sohn,Tom Monnier,Yu-Ying Yeh,Thu Nguyen-Phuoc,Jiawen Zhang,Yuchen Fan,Antoine Toisoul,Hyunyoung Jung,Prithviraj Dhar,Michael Bunnell,Nikolaos Sarafianos,Chuhang Zou,Roman Shapovalov,Andrea Vedaldi,Rakesh Ranjan
类目:Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:obtaining high-resolution assets, leaving user experience, generation is progressing, progressing rapidly, recent work
备注:
点击查看摘要
Abstract:While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.
140. 【2605.27139】Unsupervised Deep Image Prior for Sparse-View and Limited-Angle Electron Tomography
链接:https://arxiv.org/abs/2605.27139
作者:Serge Brosset,Daniel del Pozo Bueno,Thomas David,Laure Guetaz,Philippe Ciuciu,Zineb Saghi
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
关键词:characterization of nanomaterials, Electron tomography, plays an important, important role, Electron
备注: 22 pages, 12 figures
点击查看摘要
Abstract:Electron tomography (ET) plays an important role in the three-dimensional (3D) characterization of nanomaterials. However, under limited-angle and sparse-view conditions, conventional algorithms produce degraded reconstructions, which compromise the quality and interpretability of resulting 3D data. In this paper, we present deep image prior (DIP), an unsupervised deep learning (DL) approach, for highly degraded tomography acquisitions and demonstrate, using simulated data, that its performance is comparable to that of supervised approaches requiring training datasets, even for tilt ranges as limited as 60° and tilt increments of 10°. We then apply it to experimental data and show that it enables reliable 3D quantification under both sparse-view and limited-angle conditions, highlighting its potential for a wide range of materials and acquisition modalities.
141. 【2605.26726】Measuring Prediction Uncertainty in Neural Cellular Automata
链接:https://arxiv.org/abs/2605.26726
作者:Ario Sadafi,Michael Deutges,Nassir Navab,Carsten Marr
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Neural cellular automata, Neural cellular, encoder-decoder segmentation networks, cellular automata, provide a lightweight
备注: Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026
点击查看摘要
Abstract:Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($\Delta$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.

