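This post lists the latest papers fetched daily from arXiv, grouped by category: natural language processing, information retrieval, computer vision, and others.
Statistics
902 papers were updated today, including:
- Natural language processing: 200
- Information retrieval: 29
- Computer vision: 170
Natural Language Processing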
1. 【2605.30348】LLMSurgeon: Diagnosing Data Mixture of Large Language Models
Link: https://arxiv.org/abs/2605.30348
Authors: Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, shaping model behaviors, Language Models, failure modes
Comments: ACL 2026 Main. Code at [this https URL](https://github.com/Yaxin9Luo/LLMSurgeon)
Abstract:The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
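The inverse problem described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy confusion matrix, and the projected-gradient solver are all assumptions made for illustration. Under label shift, the mean classifier output `q` on generated text is approximately `C @ pi`, where column `j` of the soft confusion matrix `C` is the mean classifier output on text from domain `j`, and `pi` is the unknown mixture. The sketch recovers `pi` by constrained least squares over the probability simplex.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # the last index passing the test fixes the shift
    return [max(x - theta, 0.0) for x in v]


def recover_mixture(confusion, observed, steps=2000):
    """Minimise 0.5 * ||C @ pi - q||^2 over the simplex by projected gradient descent.

    confusion[i][j]: mean classifier probability of class i on text from domain j
    (hypothetical calibrated soft confusion matrix, columns sum to 1).
    observed[i]: mean classifier probability of class i on the target model's text.
    """
    n_classes, n_domains = len(confusion), len(confusion[0])
    step_size = 1.0 / n_domains  # safe step for a column-stochastic C (assumption)
    pi = [1.0 / n_domains] * n_domains  # start from the uniform mixture
    for _ in range(steps):
        residual = [
            sum(confusion[i][j] * pi[j] for j in range(n_domains)) - observed[i]
            for i in range(n_classes)
        ]
        gradient = [
            sum(confusion[i][j] * residual[i] for i in range(n_classes))
            for j in range(n_domains)
        ]
        pi = project_to_simplex([p - step_size * g for p, g in zip(pi, gradient)])
    return pi


if __name__ == "__main__":
    # Toy three-domain example; all numbers are made up for illustration.
    C = [[0.80, 0.10, 0.10],
         [0.15, 0.70, 0.20],
         [0.05, 0.20, 0.70]]
    true_pi = [0.5, 0.3, 0.2]
    q = [sum(C[i][j] * true_pi[j] for j in range(3)) for i in range(3)]
    print(recover_mixture(C, q))
```

Simply averaging classifier outputs would return `q` itself, which is biased whenever domains are confused with one another; inverting through `C` is what corrects for that confusion.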
2. 【2605.30345】SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
Link: https://arxiv.org/abs/2605.30345
Authors: Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Printed circuit board, Printed circuit, circuit board, manual and expertise-intensive, remains manual
Comments: 19 pages, 7 figures
Abstract:Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.
3. 【2605.30343】Unlocking the Working Memory of Large Language Models for Latent Reasoning
Link: https://arxiv.org/abs/2605.30343
Authors: Lukas Aichberger, Sepp Hochreiter
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: reasoning, memory, test-time compute, compute is typically, typically scaled
Comments: Preprint
Abstract:To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts. Drawing on this principle, we introduce Reasoning in Memory (RiM), a latent reasoning method that replaces the autoregressive generation of reasoning steps with memory blocks. These memory blocks are fixed sequences of special tokens that unlock the working-memory capacity of large language models. Since they are fixed rather than generated, they can be processed in a single forward pass, enabling compute-efficient latent reasoning. To operationalize these memory blocks, we employ a two-stage curriculum. First, we ground them by predicting explicit reasoning steps after each memory block. Second, we discard this step-level supervision and iteratively refine the final answer after each memory block. Our experiments on reasoning benchmarks show that, across language models of different families and sizes, RiM matches or exceeds existing latent reasoning methods while avoiding the autoregressive generation of thoughts. These results demonstrate that large language models can be trained to use working memory as an effective mechanism for latent reasoning.
4. 【2605.30335】Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
Link: https://arxiv.org/abs/2605.30335
Authors: Anany Kotawala
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Multi-component LLM agents, LLM agents assemble, agents assemble probabilistic, assemble probabilistic claims, violate basic probability
Comments: 25 pages, 7 figures, 24 tables. Preliminary versions to appear at the ICML 2026 Workshops on Combining Theory and Benchmarks (CTB), Statistical Frameworks for Uncertainty in Agentic Systems (AgenticUQ), and Failure Modes of Agentic AI (FAGEN)
Abstract:Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
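A toy version of the projection idea may make the residual concrete. The sketch below is an assumption-laden simplification, not the paper's hierarchical procedure: two components quote P(A) and P(not A) separately, the coherent set is the intersection of the hyperplane "quotes sum to 1" with the unit box, and plain Dykstra alternating projection finds the nearest coherent quote. The function names are hypothetical.

```python
import math


def project_sum_to_one(x):
    """Project onto the hyperplane {x : sum(x) = 1}."""
    shift = (sum(x) - 1.0) / len(x)
    return [v - shift for v in x]


def project_to_unit_box(x):
    """Project onto the box [0, 1]^n."""
    return [min(1.0, max(0.0, v)) for v in x]


def dykstra_repair(quote, iterations=200):
    """Nearest point (in L2) to `quote` inside hyperplane ∩ box, via Dykstra's algorithm."""
    x = list(quote)
    p = [0.0] * len(x)  # correction term for the hyperplane projection
    q = [0.0] * len(x)  # correction term for the box projection
    for _ in range(iterations):
        y = project_sum_to_one([a + b for a, b in zip(x, p)])
        p = [a + b - c for a, b, c in zip(x, p, y)]
        x_new = project_to_unit_box([a + b for a, b in zip(y, q)])
        q = [a + b - c for a, b, c in zip(y, q, x_new)]
        x = x_new
    return x


def compositional_residual(quote):
    """L2 distance from the composed quote to the coherent set (toy analogue of eps*)."""
    return math.dist(quote, dykstra_repair(quote))


if __name__ == "__main__":
    composed = [0.7, 0.5]  # P(A) and P(not A) quoted by two different components
    print(dykstra_repair(composed), compositional_residual(composed))
```

Each quote in `[0.7, 0.5]` is a valid probability on its own, yet the pair is incoherent because the two values sum to 1.2; the residual measures how far the composition sits from any coherent assignment.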
5. 【2605.30334】Demystifying Data Organization for Enhanced LLM Training
Link: https://arxiv.org/abs/2605.30334
Authors: Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, effective data curation, Language Models, revolutionized various fields
Comments: ACL 2026 Main Conference
Abstract:Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, the strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. Github Link: this https URL
6. 【2605.30333】COMPOSE: Composing Future Theorems from Citations and Formal Structure
Link: https://arxiv.org/abs/2605.30333
Authors: David Busbib, Michael Werman
Subjects: Computation and Language (cs.CL)
Keywords: satisfy two constraints, validly follow, direction of prior, prior work, work and respect
Abstract:A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024--2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. Project page is available at this https URL.
7. 【2605.30327】Reasoning with Sampling: Cutting at Decision Points
Link: https://arxiv.org/abs/2605.30327
Authors: Felix Zhou, Anay Mehrotra, Quanquan C. Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
Keywords: posttraining base language, Frontier reasoning models, power distribution, Frontier reasoning, reinforcement learning
Abstract:Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
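The contrast between uniform cuts and entropy-guided cuts can be sketched in a few lines. This is only an illustration of the cut-selection idea under stated assumptions: the per-position next-token distributions are supplied by hand, cut positions are sampled in proportion to entropy (a hypothetical weighting, not necessarily the paper's proposal distribution), and the Metropolis-Hastings acceptance step is omitted entirely.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_cut_weights(step_distributions):
    """Normalised weights over cut positions, proportional to next-token entropy."""
    entropies = [token_entropy(d) for d in step_distributions]
    total = sum(entropies)
    if total == 0:
        # Fully deterministic trace: fall back to the uniform cut of prior samplers.
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]


def sample_cut(step_distributions, rng):
    """Sample one cut position; high-entropy positions are chosen more often."""
    weights = entropy_cut_weights(step_distributions)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hand-made trace: position 1 is a genuine fork, the others are near-certain.
    trace = [[1.0, 0.0], [0.5, 0.5], [0.99, 0.01]]
    print(entropy_cut_weights(trace))
    print(sample_cut(trace, random.Random(0)))
```

With a uniform cut, each of the three positions above would be chosen a third of the time; the entropy weighting concentrates resampling on the fork at position 1.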
8. 【2605.30324】On Language Generation in the Limit with Bounded Memory
Link: https://arxiv.org/abs/2605.30324
Authors: Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: language generation, language
Comments: The abstract has been shortened to fit within the arXiv limit
Abstract:We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
9. 【2605.30315】Resolution Diagnostics for Paired LLM Evaluation
Link: https://arxiv.org/abs/2605.30315
Authors: Anany Kotawala
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Open LLM Leaderboard, public LLM leaderboards, displayed pairwise rankings, conventional paired-test resolution, paired-test resolution target
Comments: 16 pages, 7 figures, 12 tables. Accepted to the ICML 2026 Workshop on Hypothesis Testing, Seoul, South Korea, 2026. Copyright 2026 by the author(s)
Abstract:Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely-used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.
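The resolution ratio q = N/N* can be illustrated with a rough calculation. The sketch below uses the textbook first-order normal approximation for a two-sided paired test on per-item score differences, which is an assumption on my part and not the sharper expansion the abstract describes. The inputs `p_only_a` and `p_only_b` (the fractions of items that only model A or only model B answers correctly) and the function names are hypothetical.

```python
from statistics import NormalDist


def required_items(p_only_a, p_only_b, alpha=0.05, power=0.8):
    """Approximate N* for a two-sided paired z-test on per-item differences.

    The per-item difference is +1, -1, or 0, so its mean is
    delta = p_only_a - p_only_b and its variance is (p_only_a + p_only_b) - delta**2.
    """
    delta = p_only_a - p_only_b
    if delta == 0:
        raise ValueError("no accuracy gap: no finite sample size resolves the pair")
    variance = (p_only_a + p_only_b) - delta ** 2
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    return z_total ** 2 * variance / delta ** 2


def resolution_ratio(n_items, p_only_a, p_only_b, alpha=0.05, power=0.8):
    """q = N / N*; q < 1 means the benchmark is too small to resolve this pair."""
    return n_items / required_items(p_only_a, p_only_b, alpha, power)


if __name__ == "__main__":
    # Made-up example: a 2-point accuracy gap with 10% discordant items, 1,000 items.
    print(required_items(0.06, 0.04))
    print(resolution_ratio(1000, 0.06, 0.04))
```

The useful design point is that the required size depends on the discordant fraction rather than on the two marginal accuracies, which is why an unpaired shortcut can misjudge close comparisons.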
10. 【2605.30295】MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
Link: https://arxiv.org/abs/2605.30295
Authors: Valentina Bui Muti, Eugénie Dulout, Ziquan Fu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, electronic health record-congruent, settings remains limited, health record-congruent settings, record-congruent settings remains
Comments: Accepted to ICML 2026 Structured Data for Health Workshop
Abstract:Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.
11. 【2605.30290】Self-Trained Verification for Training- and Test-Time Self-Improvement
Link: https://arxiv.org/abs/2605.30290
Authors: Chen Henry Wu, Aditi Raghunathan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Self-improvement at scale, longstanding goal, natural places, test time, Self-improvement
Abstract:Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification.
12. 【2605.30280】Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Link: https://arxiv.org/abs/2605.30280
Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: resulting in fragmented, studied through specialized, fragmented capabilities, capabilities and limited, Embodied intelligence
Comments: 34 pages
Abstract:Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
13. 【2605.30274】Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection
Link: https://arxiv.org/abs/2605.30274
Authors: Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Document-level translation remains, large language models, impede global cohesion, limited context windows, redundant contextual information
Abstract:Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at this https URL.
14. 【2605.30273】LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback
Link: https://arxiv.org/abs/2605.30273
Authors: Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Keywords: requires substantial compute, Large language models, Large language, expert input, mental health queries
Abstract:Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivities. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.
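The construction of chosen-rejected pairs from community votes can be sketched as follows. The pairing rule, the `min_margin` threshold, and the record layout are assumptions made for illustration; the abstract does not specify how the authors filter or pair responses.

```python
from itertools import combinations


def build_preference_pairs(query, replies, min_margin=5):
    """Turn vote-scored replies to one post into DPO-style preference records.

    replies: list of (text, score) tuples, where score = upvotes - downvotes.
    min_margin: hypothetical threshold; pairs with a smaller score gap are
    skipped because the community signal is too weak to call a winner.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(replies, 2):
        if abs(score_a - score_b) < min_margin:
            continue
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    # Made-up example data.
    replies = [("reply A", 12), ("reply B", 1), ("reply C", 10)]
    for record in build_preference_pairs("example post", replies):
        print(record)
```

Records with `prompt`, `chosen`, and `rejected` fields are a common input layout for preference-optimization training, though the exact schema depends on the training library used.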
15. 【2605.30265】LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
Link: https://arxiv.org/abs/2605.30265
Authors: Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: achieved substantial progress, image-text training aimed, large-scale image-text training, driven by large-scale, achieved substantial
Abstract:Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
16. 【2605.30260】How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
Link: https://arxiv.org/abs/2605.30260
Authors: Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, dynamic real-world environments, real-world environments
Comments: Ongoing work
Abstract:Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction $\Delta L$ to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of $p > 0.5$ constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at this https URL.
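The sufficiency claim in this abstract follows from simple arithmetic: if the target token holds more than half of the probability mass, no other token can exceed it, so greedy decoding (argmax) must select it. The sketch below illustrates that point only. The function names and the toy three-token distributions are hypothetical, not taken from the paper.

```python
def greedy_pick(probs):
    """Return the index of the highest-probability token (greedy decoding)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def recalled_verbatim(step_probs, target_ids):
    """True if greedy decoding picks the target token at every step."""
    return all(greedy_pick(p) == t for p, t in zip(step_probs, target_ids))


# Hypothetical per-step distributions over a 3-token vocabulary.
step_probs = [
    [0.51, 0.30, 0.19],  # target token 0 has p > 0.5, so it is the argmax
    [0.20, 0.55, 0.25],  # target token 1 has p > 0.5, so it is the argmax
]
target_ids = [0, 1]
print(recalled_verbatim(step_probs, target_ids))  # True

# Below the threshold nothing is guaranteed: a rival token can win.
print(greedy_pick([0.45, 0.48, 0.07]))  # 1
```

The condition is sufficient rather than necessary: a token with probability 0.4 is still recalled whenever every rival stays below 0.4.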
17. 【2605.30256】VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Link: https://arxiv.org/abs/2605.30256
Authors: Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: people simultaneously speak, Natural human conversation, people simultaneously, producing nonverbal cues, simultaneously speak
Comments: Project page: [this https URL](https://research.nvidia.com/labs/amri/projects/video-fdb/)
Abstract:Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.
18. 【2605.30251】Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
Link: https://arxiv.org/abs/2605.30251
Authors: Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, clean FULL prompt, FULL prompt, revealed gradually
Comments:
Abstract:Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
19. 【2605.30245】Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning
Link: https://arxiv.org/abs/2605.30245
Authors: Shaojie Wang, Liang Zhang
Subjects: Computation and Language (cs.CL)
Keywords: large language models, Current plan-based reasoning, methods improve large, improve large language, Current plan-based
Comments:
Abstract:Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ cot paradigm. While effective, a closer examination reveals an inherent paradigm-level gap: both the planning and its execution stages decide how to solve a problem, while the prior question of what to solve (recognizing the problem type, the applicable tools, and the foreseeable pitfalls) remains entirely implicit. To bridge this gap, we propose PPC (Preplan-Plan-CoT), a framework that introduces an explicit problem-understanding stage, the preplan, yielding a new question $\rightarrow$ preplan $\rightarrow$ plan $\rightarrow$ cot paradigm. Realizing this paradigm requires safeguarding the conceptual integrity of preplan at both ends. Specifically, we design a three-stage synthesis pipeline with a spoiler-score detector that filters out leakage and spoiler failures to build clean preplan supervision, and a composite GRPO reward that enforces that the generated plan genuinely follows from the preplan. Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing additional inference token overhead.
20. 【2605.30241】CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild
Link: https://arxiv.org/abs/2605.30241
Authors: Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Keywords: multilingual online settings, static benchmarks provide, verification increasingly occurs, Misinformation verification increasingly, occurs in public
Comments:
Abstract:Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web-search. Our results show that closed-input verification remains challenging, web access yields the largest gains, and web-enabled LLMs' source-selection policies are systematically misaligned with the sources human Community Notes raters converge on -- a gap that closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.
21. 【2605.30237】GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases
Link: https://arxiv.org/abs/2605.30237
Authors: Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Semi-structured knowledge bases, academic paper search, embed textual documents, Semi-structured knowledge, product search
Comments:
Abstract:Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine inquiries. Existing hybrid retrieval systems on SKBs either use the graph only for query expansion, mix textual and structural branches under a global weighting, or rely on fine-tuned graph-traversal generators. We present GRASP, a three-stage SKB retrieval framework unifying plan-based graph retrieval, plan-conditioned fusion with a dense retriever, and a fine-tuned reranker over the fused candidates. GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9. Ablation and sensitivity studies further confirm the effectiveness and robustness of GRASP.
22. 【2605.30233】Do Language Models Track Entities Across State Changes?
Link: https://arxiv.org/abs/2605.30233
Authors: Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: underlies complex reasoning, Entity tracking, fundamental skill, skill that underlies
Comments: ICML main conference 2026, 9 pages
Abstract:Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$ state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. We further investigate mechanisms of individual operations ($\texttt{PUT}$, $\texttt{REMOVE}$, $\texttt{MOVE}$) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the $\texttt{REMOVE}$ operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.
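For contrast with the non-incremental strategy this abstract reports, here is a minimal sketch of what incremental tracking would compute: the world state is updated after every `PUT`, `REMOVE`, or `MOVE` operation, so the answer to a query is a lookup in the final state. The box-and-object world and the tuple encoding of operations are assumptions made for illustration, not the paper's data format.

```python
def apply_ops(ops):
    """Apply PUT / REMOVE / MOVE operations in order and return the final state.

    The state maps each container name to the set of objects it holds.
    """
    boxes = {}
    for op in ops:
        kind = op[0]
        if kind == "PUT":
            _, obj, box = op
            boxes.setdefault(box, set()).add(obj)
        elif kind == "REMOVE":
            _, obj, box = op
            boxes.get(box, set()).discard(obj)
        elif kind == "MOVE":
            _, obj, src, dst = op
            boxes.get(src, set()).discard(obj)
            boxes.setdefault(dst, set()).add(obj)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return boxes


# Hypothetical episode with all three operation types.
ops = [
    ("PUT", "apple", "A"),
    ("PUT", "key", "B"),
    ("MOVE", "apple", "A", "B"),
    ("REMOVE", "key", "B"),
]
state = apply_ops(ops)
print(sorted(state["A"]))  # []
print(sorted(state["B"]))  # ['apple']
```

A symbolic reference like this is also a convenient way to generate gold answers for behavioral tests on sequences that mix several operations.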
23. 【2605.30232】How's it going? Reinforcement learning in language models recruits a functional welfare axis
Link: https://arxiv.org/abs/2605.30232
Authors: Andy Q Han, David J. Chalmers, Pavel Izmailov
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: reinforcement learning shape, reinforcement learning, learning shape, language model internal, model internal representations
Comments: 81 pages, 43 figures, 32 tables
Abstract:How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
24. 【2605.30219】When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
Link: https://arxiv.org/abs/2605.30219
Authors: Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Long-horizon interactions require, manage accumulating information, interactions require language, Long-horizon interactions, Contextual Belief Management
Comments: Work in progress
Abstract:Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. Code is coming soon at this https URL.
25. 【2605.30214】GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German
Link: https://arxiv.org/abs/2605.30214
Authors: Fabian Mewes, Anne Lauscher, Vagrant Gautam
Subjects: Computation and Language (cs.CL)
Keywords: Third-person singular pronouns, study stereotypical biases, Third-person singular, reason about reference, study stereotypical
Comments:
Abstract:Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.
26. 【2605.30202】A Dual-Path Architecture for Scaling Compute and Capacity in LLMs
Link: https://arxiv.org/abs/2605.30202
Authors: Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali
Subjects: Computation and Language (cs.CL)
Keywords: Looped transformers apply, shared block multiple, block multiple times, parameter-efficient route, route to scaling
Comments:
Abstract:Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale compute, the number of sequential operations applied to a hidden state, and capacity, the parameters available at a single step. For this we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
27. 【2605.30189】Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
Link: https://arxiv.org/abs/2605.30189
Authors: Travis Lelle
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: baseline task performance, dominant distribution format, preserving baseline task, training data poisoning, fine-tuned LLMs
Comments: 45 pages, 27 tables. Code and evaluation data: [this https URL](https://github.com/Travis-ML/lora-backdoors). Trained adapter weights available on request
Abstract:We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
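The weight-level statistic named in this abstract can be sketched as follows, under stated assumptions. The abstract gives only the name of the statistic, so several details here are guesses: "dimension-normalized" is read as dividing each module's Frobenius norm by the square root of its element count, each module is represented by a plain nested list standing in for its LoRA weight update, and the population standard deviation is used. The function names and toy matrices are hypothetical and are not taken from the author's repository.

```python
import math
import statistics


def normalized_frobenius(matrix):
    """Frobenius norm divided by sqrt(rows * cols) (assumed normalization)."""
    rows, cols = len(matrix), len(matrix[0])
    squared_sum = sum(v * v for row in matrix for v in row)
    return math.sqrt(squared_sum) / math.sqrt(rows * cols)


def cross_module_std(modules):
    """Population standard deviation of normalized norms across modules."""
    norms = [normalized_frobenius(m) for m in modules.values()]
    return statistics.pstdev(norms)


# Toy 2x2 stand-ins for per-module LoRA weight updates (hypothetical values).
modules = {
    "q_proj": [[0.1, -0.1], [0.1, 0.1]],
    "down_proj": [[0.4, 0.4], [-0.4, 0.4]],
}
print(round(cross_module_std(modules), 3))  # 0.15
```

Turning this number into a poisoned-or-clean decision needs a threshold, which the abstract does not give. It does note that the weight-level detector is calibration-bound to the base model, so any cutoff would have to be fitted per base model.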
28. 【2605.30152】Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?
Link: https://arxiv.org/abs/2605.30152
Authors: Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: Proactive agents read, read user activity, agents read user, Proactive agents, user activity
Comments: 31 pages, 5 figures, 7 tables
Abstract:Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
29. 【2605.30133】CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution
Link: https://arxiv.org/abs/2605.30133
Authors: Milan Straka
Subjects: Computation and Language (cs.CL)
Keywords: Multilingual Coreference Resolution, shared task focuses, Shared Task, Coreference Resolution, Multilingual Coreference
Comments: Accepted to CODI-CRAC 2026
Abstract:We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at this https URL.
30. 【2605.30131】CCS: Clinical Consensus Selection for Radiology Report Generation
Link: https://arxiv.org/abs/2605.30131
Authors: Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language model, single-path generation task, multimodal large language, generation task, produces one decoded
Comments: 17 pages, 6 figures
Abstract:Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
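The selection step this abstract describes, picking the candidate that agrees most with the rest of the sampled pool, can be sketched in a few lines. This is a minimal illustration under assumptions: word-level Jaccard overlap stands in for the paper's utilities (which combine text-based measures with an image-report-trained multimodal embedder), and the function names and example reports are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity (a stand-in utility, not the paper's)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_by_consensus(candidates, utility=jaccard):
    """Return the index of the candidate with the highest mean utility
    against all other candidates in the pool."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    if len(candidates) == 1:
        return 0

    def mean_agreement(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], c) for c in others) / len(others)

    return max(range(len(candidates)), key=mean_agreement)


# Hypothetical sampled reports for one study.
pool = [
    "no acute cardiopulmonary abnormality",
    "no acute cardiopulmonary process",
    "large right pleural effusion",
    "no acute abnormality",
]
best = select_by_consensus(pool)
print(pool[best])  # no acute cardiopulmonary abnormality
```

Passing a different `utility` function, for example cosine similarity over embeddings, changes what counts as agreement without touching the selection logic.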
31. 【2605.30126】PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
Link: https://arxiv.org/abs/2605.30126
Authors: Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Vision-Language Models, quadratic computational bottleneck, map visual inputs, dense token sequences, Large Vision-Language
Comments: 33 pages, 4 figures
Abstract:Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
32. 【2605.30107】Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking
Link: https://arxiv.org/abs/2605.30107
Authors: Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen
Subjects: Computation and Language (cs.CL)
Keywords: Creating spoken dialogue, Creating spoken, World Health Organization, methodologically challenging, challenges are amplified
Comments: Accepted to Findings of ACL 2026
Abstract:Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.
33. 【2605.30104】SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
Link: https://arxiv.org/abs/2605.30104
Authors: Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma
Subjects: Computation and Language (cs.CL)
Keywords: receiving near-tied scores, Widely used language-model, metrics cannot resolve, frontier systems, systems often receiving
Comments:
Abstract:Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge, a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single elimination and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the ranking-accuracy--latency trade-off over competing protocols, attaining 0.83--1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
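The call counts in this abstract can be put in context with a little arithmetic. The figure of 28.00 calls for full pairwise evaluation matches a pool of 8 candidates if every unordered pair is judged exactly once, since C(8, 2) = 28. The pool size is an inference, not something the abstract states. A plain single-elimination bracket over 8 candidates would need 7 matches. The abstract does not explain why the reported average of 11.89 is higher, so the sketch leaves that gap open.

```python
from math import comb


def full_pairwise_calls(n):
    """Judge calls when every unordered pair of n candidates is compared once."""
    return comb(n, 2)


def single_elimination_matches(n):
    """Matches in a single-elimination bracket: each match removes one candidate."""
    return n - 1


n = 8  # assumed pool size, inferred from the 28.00 figure
print(full_pairwise_calls(n))         # 28
print(single_elimination_matches(n))  # 7
print(round(11.89 / 28.00, 3))        # 0.425, SEAL's reported share of full pairwise cost
```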
34. 【2605.30090】DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation
Link: https://arxiv.org/abs/2605.30090
Authors: Jiamin Chen,Qianben Chen,Jiawen Zhang,Yidi Wu,Yuchen Li,Xiaokun Zhang,Wangchunshu Zhou,Chen Ma
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Long-form video generation, Long-form video, video generation, cinematic control, moving from short
Notes:
Abstract:Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.
35. 【2605.30085】Conformal Certification of Reasoning Trace Prefixes
Link: https://arxiv.org/abs/2605.30085
Authors: Matt Y. Cheung,Ashok Veeraraghavan,Hanjie Chen,Guha Balakrishnan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: Language model reasoning, Language model, critical error occurs, model reasoning traces, Conformal Reasoning Output
Notes: Code available at [this https URL](https://github.com/matthewyccheung/crop)
Abstract:Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
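The prefix rule described in this abstract (keep the longest contiguous prefix whose step risks stay below a calibrated threshold) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the threshold is chosen with a generic split-conformal quantile, and the function names, the calibration data layout, and the tie handling are all hypothetical. See the linked repository for the actual procedure.

```python
import math
from typing import Optional, Sequence, Tuple


def longest_clean_prefix(step_risks: Sequence[float], threshold: float) -> int:
    """Return how many leading steps have risk strictly below the threshold."""
    for i, risk in enumerate(step_risks):
        if risk >= threshold:
            return i
    return len(step_risks)


def calibrate_threshold(
    calibration: Sequence[Tuple[Sequence[float], Optional[int]]],
    alpha: float,
) -> float:
    """Pick a threshold from (step_risks, first_error_index) calibration pairs.

    Assumed construction: a returned prefix contains the first error exactly
    when every risk up to and including that step is below the threshold, so
    each trace is scored by the max risk up to its first error (inf if it has
    no error). Taking the k-th smallest score with k = floor(alpha * (n + 1))
    bounds the error-inclusion probability by alpha under exchangeability.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    scores = []
    for risks, first_error in calibration:
        if first_error is None:
            scores.append(math.inf)  # no annotated error: never violates
        else:
            scores.append(max(risks[: first_error + 1]))
    scores.sort()
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little data for this alpha: certify nothing
    return scores[k - 1]
```

With a threshold in hand, `longest_clean_prefix` gives the number of steps to keep; the remaining suffix would be routed to review or repair, as the abstract describes.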
36. 【2605.30080】Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model
Link: https://arxiv.org/abs/2605.30080
Authors: Thang Dang,Akira Nakagawa,Kenichi Kobayashi,Koichi Shirahata
Categories: Computation and Language (cs.CL)
Keywords: traditional Large Language, Large Language Models, Large Language, vocabulary design complexity, Tokenization-free hierarchical models
Notes:
Abstract:Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance when processing byte data in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing for tracking of chunk-size evolution throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
37. 【2605.30076】UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering
Link: https://arxiv.org/abs/2605.30076
Authors: Yingdong Shi,Ruiming Zhang,Changming Li,Zhiyu Yang,Kaixing Zhang,Jingyi Yu,Kan Ren
Categories: Computation and Language (cs.CL)
Keywords: Activation-based control steers, steers large language, large language models, control steers large, persona and style
Notes: 16 pages, 4 figures
Abstract:Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.
38. 【2605.30058】HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
Link: https://arxiv.org/abs/2605.30058
Authors: Weihan Peng,Chenxu Zhang,Qianao Wang,Yuling Shi,Heng Lian,Qihong Mao,Jiahao Pang,Chunliang Feng,Bowen Li,Xiaodong Gu
Categories: Computation and Language (cs.CL)
Keywords: hold equal importance, demonstrated remarkable task-oriented, remarkable task-oriented abilities, emotional dimensions hold, dimensions hold equal
Notes: GitHub: [this https URL](https://github.com/peng-weihan/HEART-BENCH)
Abstract:While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.
39. 【2605.30052】REPOT: Recoverable Program-of-Thought via Checkpoint Repair
Link: https://arxiv.org/abs/2605.30052
Authors: Parsa Mazaheri
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: action silently invalidates, single invalid action, invalid action silently, emits a Python, Python program
Notes:
Abstract:One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears ≥30% on GPT-medium and ≥70% on Gemini, vs ≤3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
40. 【2605.30051】Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues
Link: https://arxiv.org/abs/2605.30051
Authors: Zhangqi Duan,Shuyan Huang,Alexander Scarlatos,Jaewook Lee,Simon Woodhead,Andrew Lan
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: tutor model evaluation, large language model, facilitate tutor model, automated tutoring tools, developing large language
Notes:
Abstract:A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.
41. 【2605.30040】Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage
Link: https://arxiv.org/abs/2605.30040
Authors: Shahinul Hoque,Jinghuai Zhang,Jinyuan Sun,Fnu Suya
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: standard pricing model, large language models, counts directly affects, commercial large language, Per-token billing
Notes:
Abstract:Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
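The dollar figure in this abstract follows from the inflation percentage: a 1,469% increase means the bill becomes 1 + 14.69 = 15.69 times the honest amount. The short sketch below reproduces that arithmetic. It assumes a flat per-token price, so the bill scales linearly with reported tokens; the function name is hypothetical.

```python
def inflated_bill(honest_bill: float, inflation_pct: float) -> float:
    """Bill after over-reporting tokens by inflation_pct percent (flat per-token price assumed)."""
    return honest_bill * (1 + inflation_pct / 100)


# 1,469% inflation on a $100 honest bill, rounded to cents.
print(round(inflated_bill(100, 1469), 2))
# 50.85% over-reporting applied to the same honest bill.
print(round(inflated_bill(100, 50.85), 2))
```

Under the same assumption, the 50.85% over-reporting case would raise a $100 honest bill to about $150.85.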
42. 【2605.30036】Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
Link: https://arxiv.org/abs/2605.30036
Authors: Asaf Yehudai,Naama Rozen,Ariel Gera
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, demonstrate a remarkable, personas and roles
Notes: GEM Workshop at ACL 2026
Abstract:Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.
43. 【2605.30031】Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation
Link: https://arxiv.org/abs/2605.30031
Authors: Bo-Han Feng,Yu-Hsuan Li Liang,Chien-Feng Liu,You-Hsuan Chang,Yun-Nung Chen
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Audio Language, Audio Language Models, Large Audio, Audio Language, Language Models
Notes: Submitted to ACL ARR 2026 May
Abstract:Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.
44. 【2605.30022】Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders
Link: https://arxiv.org/abs/2605.30022
Authors: Pierre-Antoine Lequeu,Camille Barboule,Benjamin Piwowarski
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: represent sequence order, remains poorly understood, stored remains poorly, permutation-invariant Transformers represent, Transformers represent sequence
Notes: 8 pages + 10 pages of bibliography and appendix
Abstract:Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval (Chen et al., 2025). Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure-oriented and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.
45. 【2605.30021】Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs
Link: https://arxiv.org/abs/2605.30021
Authors: Vinay Samuel,Yapei Chang,Mohit Iyyer
Categories: Computation and Language (cs.CL)
Keywords: LLM output space, narrows an LLM, LLM output, multiple valid answers, open-ended instructions
Notes: Under Review. 26 pages, 3 figures, 16 tables
Abstract:Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely maintain MTBench, IFEval, and Arena-Hard performance, and reduce direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. We release our code and data at this https URL.
46. 【2605.30018】Latent Performance Profiling of Large Language Models
Link: https://arxiv.org/abs/2605.30018
Authors: Tanmoy Chakraborty,Ayan Sengupta,Suparna Bhattacharya,Partha Pratim Chakrabarti,Amlan Chakrabarti,Supratik Chakraborty,Partha Pratim Das,Lipika Dey,Richa Singh,Mayank Vatsa
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: frequently achieve impressive, Large language models, Large language, achieve impressive scores, frequently achieve
Notes:
Abstract:Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend that reporting LPP alongside benchmarks provides a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.
47. 【2605.29992】Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
Link: https://arxiv.org/abs/2605.29992
Authors: M. Ali Bayram,Banu Diri,Savaş Yıldırım
Categories: Computation and Language (cs.CL)
Keywords: Turkish-focused sentence embedding, sentence embedding model, semantic search, retrieval-augmented generation, foundational component
Notes: 14 pages, 2 figures, 4 tables, Appendix included
Abstract:Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
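Stage (3) of the pipeline above trains the student against precomputed teacher vectors with a cosine similarity objective. A minimal pure-Python sketch of such a loss follows. It assumes the common form 1 - cos(student, teacher) averaged over a batch; the abstract does not give the exact formulation, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """1 - cosine similarity: 0 when directions match, 2 when they are opposite."""
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    return 1.0 - sum(a * b for a, b in zip(s, t))


def batch_loss(
    students: Sequence[Sequence[float]],
    teachers: Sequence[Sequence[float]],
) -> float:
    """Mean loss over paired student outputs and precomputed teacher vectors."""
    losses = [cosine_distillation_loss(s, t) for s, t in zip(students, teachers)]
    return sum(losses) / len(losses)
```

Because the teacher vectors are read from disk rather than computed on the fly, only the student runs during training, which is the cost saving the abstract attributes to offline distillation.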
48. 【2605.29987】MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment
Link: https://arxiv.org/abs/2605.29987
Authors: Dang Hong Nguyen,Nhi Ngoc-Yen Nguyen,Huy-Hieu Pham
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: learning enables elastic-dimension, enables elastic-dimension embeddings, multi-scales representation learning, representation learning enables, Soft Collapse Regularization
Notes: Accepted at the GlobalSouthML Workshop at ICML 2026. 13 pages, 2 figures
Abstract:Although multi-scale representation learning enables elastic-dimension embeddings, nested subspaces often suffer from dimensional redundancy and spectral collapse. To address this, we introduce MIC, a framework that optimizes the geometric landscape of multi-granular embeddings through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to mitigate redundancy between prefix and residual subspaces via cross-correlation penalties, alongside Spectral Isotropy Regularization (SIR) to ensure hyper-spherical uniformity in low-dimensional prefixes. By unifying these strategies through a self-distillation objective, MIC generates semantically dense representations that maintain high discriminative power. Our experiments demonstrate that MIC significantly outperforms standard baselines, particularly in high-compression scenarios where maintaining informational capacity is most critical.
49. 【2605.29971】Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning
Link: https://arxiv.org/abs/2605.29971
Authors: Zhenghao Herbert Zhou,R. Thomas McCoy,Robert Frank
Categories: Computation and Language (cs.CL)
Keywords: largely targeted discrete, targeted discrete features, language model representations, grammatical number, representations have largely
Notes:
Abstract:Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit a vector toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.
50. 【2605.29951】MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization
Link: https://arxiv.org/abs/2605.29951
Authors: Anisha Saha,Varsha Suresh,Teodora Kamova,Sophia Wiedmann,Timothy Hospedales,Vera Demberg
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: requires intent-aware cross-modal, pairs requires intent-aware, intent-aware cross-modal reasoning, benign image-text pairs, image-text pairs requires
Notes:
Abstract:Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.
51. 【2605.29927】Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents
Link: https://arxiv.org/abs/2605.29927
Authors: Alejandra Zambrano,Sara Vera Marjanovic,Imene Kerboua,Xing Han Lù,Leila Kosseim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: LLM-based web agents, recent advances, LLM-based web, limited exploration, omission of critical
Notes: Extended version of paper submitted to EMNLP, waiting for acceptance
Abstract:Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. Then we systematically evaluate 4 different plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.
52. 【2605.29897】ExCAM: Explainable Cultural Awareness Metrics
Link: https://arxiv.org/abs/2605.29897
Authors: Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka, Masao Utiyama, Steffen Eger
Categories: Computation and Language (cs.CL)
Keywords: large language models, large language, language models, models is crucial, crucial to ensure
Comments: preprint
Click to view abstract
Abstract:Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like behavior in stressful situations through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Also, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.
53. 【2605.29889】Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate
Link: https://arxiv.org/abs/2605.29889
Authors: David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Patient-voiced clinical-triage benchmarks, clinical-triage benchmarks report, benchmarks report high, report high under-triage, high under-triage rates
Comments: 9 pages main text, 27 pages total including appendices; 7 figures, 25 tables
Click to view abstract
Abstract:Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free text. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in all cases for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an adjacent acuity letter to the gold answer) rather than knowledge failure. Thus, the failure originates in the output format and not in the clinical representation.
54. 【2605.29886】CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2605.29886
Authors: Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: knowledge-intensive question answering, incorporating external evidence, Retrieval-augmented generation, improves knowledge-intensive question, knowledge-intensive question
Comments: 17 pages, 13 figures
Click to view abstract
Abstract:Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at this https URL
55. 【2605.29861】Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Link: https://arxiv.org/abs/2605.29861
Authors: Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Zhicheng Dou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, synthesizes scattered evidence, concise factual answers
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
56. 【2605.29847】EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation
Link: https://arxiv.org/abs/2605.29847
Authors: Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang, Zijian Li, Pengjun Xie, Bo Liu, Jiuxin Cao
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, advanced Large Language, significantly advanced Large, Reinforcement Learning, Large Language
Comments:
Click to view abstract
Abstract:Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
57. 【2605.29826】Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models
Link: https://arxiv.org/abs/2605.29826
Authors: Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Knowledge Editing, Multimodal Large, Large Language
Comments:
Click to view abstract
Abstract:Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.
58. 【2605.29815】PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
Link: https://arxiv.org/abs/2605.29815
Authors: Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, exploration of Large, Language Models, speed and scalability
Comments:
Click to view abstract
Abstract:The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021--2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLM reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.
59. 【2605.29807】Data filtering methods for training language models
Link: https://arxiv.org/abs/2605.29807
Authors: Egor Shevchenko, Elena Bruches
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: machine learning models, Confident Learning, Data quality, critical factor, machine learning
Comments: AINL-2026
Click to view abstract
Abstract:Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
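To make the filtering idea concrete, here is a minimal sketch of the thresholding step commonly associated with Confident Learning. It is a simplified illustration, not the authors' pipeline: the function name `confident_label_issues` and the toy probabilities are hypothetical, and the sketch assumes every class has at least one labeled example. An example is flagged when some class other than its given label clears that class's average self-confidence.

```python
def confident_label_issues(probs, labels):
    """Return indices whose given label disagrees with a confidently predicted class.

    probs:  list of per-example class probability lists (model outputs)
    labels: list of given (possibly noisy) integer labels
    Assumes every class appears at least once in `labels`.
    """
    n_classes = len(probs[0])
    # Per-class threshold: mean predicted probability of class j
    # over the examples whose given label is j.
    thresholds = []
    for j in range(n_classes):
        members = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(members) / len(members))

    issues = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose probability clears their own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # no confident class: leave the example alone
        guess = max(confident, key=lambda j: p[j])
        if guess != y:
            issues.append(i)
    return issues


# Toy data (hypothetical): the last example is labeled 0 but looks like class 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.05, 0.95]]
labels = [0, 0, 1, 1, 0]
flagged = confident_label_issues(probs, labels)
print(flagged)
```

A control run in the spirit of the abstract would remove the same number of randomly chosen examples and compare downstream F1-macro.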
60. 【2605.29801】AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
Link: https://arxiv.org/abs/2605.29801
Authors: Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Modern open-world agents, exhibit powerful cross-environment, Modern open-world, OpenClaw exhibit powerful, powerful cross-environment execution
Comments: 44 pages, 12 Figures, 9 Tables
Click to view abstract
Abstract:Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.
61. 【2605.29800】Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels
Link: https://arxiv.org/abs/2605.29800
Authors: Guneet Kohli
Categories: Computation and Language (cs.CL)
Keywords: panels aggregate votes, diverse models yield, aggregate votes, votes from multiple, expectation that diverse
Comments: 14 pages, 5 figures, 12 tables
Click to view abstract
Abstract:LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets (each with 100 human annotations per item), we find that the 9 judges effectively provide only about 2 independent votes' worth of information. Roughly three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. The consequences are stark: the panel's actual accuracy falls 8-22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions. Neither adding more judges nor using smarter aggregation algorithms helps -- established methods close at most 11% of this gap, even with access to the correct answers. We quantify these findings using the Kish effective sample size (n_eff) and a Condorcet null model, and show the deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench). The bottleneck is correlated judges, not the aggregation algorithm, implying that scaling up panels cannot substitute for genuinely independent evaluation.
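The gap between nominal and effective panel size can be illustrated with a short sketch. It assumes the equal-correlation design-effect form associated with Kish, n_eff = n / (1 + (n - 1) * rho), which may differ from the estimator used in the paper. The function names are hypothetical, and the correlation value 0.4375 is chosen only so that 9 judges reduce to 2 effective votes; it is not a figure from the paper.

```python
from math import comb


def effective_votes(n_judges, mean_error_corr):
    """Effective number of independent votes under an equal-correlation assumption."""
    return n_judges / (1 + (n_judges - 1) * mean_error_corr)


def independent_majority_accuracy(n_judges, p_correct):
    """Condorcet-style null model: probability that a strict majority of
    independent judges, each correct with probability p_correct, is right."""
    need = n_judges // 2 + 1
    return sum(
        comb(n_judges, k) * p_correct**k * (1 - p_correct) ** (n_judges - k)
        for k in range(need, n_judges + 1)
    )


print(effective_votes(9, 0.0))     # uncorrelated errors: all 9 votes count
print(effective_votes(9, 0.4375))  # illustrative correlation: 2 effective votes
print(independent_majority_accuracy(9, 0.7))  # ceiling if judges were independent
```

Under this reading, adding judges with the same error correlation raises n_eff only slowly, since it is bounded above by 1 / rho.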
62. 【2605.29797】Metric-Dependent Annotation Saturation for Learning from Label Distributions
Link: https://arxiv.org/abs/2605.29797
Authors: Guneet Kohli
Categories: Computation and Language (cs.CL)
Keywords: needed to capture, capture it depends, fine-tune NLI models, annotators disagree, annotators needed
Comments: 16 pages, 3 figures, 14 tables
Click to view abstract
Abstract:When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
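The contrast between soft labels and label smoothing can be sketched in a few lines. This is an illustration under assumed toy data, not the paper's code: the helper names and vote counts are hypothetical. It shows why smoothing cannot encode item-level ambiguity: every smoothed one-hot target has the same entropy, while annotator-derived soft labels do not.

```python
from math import log


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(x * log(x / y) for x, y in zip(p, q) if x > 0)


def smooth(hard_label, n_classes, eps):
    """Uniform label smoothing of a one-hot target."""
    return [
        (1 - eps) + eps / n_classes if k == hard_label else eps / n_classes
        for k in range(n_classes)
    ]


def soft_label(votes, n_classes):
    """Empirical label distribution from a list of annotator votes."""
    return [votes.count(k) / len(votes) for k in range(n_classes)]


# Two hypothetical 3-class items with the same majority label (class 0).
clear = soft_label([0] * 95 + [1] * 5, 3)
ambiguous = soft_label([0] * 50 + [1] * 40 + [2] * 10, 3)
smoothed = smooth(0, 3, 0.1)  # identical target for both items

print(entropy(clear), entropy(ambiguous))  # soft labels separate the items
print(entropy(smoothed))                   # smoothing gives one fixed value
print(kl_divergence(ambiguous, smoothed))  # mismatch left by smoothing
```

Entropy correlation, as described in the abstract, would then compare per-item `entropy` of model predictions against per-item `entropy` of the annotator distribution.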
63. 【2605.29796】SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
Link: https://arxiv.org/abs/2605.29796
Authors: Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Agentic search enables, solve complex multi-hop, complex multi-hop questions, search enables LLMs, Agentic search
Comments:
Click to view abstract
Abstract:Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code is anonymously released at this https URL.
64. 【2605.29791】ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation
Link: https://arxiv.org/abs/2605.29791
Authors: Yutong Yang, Chenxi Miao, Weikang Li, Yunfang Wu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, convincingly simulate personas, implicit behavioral decisions, revealing a substantial
Comments:
Click to view abstract
Abstract:While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap ($G_{\text{KD}}$). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address these issues, we propose ActTraitBench, a human-grounded evaluation framework for measuring personality consistency in LLMs. Grounded in empirical human data, ActTraitBench establishes one-to-one mappings between psychometric facets and behavioral paradigms, and applies a Distributional Calibration via Quantile Mapping procedure to align LLM-judge score distributions with human norms. Experiments on 14 mainstream LLMs reveal a pervasive knowledge-decision asymmetry, where larger and more capable models often exhibit stronger behavioral divergence despite highly consistent self-reports. To mitigate this gap, we further introduce the Chain of Cognitive Alignment (CoCA), a plug-and-play inference-time intervention that improves alignment in reasoning-capable frontier models while exposing clear capability limitations in smaller architectures.
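Quantile mapping, the general technique named in the abstract, can be sketched as follows. This is a generic empirical version under assumed toy scores, not the ActTraitBench procedure itself: the function name `quantile_map` and both score lists are hypothetical. Each judge score is replaced by the human score that sits at the same empirical quantile.

```python
from bisect import bisect_right


def quantile_map(score, judge_scores, human_scores):
    """Map a judge score to the human score at the same empirical quantile."""
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    # Number of judge scores <= score, i.e. the empirical CDF times len(judge).
    rank = bisect_right(judge_sorted, score)
    # Integer ceiling of rank * len(human) / len(judge), shifted to a 0-based index.
    idx = (rank * len(human_sorted) + len(judge_sorted) - 1) // len(judge_sorted) - 1
    idx = max(0, min(len(human_sorted) - 1, idx))
    return human_sorted[idx]


# Hypothetical data: the judge scores cluster high, the human norms span 1-5.
judge = [6, 7, 8, 9, 10]
human = [1, 2, 3, 4, 5]
calibrated = [quantile_map(s, judge, human) for s in judge]
print(calibrated)
```

The mapping preserves the ranking of judge scores while forcing their distribution to match the human reference sample.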
65. 【2605.29782】Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning
Link: https://arxiv.org/abs/2605.29782
Authors: Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: refines large language, Reinforcement learning, large language models, directly optimizing model, optimizing model behavior
Comments: Accepted at ICML 2026
Click to view abstract
Abstract:Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses an LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
66. 【2605.29751】DySem: Uncovering Dynamic Semantic Components via Multilingual Consensus for Calculating Semantic Textual Similarity
Link: https://arxiv.org/abs/2605.29751
Authors: Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang
Categories: Computation and Language (cs.CL)
Keywords: Calculating semantic textual, natural language processing, Calculating semantic, foundational task, task in natural
Comments: 18 pages, 23 figures, 5 tables
Click to view abstract
Abstract:Calculating semantic textual similarity is a foundational task in natural language processing. Current large language model (LLM)-based methods typically rely on extracting last-layer hidden states with fixed dimensions to compute similarity for every text pair. We argue that this paradigm suffers from two limitations: (i) the last hidden layer encodes more general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) the hidden layer dimensions of LLMs are generally very large, which introduces redundancy and noise for representing semantics. In this work, we propose DySem, a novel training-free framework that investigates more semantic-related internal components of LLMs via multilingual consensus, and shifts away from static representation spaces in favor of dynamic, sample-specific semantic dimensions by constructing a text-dependent joint semantic set and computing similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while maintaining lower dimensions for similarity calculation. The code is released at this https URL.
67. 【2605.29744】Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence
Link: https://arxiv.org/abs/2605.29744
Authors: Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Keywords: GPT and Claude, Claude in healthcare, generalist large language, large language models, domain-specific specialist models
Comments: Accepted at ICML 2026. 12 pages main text, 16 pages appendix
Click to view abstract
Abstract:The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.
68. 【2605.29741】AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation
Link: https://arxiv.org/abs/2605.29741
Authors: Idris Abdulmumin, Tajuddeen Gwadabe, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Nomonde Khalo, Ibrahim Said Ahmad, Abiodun Modupe, Anina Mumm, Sibusiso Biyela, Michelle Rabie, Johanna Havemann, Marek Rei, Jade Abbott, Vukosi Marivate
Categories: Computation and Language (cs.CL)
Keywords: produce scientific knowledge, African languages access, scientific communication limits, African languages, dominance of colonial
Comments:
Click to view abstract
Abstract:The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific terminology in these languages. We introduce AfriScience-MT, a parallel corpus covering six African languages (Amharic, Hausa, Luganda, Northern Sotho, Yorùbá, and isiZulu) across 11 scientific domains. Professional translators, working with expert science communicators, translated plain-language summaries of scientific papers into each target language and created new terms where none existed. We benchmark machine translation systems and large language models in zero-shot, few-shot, and fine-tuned settings. Our results show that closed-source models outperform all open-source models at both the sentence and document levels: GPT-5.4 and Gemini-3.1-Flash-Lite lead with average sentence-level COMET scores of 68.3 and 68.0, respectively, and tie at an average document-level COMET of 48.3. Among open systems, fine-tuned NLLB-1.3B reaches 67.3 at the sentence level, and TranslateGemma-12B reaches 44.0 at the document level with 1-shot in-context learning. We release AfriScience-MT to support benchmarking and document-level scientific MT for African languages.
69. 【2605.29738】Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions
Link: https://arxiv.org/abs/2605.29738
Authors: Volodymyr Ovcharov
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: NLP benchmarks overwhelmingly, Legal NLP benchmarks, making cross-lingual comparison, cross-lingual comparison impossible, Legal NLP
Comments: 14 pages, 5 figures, 8 tables. Dataset: [this https URL](https://huggingface.co/datasets/overthelex/multi-legal-bench)
Click to view abstract
Abstract:Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language, and rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA-FR (Romance, -2.1 pp) transfers better than UA-PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.
70. 【2605.29737】Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs
Link: https://arxiv.org/abs/2605.29737
Authors: Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: offering substantial gains, LLM-based coding assistants, rapid adoption, offering substantial, developer productivity
Comments:
Click to view abstract
Abstract:LLM-based coding assistants are seeing rapid adoption, offering substantial gains in developer productivity. As organizations increasingly ship code these agents produce, the security of that code becomes critical. Prior work has shown that minor prompt perturbations degrade the functional correctness of LLM-generated code, but whether they also compromise code security has remained unstudied. We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation, and indicate that input-handling flaws can be caught before generation while secure-defaults flaws require intervention during decoding.
71. 【2605.29734】HTAM: Hierarchical Transition-Attended Memory for Operator Optimization
Link: https://arxiv.org/abs/2605.29734
Authors: Yining Zhang, Mingyang Yi, Chen Wang, Xuwen Xiang, Tianhe Jia, Zedong Dan, Chengqing Zong, Yue Wang
Categories: Computation and Language (cs.CL)
Keywords: High-performance GPU kernels, efficient LLM deployment, High-performance GPU, LLM deployment, efficient LLM
Comments: 24 pages, 5 figures
Click to view abstract
Abstract:High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.
72. 【2605.29715】User-Aware Active Knowledge Acquisition for Emotional Support Dialogue
Link: https://arxiv.org/abs/2605.29715
Authors: Mufan Xu, Kehai Chen, Jiahao Hu, Xinchao Xu, Muyun Yang, Tiejun Zhao, Min Zhang
Categories: Computation and Language (cs.CL)
Keywords: Emotional support plays, large language models, strong reasoning capacity, existing emotional support, Emotional support
Comments:
Click to view abstract
Abstract:Emotional support plays an important role in dialogue systems, and its success depends on adapting to a user's evolving and implicit needs across multi-turn interactions while leveraging the strong reasoning capacity of large language models. However, since signals about user needs are often weak, indirect, and can only be disambiguated through multi-turn interaction, existing emotional support methods often struggle to acquire and generalize relevant conversational knowledge efficiently. To bridge this gap, we introduce User-Aware Active Knowledge Acquisition (UKA), a gradient-free active dialogue learning framework that explicitly represents uncertainty about user needs and incorporates active learning into both knowledge acquisition and response generation. We propose a Theory-of-Mind uncertainty estimation mechanism that allows the model to prioritize responses, thereby eliciting more informative user feedback. UKA is capable of efficiently exploring user-aligned conversational knowledge during training while maintaining robustness at test time. Experiments across multiple dialogue benchmarks and model architectures demonstrate that our approach consistently outperforms strong baselines in dialogue quality and user alignment.
73. 【2605.29714】Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation
Link: https://arxiv.org/abs/2605.29714
Authors: Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, Golnoosh Farnadi
Categories: Computation and Language (cs.CL)
Keywords: setting remain underexplored, multilingual setting remain, remain underexplored, scale language models, expert routing behavior
Comments:
Click to view abstract
Abstract:Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at this https URL.
74. 【2605.29712】Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies
Link: https://arxiv.org/abs/2605.29712
Authors: Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Grounded claim factuality, retrieval-augmented generation, generated outputs, large language model, Grounded claim
Comments: ACL 2026 Main
Click to view abstract
Abstract:Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. Code and datasets will be released upon acceptance.
75. 【2605.29711】Personalized Turn-Level User Conversation Satisfaction Benchmark
Link: https://arxiv.org/abs/2605.29711
Authors: Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: assistants is highly, disappoint another depending, User, satisfaction, turn-level user conversation
Comments:
Click to view abstract
Abstract:User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.
76. 【2605.29708】Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs
Link: https://arxiv.org/abs/2605.29708
Authors: Zhibo Zhang, Yuxi Li, Zhen Ouyang, Ling Shi, Kailong Wang
Categories: Computation and Language (cs.CL)
Keywords: specialization remains underexplored, router-driven expert activation, routed expert specialization, expert specialization remains, rely on sparse
Comments: 11 pages, 4 figures
Click to view abstract
Abstract:Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement that is localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.
77. 【2605.29707】Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding
Link: https://arxiv.org/abs/2605.29707
Authors: Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang
Categories: Computation and Language (cs.CL)
Keywords: accelerates LLM inference, decoding accelerates LLM, accelerates LLM, LLM inference, Speculative decoding accelerates
Comments:
Click to view abstract
Abstract:Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
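For readers unfamiliar with the draft-then-verify loop that speculative decoding builds on, here is a minimal greedy sketch. It is not Domino: the parallel draft backbone and the Domino head are not specified in the abstract, so the drafter below is a plain autoregressive stand-in. The functions `draft_next`, `target_next`, `speculative_step`, and `generate` are hypothetical names, and the two toy "models" are simple integer rules chosen only so the behavior is easy to follow.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, block_size: int) -> List[int]:
    """Run one greedy draft-then-verify step and return the accepted tokens."""
    # Draft phase: the cheap model proposes block_size tokens one by one.
    ctx = list(prefix)
    draft = []
    for _ in range(block_size):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: a real system scores all drafted positions in one
    # parallel forward pass of the target model; this toy uses a loop.
    ctx = list(prefix)
    accepted = []
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every drafted token matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             block_size: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, block_size))
    return out[: len(prefix) + max_new_tokens]


# Toy models (assumptions for illustration): the target counts modulo 10,
# and the drafter agrees except right after the token 4.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens = generate([0], draft_next, target_next, block_size=3, max_new_tokens=8)
```

With greedy verification the output always matches what the target model would produce on its own; the drafter only changes how many target calls are needed. The trade-off named in the abstract shows up in the draft phase, where each proposed token costs one sequential call.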
78. 【2605.29682】Scaling Laws for Agent Harnesses via Effective Feedback Compute
Link: https://arxiv.org/abs/2605.29682
Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Categories: Computation and Language (cs.CL)
Keywords: Agent harnesses increasingly, verify intermediate states, harnesses increasingly determine, Agent harnesses, models call tools
Comments:
Click to view abstract
Abstract:Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce \emph{Effective Feedback Compute} (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.
79. 【2605.29678】Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models?
Link: https://arxiv.org/abs/2605.29678
Authors: Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, task-relevant instructions, highly sensitive, studied through task-relevant
Comments:
Click to view abstract
Abstract:Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task-relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call them spurious prompts and show their surprising efficacy. We also propose a simple black-box search procedure for discovering them. Across reasoning and question-answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task-aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, returning an even, prime or small number without explicitly instructing the model to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at this https URL
80. 【2605.29676】Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems
Link: https://arxiv.org/abs/2605.29676
Authors: Lorenz Kutschka, Bernhard Geiger
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: systems consume tool, consume tool schemas, emit tool invocations, Large language models, Reduced Object Notation
Comments: 16 pages, 6 figures, 4 tables
Click to view abstract
Abstract:Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.
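To make the overhead argument concrete, the sketch below compares a JSON array of uniform records with a hypothetical header-plus-rows encoding. This encoding is an assumption made for illustration and is neither TOON nor TRON, whose syntax the abstract does not describe. Character counts stand in for token counts here; actual savings depend on the tokenizer. The helpers `to_compact` and `from_compact` are hypothetical names.

```python
import json
from typing import Dict, List


def to_compact(records: List[Dict[str, object]]) -> str:
    """Encode uniform records as one header line plus one row per record.

    Hypothetical format: keys appear once in the header instead of being
    repeated in every record. Values containing commas or newlines are
    not escaped, so this is only a toy.
    """
    keys = list(records[0].keys())
    lines = [",".join(keys)]
    for rec in records:
        lines.append(",".join(str(rec[k]) for k in keys))
    return "\n".join(lines)


def from_compact(text: str) -> List[Dict[str, str]]:
    """Decode the toy format; all values come back as strings."""
    header, *rows = text.split("\n")
    keys = header.split(",")
    return [dict(zip(keys, row.split(","))) for row in rows]


# Illustrative tool results (made-up values).
records = [
    {"tool": "search", "status": "ok", "latency_ms": 120},
    {"tool": "fetch", "status": "error", "latency_ms": 340},
]

json_text = json.dumps(records, separators=(",", ":"))
compact_text = to_compact(records)
saving = 1 - len(compact_text) / len(json_text)
```

The round trip loses type information (numbers return as strings), which hints at why a more compact notation can cost accuracy when a model has to parse or emit it.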
81. 【2605.29670】EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL
Link: https://arxiv.org/abs/2605.29670
Authors: Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: sufficient schema context, step in large-scale, ambiguous databases, Schema linking, difficult and important
Comments:
Click to view abstract
Abstract:Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.
82. 【2605.29668】GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
Link: https://arxiv.org/abs/2605.29668
Authors: Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: LLM agents acting, structured environments fail, LLM agents, acting in structured, fail in operational
Comments:
Click to view abstract
Abstract:LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
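The acceptance rule described in this abstract (admit a candidate skill only if it yields a net improvement on a held-out probe, subject to a hard regression budget) might look roughly like the sketch below. The function name `accept_candidate`, the per-task boolean outcome lists, and the default budget of zero are assumptions for illustration; the abstract does not give the paper's exact criterion.

```python
from typing import Sequence


def accept_candidate(before: Sequence[bool], after: Sequence[bool],
                     regression_budget: int = 0) -> bool:
    """Decide whether to admit a candidate edit to the skill library.

    before[i] / after[i] say whether probe task i succeeded without and
    with the candidate. The edit is admitted only if regressions stay
    within the budget and fixes outnumber regressions.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same probe tasks")

    # Tasks that failed before and pass now.
    fixes = sum(1 for b, a in zip(before, after) if not b and a)
    # Tasks that passed before and fail now.
    regressions = sum(1 for b, a in zip(before, after) if b and not a)

    # Hard budget: too many regressions rejects the edit outright,
    # no matter how many tasks it fixes.
    if regressions > regression_budget:
        return False
    # Net improvement: strictly more fixes than regressions.
    return fixes - regressions > 0


# One fix, no regression: admitted.
ok = accept_candidate([True, False, False], [True, True, False])
# One fix but one regression with a zero budget: rejected.
rejected = accept_candidate([True, False], [False, True])
```

Checking the budget before the net gain is what keeps a note that fixes one trajectory from silently regressing another, which is the failure mode the abstract attributes to ungated methods.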
83. 【2605.29667】Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
Link: https://arxiv.org/abs/2605.29667
Authors: Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, troubling pattern emerges, Large Language, Chinese-language settings, deployed in Chinese-language
Comments:
Click to view abstract
Abstract:When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.
84. 【2605.29659】Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
Link: https://arxiv.org/abs/2605.29659
Authors: Ihor Stepanov, Aleksandr Smechov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Real-time safety filtering, applications requires classifiers, covert harmful content, genuinely covert harmful, large language model
Comments: 23 pages, 4 figures, 9 tables
Click to view abstract
Abstract:Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.
85. 【2605.29648】Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
Link: https://arxiv.org/abs/2605.29648
Authors: Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng
Categories: Computation and Language (cs.CL)
Keywords: Applying reinforcement learning, knowledge-intensive question answering, question answering faces, Applying reinforcement, reward design dilemma
Comments:
Click to view abstract
Abstract:Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
86. 【2605.29638】Classification of non-analyzable word types in web documents to implement an effective Korean e-learning system
Link: https://arxiv.org/abs/2605.29638
Authors: Sang-Taek Park, Ae-Lim Ahn, Eric Laporte, Jee-Sun Nam
Categories: Computation and Language (cs.CL)
Keywords: E-learning systems, Korean e-learning systems, deliver contents, contents that reflect, reflect various phenomena
Comments:
Click to view abstract
Abstract:E-learning systems should deliver content that reflects various phenomena of the language as it is used. In addition to formal Korean, e-learning systems that include real-world Korean expressions, such as those in web documents, mobile text messages, or Twitter posts, would be useful to high-level learners. We construct two types of corpora: one is made of formal documents like online news articles; the other is made of informal documents like customer reviews about new products in web blogs. By comparing these corpora, we show how expressions differ between the two types. We survey the main characteristics of the informal corpus. Given that a significant proportion of text is informal, we propose Local Grammar Graphs (LGG) as an appropriate model to treat such expressions effectively in Korean e-learning systems.
87. 【2605.29637】Evaluating Cross-lingual Knowledge Consistency in Code-Mixed vis-a-vis Indian Languages using IndicKLAR
Link: https://arxiv.org/abs/2605.29637
Authors: Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Aditya Joshi, Akshay Agarwal, Jasabanta Patro
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Indian language inputs, Large language, scheduled Indian languages, Indian languages
Comments: 23 pages
Click to view abstract
Abstract:Large language models recall knowledge reliably in English but often fail on the same query posed in a lower-resourced language -- a crosslingual consistency gap that remains underexplored for Indian languages and their code-mixed counterparts. To study this gap, we introduce IndiKLAR, an Indic extension of the KLAR-CLC benchmark covering 18 of the 22 scheduled Indian languages and pairing them with code-mixed variants for 11 widely used language pairs, with native-speaker verification of both monolingual and code-mixed variants for these 11 settings. This three-way alignment offers a unique opportunity to examine how knowledge recall consistency varies across the spectrum of English, code-mixed, and native Indian language inputs. Evaluating across nine open-weight models, we find that the native-language accuracy gap to English can reach $\sim$0.50, while code-mixed inputs close most of it -- bringing performance within $\sim$0.05 of English without any model-level intervention. Motivated by this, we evaluate several prompting strategies that vary in how language conversion is exposed, including a two-stage translate-then-answer setup, a one-stage joint translation-and-answer prompt, and Translate-in-Thought (TinT) -- a single-step strategy in which the model converts the input internally and emits only the final answer. Across the performance trajectory native $\rightarrow$ code-mixed $\rightarrow$ English, we identify a consistent flip point -- the boundary between incorrect and correct prediction -- that lies between the native and code-mixed settings. Interestingly, this holds whether the trajectory is induced by the input surface form or by the model's internal conversion process.
88. 【2605.29631】Predicting Causal Effects from Natural Language Queries using Structured Representations
Link: https://arxiv.org/abs/2605.29631
Authors: Giuliano Martinelli,Piriyakorn Piriyatamwong,Abelardo Carlos Martinez Lorenzo,Jasmin Baier,Riccardo Orlando,Satvik Garg,Sharif Kazemi,Linxi Wang,Arianna Legovini,Samuel Fraiberger
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Randomized controlled trials, enable reliable estimates, Randomized controlled, controlled trials, cornerstone of medicine
Comments: 18 pages
Abstract:Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, with absolute error reduced by 27% to 71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.
89. 【2605.29630】Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory
Link: https://arxiv.org/abs/2605.29630
Authors: Youwang Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: confounding lexical leakage, distractor entity overlap, agent-memory benchmarks report, uncontrolled query, single hit
Comments: 48 pages with appendix; 6-page body, mandatory Limitations, References, and 7 appendices. Code, benchmarks, and 37 reproduce scripts: [this https URL](https://github.com/youwangd/engram) (see paper/REPRODUCIBILITY.md). Apache 2.0
Abstract:End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
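The abstract above reports paired-bootstrap 95% CIs for lift over BM25. As a rough illustration of that general technique (not the paper's released scripts), the sketch below resamples queries jointly for two retrievers and takes a percentile interval on the mean difference. The function name `paired_bootstrap_ci`, the resample count, and the seed are hypothetical choices made for this example.

```python
import random


def paired_bootstrap_ci(hits_a, hits_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(hits_a - hits_b), resampling queries jointly.

    hits_a[i] and hits_b[i] are per-query scores (e.g. 0/1 hit@k) for the
    same query i under two retrievers. All names here are illustrative.
    """
    assert len(hits_a) == len(hits_b), "inputs must be paired per query"
    rng = random.Random(seed)
    # Pairing: keep each query's difference together when resampling.
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the resulting interval excludes zero, the lift is unlikely to be a resampling artifact under this (assumed) percentile scheme.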
90. 【2605.29628】COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
Link: https://arxiv.org/abs/2605.29628
Authors: Yonggang Zhu,Liting Gao,Aidong Men,Wenwu Wang
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Keywords: Contrastive Language-Audio Pretraining, Language-Audio Pretraining, modality gap, support modality-agnostic condition, models are widely
Comments:
Abstract:Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyzes the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially accounts for the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
91. 【2605.29626】DLM-SWAI: Steering Diffusion Language Models Before They Unmask
Link: https://arxiv.org/abs/2605.29626
Authors: Hyeseon An,Yo-Sub Han
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: desired textual properties, diffusion language models, practical deployment, language models, diffusion language
Comments: preprint
Abstract:Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
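To make "biases the token distribution ... using pre-computed token-level style scores" concrete, here is a minimal sketch for one masked position at one denoising step. It assumes the bias is additive on the logits and scaled by a single `strength` value; the abstract does not state the exact form, so both the additive rule and the names `steer_distribution`, `style_scores`, and `strength` are assumptions for illustration only.

```python
import math


def steer_distribution(logits, style_scores, strength):
    """Bias one position's token distribution toward a target style.

    Assumed form (not confirmed by the abstract): add
    strength * style_score to each token's logit, then apply softmax.
    """
    biased = [l + strength * s for l, s in zip(logits, style_scores)]
    m = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]
```

With `strength=0` the model's own distribution is returned unchanged; larger values shift more probability mass toward high-scoring tokens, which is one way the steering/fluency trade-off mentioned in the abstract could arise.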
92. 【2605.29615】DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?
Link: https://arxiv.org/abs/2605.29615
Authors: Linhao Zhang,Aiwei Liu,Yuan Liu,Xiao Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: high-level image-text alignment, made strong progress, perceive subtle visual, differences remains limited, image-text alignment
Comments:
Abstract:Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce $\textbf{DiffSpot}$, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only $40.7\%$ of true changes, with Hard-tier Recall below $23\%$ for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
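The grounding gate described above can be pictured as a simple pixel check. The sketch below is an illustration under stated assumptions, not the benchmark's code: images are plain 2D lists of pixel values, the target element is given as a hypothetical bounding box `(x0, y0, x1, y1)` with exclusive upper bounds, and a pair with no changed pixels at all is rejected (an assumption that fits has-diff pairs only).

```python
def changed_pixels(before, after):
    """Return (x, y) coordinates where two equally sized images differ."""
    return [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(before, after))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


def passes_grounding_gate(before, after, bbox):
    """Keep a pair only if every changed pixel lies inside the target bbox.

    bbox = (x0, y0, x1, y1), upper bounds exclusive (assumed convention).
    """
    x0, y0, x1, y1 = bbox
    diff = changed_pixels(before, after)
    return bool(diff) and all(
        x0 <= x < x1 and y0 <= y < y1 for x, y in diff
    )
```

A mutation that reflows neighbouring elements would change pixels outside the box and fail this check, which is the kind of pair the gate is meant to filter out.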
93. 【2605.29612】CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems
Link: https://arxiv.org/abs/2605.29612
Authors: Ziyang Ma,Dingyi Zhang,Sichu Liang,Jiajia Chu,Pengfei Xia,Hui Zang,Deyu Zhou
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Keywords: large language model, solve complex tasks, single agent systems, huge computational overheads, based multi-agent systems
Comments:
Abstract:Although large language model (LLM) based multi-agent systems (MAS) can solve complex tasks and achieve higher performance than single-agent systems, they incur huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, thereby compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
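The first step in the abstract above (cluster agents by initial answer, then pick each cluster's leader by confidence) can be sketched as follows. This assumes clusters are formed by exact answer match and that the leader is simply the most confident member; the paper may use a softer notion of agreement. The function name `select_leaders` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def select_leaders(agents):
    """Group agents by initial answer and return the leader of each group.

    agents: iterable of (name, answer, confidence) tuples.
    Returns {answer: name of the most confident agent giving that answer}.
    Exact-match clustering is an assumption made for this sketch.
    """
    clusters = defaultdict(list)
    for name, answer, confidence in agents:
        clusters[answer].append((confidence, name))
    # max() compares confidence first; ties fall back to the name.
    return {answer: max(members)[1] for answer, members in clusters.items()}
```

For example, `select_leaders([("a1", "42", 0.6), ("a2", "42", 0.9), ("a3", "7", 0.4)])` returns `{"42": "a2", "7": "a3"}`, so only two leaders (rather than three agents) would need to communicate in the later steps.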
94. 【2605.29601】Training Deliberative Monitors for Black-Box Scheming Detection
Link: https://arxiv.org/abs/2605.29601
Authors: Aditya Sinha,Akshat Naik,Victor Gillioz,Simon Storf,Kilian Merkelbach,Rich Barton-Cooper,Axel Højmark,Marius Hobbhahn
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: performing real-world tasks, benign task pursuit, distinguishing scheming behavior, control problem, real-world tasks
Comments:
Abstract:As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly $16$--$34\times$ higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost--performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
95. 【2605.29585】World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models
Link: https://arxiv.org/abs/2605.29585
Authors: Emmanuelle Bourigault
Categories: Computation and Language (cs.CL)
Keywords: Vision-language models, Vision-language, evaluations reduce performance, physical scenes, physical
Comments: 8 pages, 3 figures, 5 tables
Abstract:Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce \wmw, an evaluation framework for auditing the $\textit{language-expressed physical commitments}$ of VLMs. Instead of scoring only $I,q\mapsto a$, we ask models to produce a typed trace $I,q\mapsto(s_0,\Delta s,s_1,a)$: an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release \tracebank, a controlled trace resource with \nSeed schema- and recomputation-validated synthetic scenarios across \nFamilies physics families, \nPairs minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate \nModels VLMs on both controlled and external physical-reasoning examples. \wmw reveals failures that answer-only evaluation misses: 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% relative. The contribution is not another final-answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.
96. 【2605.29584】GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering
Link: https://arxiv.org/abs/2605.29584
Authors: Xin Sun,Jianan Xie,Zhongqi Chen,Qiang Liu,Shu Wu,Bowen Song,Weiqiang Wang,Zilei Wang,Liang Wang
Categories: Computation and Language (cs.CL)
Keywords: observe knowledge-base feedback, base question answering, agentic knowledge base, knowledge base question, Reinforcement learning
Comments:
Abstract:Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.
97. 【2605.29582】PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning
Link: https://arxiv.org/abs/2605.29582
Authors: Qikai Chang,Zhenrong Zhang,Linbo Chen,Pengfei Hu,Jianshu Zhang,Youhui Guo,Jun Du
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, provide progressive Socratic, progressive Socratic guidance, effective tutoring requires
Comments: 16 pages, 7 figures
Abstract:Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
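One way to read "discretizes rewards within each dimension and aggregates normalized advantages across dimensions" is sketched below. Everything beyond that sentence is an assumption for illustration: rewards are taken to lie in [0, 1], bins are evenly spaced, normalization is a per-dimension z-score over a group of rollouts, and dimensions are averaged with equal weight. The function names are hypothetical and this is not the paper's implementation.

```python
from statistics import mean, pstdev


def discretize(value, levels=3, low=0.0, high=1.0):
    """Map a reward in [low, high] to one of `levels` evenly spaced bins."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * levels), levels - 1)


def aggregate_advantages(rewards, levels=3):
    """rewards[i][d] is rollout i's reward on dimension d, assumed in [0, 1].

    Each dimension is discretized, z-scored across the group, and the
    per-dimension advantages are averaged with equal weight (assumption).
    """
    n_dims = len(rewards[0])
    totals = [0.0] * len(rewards)
    for d in range(n_dims):
        column = [discretize(r[d], levels) for r in rewards]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # no contrast on this dimension, so it adds nothing
        for i, value in enumerate(column):
            totals[i] += (value - mu) / sigma
    return [t / n_dims for t in totals]
```

Because each dimension is rescaled to unit variance before averaging, a noisy objective cannot outweigh the others merely by having a larger raw spread, which matches the stated goal of keeping high-variance objectives from dominating updates.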
98. 【2605.29559】LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
Link: https://arxiv.org/abs/2605.29559
Authors: Xiaoxuan Peng,Kaiqi Zhang,Xinyu Lu,Boxi Cao,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun
Categories: Computation and Language (cs.CL)
Keywords: dynamic state adaptation, requires language agents, language agents capable, environments requires language, feedback-grounded execution
Comments:
Abstract:Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.
99. 【2605.29555】From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals
Link: https://arxiv.org/abs/2605.29555
Authors: Yeyong Yu,Wenya Hu,Xing Wu,Quan Qian
Categories: Computation and Language (cs.CL)
Keywords: high-throughput experimentation advance, massive candidate sets, making reliable evaluations, Preference Signals Framework, experimentation advance
Comments: 33 pages, 5 figures
Abstract:As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowledge-Augmented Preference Signals Framework, MaterEval, that automatically produces, for the same candidate, two evaluations: an informed judgment that follows expert rules and provides supporting evidence, and a rule-removed blind guess. By pairing the two evaluations as preference data, we guide general-purpose large language models (LLMs), originally lacking materials-specific criteria, from intuitive judgment toward reliable evaluation supported by explicit evidence. To balance throughput, cost, and reliability, we further introduce a fast-slow reasoning scheme that decouples large-scale rapid screening from in-depth review on a small subset. Using high-entropy alloy (HEA) assessment as a case study, we show that, without external retrieval and relying solely on internalized capabilities, small open-source LLMs achieve substantial gains in accuracy, conclusion consistency, and evidence discrimination, approaching the performance of rule-based closed-source LLMs. These results demonstrate that expert rules can be systematically transformed into learnable preference signals, enabling a low-cost and deployable evaluation module for autonomous materials discovery loops.
100. 【2605.29543】SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring
Link: https://arxiv.org/abs/2605.29543
Authors: Qihan Deng,Minghua Zhang,Yang Yang,Zhenyu Gao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: Air Traffic Control, Traffic Control, Pilot readback, voice instructions, primary safeguard
Comments:
Abstract:Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.
101. 【2605.29511】DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration
Link: https://arxiv.org/abs/2605.29511
Authors: Yanxing Guo,Zihao Zheng,Fangzhou Wu,Ling Liang,Lin Bao,Zongwei Wang,Yimao Cai
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Tackling complex reasoning, Tackling complex, tasks typically relies, severe computational redundancy, massive monolithic LLMs
Comments:
Abstract:Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.
102. 【2605.29502】Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation
Link: https://arxiv.org/abs/2605.29502
Authors: Zeli Su,Ziyin Zhang,Zewei Pan,Zhou Liu,Dingcheng Huang,Dehan Li,Zhankai Xu,Longfei Zheng,Xiaolu Zhang,Jun Zhou,Wentao Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: standard supervised fine-tuning, source-language monolingual data, high-resource source-language monolingual, Reinforcement Learning, supervised fine-tuning
Comments:
Abstract:Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.
103. 【2605.29498】Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting
Link: https://arxiv.org/abs/2605.29498
Authors: Runze Xu,Arpit Garg,Hemanth Saratchandran,Simon Lucey
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: adapting large language, large language models, adaptation distribution differs, models original training, widely used fine-tuning
Comments: In Submission
Abstract:Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
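The regularizer described above (drop the ground-truth token from both distributions, renormalize, then apply KL over what remains) can be sketched for a single token position in plain Python. The KL direction (base relative to adapted), the `eps` smoothing, and the function names are assumptions of this sketch; the abstract does not specify them.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def drop_and_renormalize(probs, target):
    """Remove the ground-truth token and renormalize the remaining mass."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]


def masked_target_kl(base_logits, adapted_logits, target, eps=1e-12):
    """KL(base || adapted) over the non-target vocabulary only.

    The direction of the KL term is an assumption made for illustration.
    """
    p = drop_and_renormalize(softmax(base_logits), target)
    q = drop_and_renormalize(softmax(adapted_logits), target)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Raising only the target token's logit leaves this penalty at (numerically) zero, so the term does not fight the cross-entropy update; reordering the other tokens, by contrast, is penalized. In a training loop this value would be averaged over positions and added to the usual loss with some weight.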
104. 【2605.29496】On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
Link: https://arxiv.org/abs/2605.29496
Authors: Xueqing Wu,Yu-Chi Lin,Kai-Wei Chang,Nanyun Peng
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: frontier vision-language models, remain comparatively limited, Post-training has greatly, greatly improved reasoning, perception remain comparatively
Comments: Project: [this https URL](https://asymmetric-vlm-post-training.github.io/)
Abstract:Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
105. 【2605.29486】PhoneWorld: Scaling Phone-Use Agent Environments
Link: https://arxiv.org/abs/2605.29486
Authors: Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: reproducible environments covering, central bottleneck, phone-use environments, environments covering real, PhoneWorld
Comments: work in progress
Abstract:A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
106. 【2605.29476】Comparative Evaluation of Machine Translation Systems on Images with Text
Link: https://arxiv.org/abs/2605.29476
Authors: Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta
Categories: Computation and Language (cs.CL)
Keywords: natural language processing, machine translation systems, translation systems applied, textual information, work presents
Comments:
Abstract:This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.
107. 【2605.29475】MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery
Link: https://arxiv.org/abs/2605.29475
Authors: Hongran An, Zonglin Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
Keywords: Large language models, show remarkable potential, Large language, scientific hypothesis discovery, language models
Comments: Accepted to ACL 2026 (System Demonstrations)
Abstract:Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory ideation and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and regenerative feedback. Quantitative evaluations demonstrate that injecting these structured expert signals significantly outperforms purely autonomous baselines, establishing a performance ceiling under oracle guidance. Furthermore, to democratize this paradigm, we develop an intuitive web-based interface featuring interactive tree visualization. This explicitly eliminates the steep learning curve of complex command-line agentic tools, empowering interdisciplinary researchers to directly leverage, visually orchestrate, and accelerate end-to-end scientific breakthroughs.
108. 【2605.29473】Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
Link: https://arxiv.org/abs/2605.29473
Authors: Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Keywords: informal caregiving contexts, seek emotional reassurance, complex care decisions, caregivers seek emotional, relationally complex care
Comments:
Abstract:Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality--safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.
109. 【2605.29459】Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models
Link: https://arxiv.org/abs/2605.29459
Authors: Rohan Shravan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, language models route, Large language, frontier scale, language models
Comments: 28 pages, 16 tables. Reference implementation: [this https URL](https://github.com/theschoolofai/kronecker-embeddings)
Abstract:Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte-level character-position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91--94% of input-side trainable parameters at frontier scale. We provide five contributions. First, a cross-model probe across six LMs (135M-671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three-seed comparison on nanoGPT GPT-2 124M over 2.5B tokens of FineWeb-Edu shows Kronecker reaching 2.5 +- 0.2% lower validation loss than the BPE-tied baseline (gap 0.083 +- 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling-robustness probe over 110 clean/typo pairs shows Kronecker preserves the top-1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte-novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on-the-fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01--0.24% step-time overhead. Byte-level locality has a tradeoff: byte-similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
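Some of the figures quoted in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: `D_MODEL = 4096` and fp32 storage are assumptions chosen because they reproduce the quoted table size, and neither value is stated in the abstract.

```python
import math

VOCAB = 131_072          # vocabulary size quoted in the abstract
D_MODEL = 4096           # assumption: hidden size is not stated in the abstract
BYTES_PER_PARAM = 4      # assumption: fp32 storage

# Size of a dense |V| x d_model embedding table under the assumptions above.
table_gb = VOCAB * D_MODEL * BYTES_PER_PARAM / 1e9

# Average bytes per vocabulary entry implied by the quoted 4.5 MB byte buffer.
bytes_per_token = 4.5e6 / VOCAB

# A loss gap in nats maps to a perplexity ratio of exp(gap).
loss_gap_nats = 0.083
ppl_ratio = math.exp(loss_gap_nats)   # baseline perplexity / Kronecker perplexity

# Top-1 preservation rates expressed as counts out of 110 clean/typo pairs.
kronecker_pairs = round(0.555 * 110)
bpe_pairs = round(0.473 * 110)

print(f"table: {table_gb:.2f} GB, buffer: {bytes_per_token:.1f} bytes/token")
print(f"perplexity ratio: {ppl_ratio:.3f}, pairs: {kronecker_pairs} vs {bpe_pairs}")
```

Under these assumptions the table comes to about 2.15 GB, and the perplexity ratio of roughly 1.087 is in the neighborhood of the quoted ~9% difference.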
110. 【2605.29458】Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment
Link: https://arxiv.org/abs/2605.29458
Authors: Ruoxi Su, Yuhan Liu, Jingyu Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: specific individual remains, individual remains challenging, contextual cues needed, Accurately simulating, individual-level decision simulation
Comments: 20 pages, 2 figures, 12 tables
Abstract:Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts -- Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.
111. 【2605.29447】Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents
Link: https://arxiv.org/abs/2605.29447
Authors: Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: hindering real-world deployment, Robustness-driven Trajectory Synthesis, propose Robustness-driven Trajectory, advanced rapidly, hindering real-world
Comments: ICML 2026 Spotlight. 36 pages, 19 figures, includes appendix
Abstract:While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at this https URL.
112. 【2605.29440】SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents
Link: https://arxiv.org/abs/2605.29440
Authors: Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: reusable textual principles, guide decision making, Retrieval-augmented LLM agents, Retrieval-augmented LLM, agents increasingly rely
Comments: 16 pages. Preprint. Under review
Abstract:Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate the skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.
113. 【2605.29434】AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing
Link: https://arxiv.org/abs/2605.29434
Authors: Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen, Yufei He, Tri Cao, Bryan Hooi, Jiaheng Zhang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Existing sentence-level watermarking, watermarking methods enhance, methods enhance robustness, Existing sentence-level, sentence-level watermarking methods
Comments: Accepted by ICML 2026
Abstract:Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
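The multi-candidate alignment idea in this abstract can be illustrated with a toy detector. The sketch below assumes Levenshtein edit distance as the alignment cost and a fixed decision threshold `max_cost_ratio`; both are stand-ins chosen for illustration, since the abstract does not state AliMark's actual cost function or decision rule, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two bit strings (an assumed alignment cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete a bit from a
                            curr[j - 1] + 1,            # insert a bit into a
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]

def best_alignment(candidates, secret):
    """Return (index, cost) of the restructured variant that aligns best."""
    costs = [edit_distance(c, secret) for c in candidates]
    best = min(range(len(candidates)), key=costs.__getitem__)
    return best, costs[best]

def is_watermarked(candidates, secret, max_cost_ratio=0.25):
    """Flag the text if the best candidate is close enough to the secret bits."""
    _, cost = best_alignment(candidates, secret)
    return cost / len(secret) <= max_cost_ratio
```

Insertions and deletions in the cost model stand in for sentence splits and merges, which add or remove bits from the extracted sequence; a position-by-position comparison would penalize every bit after such a shift.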
114. 【2605.29430】Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Link: https://arxiv.org/abs/2605.29430
Authors: Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Automatic speech recognition, Automatic speech, increasingly important front-end, speech recognition, assistants and agents
Comments:
Abstract:Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: this https URL and the live demo is available at this https URL
115. 【2605.29427】FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions
Link: https://arxiv.org/abs/2605.29427
Authors: Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
Categories: Computation and Language (cs.CL)
Keywords: single non-compliant interaction, direct consumer harm, large language models, large language, increasingly deployed
Comments:
Abstract:As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.
116. 【2605.29421】Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design
Link: https://arxiv.org/abs/2605.29421
Authors: Shengchao Chen, Ting Shu, Sufen Ren
Categories: Computation and Language (cs.CL)
Keywords: Photonic crystal fiber, satisfy coupled optical, coupled optical targets, Photonic crystal, design remains challenging
Comments: AI4Physics@ICML 2026
Abstract:Photonic crystal fiber (PCF) inverse design remains challenging because candidate geometries must satisfy coupled optical targets under expensive electromagnetic simulation. Existing pipelines improve surrogate prediction or one-shot parameter recommendation, but they do not accumulate reusable design knowledge across iterative trials. We formulate PCF inverse design as a memory-policy learning problem and propose SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution. We further construct a real-world dataset with 479 expert interaction traces (2,507 spans) and 553 memory-dependent evaluation queries covering dispersion engineering, loss optimization, and multi-objective design. Experiments across multiple LLM backbones and classical baselines show that SkillPCF achieves stronger design-quality and efficiency trade-offs under practical simulation budgets, demonstrating the effectiveness of our proposed memory-skill learning paradigm for physics-aware PCF inverse design.
117. 【2605.29414】Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning
Link: https://arxiv.org/abs/2605.29414
Authors: Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, Recent studies, alignment in large, improve cross-lingual transfer, Recent
Comments:
Abstract:Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.
118. 【2605.29400】Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
Link: https://arxiv.org/abs/2605.29400
Authors: Rahul Bissa, Abhishek Vyas, Yash Jain
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: Pew American Trends, American Trends Panel, Trends Panel demographics, screen-anchored behavioural rationales, behavioural rationales curated
Comments: 14 pages, 7 figures, 2 tables. PiSAR corpus and fine-tuned weights are proprietary to AprioriLabs; methodology and recipe released
Abstract:We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim = 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.
119. 【2605.29397】Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework
Link: https://arxiv.org/abs/2605.29397
Authors: Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada
Categories: Computation and Language (cs.CL)
Keywords: extremely long, observations in LLM-based, remains unclear, HTML observations, Minimal Failure Set
Comments: 22 pages, 8 figures, 4 tables
Abstract:HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the high cost of end-to-end evaluation: in our experiments, evaluating 11 methods across 32 configurations on 33 tasks of WorkArena L1 required 232.4 cumulative hours. To address this, we propose a lightweight evaluation framework based on the Minimal Failure Set (MFS), the minimal set of HTML elements whose removal causes task failure. We define coverage as the fraction of instances in which a reduction method fully retains the MFS, which serves as a proxy metric that requires neither web access nor LLM inference. We validate that coverage strongly correlates with end-to-end success rate, with over 100× speedup in cumulative evaluation time on both benchmarks. Using this framework, we find that extractive HTML reduction methods require either high computation cost or domain-specific optimization to reduce agent latency while maintaining performance. Building on this, we optimize a pruning program on MFS training data, achieving 2.2× faster per-step latency on WorkArena L1 while retaining 84% of the original success rate, and 3.1× faster on WebLinx while retaining 89%.
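The coverage metric defined in this abstract is simple enough to write down directly. The following is a minimal sketch, assuming each instance is available as a pair of element-ID collections (what the reduction method kept, and the instance's MFS); this data layout and the name `mfs_coverage` are illustrative and not taken from the paper.

```python
def mfs_coverage(instances):
    """Fraction of instances whose Minimal Failure Set is fully retained.

    instances: iterable of (retained_ids, mfs_ids) pairs, where each element
    is a collection of HTML element identifiers (hypothetical data layout).
    """
    instances = list(instances)
    if not instances:
        raise ValueError("need at least one instance")
    # An instance counts only if every MFS element survives the reduction.
    kept = sum(1 for retained, mfs in instances if set(mfs) <= set(retained))
    return kept / len(instances)

example = [
    ({"a", "b", "c"}, {"a", "b"}),  # MFS fully retained
    ({"a"}, {"a", "d"}),            # element "d" was pruned away
    ({"x", "y"}, {"y"}),            # MFS fully retained
    ({"x"}, {"x"}),                 # MFS fully retained
]
print(mfs_coverage(example))
```

The check is a pure set comparison, which is why the metric needs neither web access nor LLM inference.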
120. 【2605.29392】Offloading Score: Measuring AI Reliance Through Counterfactual Workflows
Link: https://arxiv.org/abs/2605.29392
Authors: Vishakh Padmakumar, Lujain Ibrahim, Zora Zhiruo Wang, Jennifer Wang, Q. Vera Liao, Diyi Yang
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Keywords: offloading score, increasingly integrated, integrated into real-world, reliance, score
Comments: Preprint
Abstract:AI tools are increasingly integrated into real-world workflows. However, existing measures of reliance on these tools focus on AI output adoption or on self-reported indicators, rather than how task effort is distributed between users and tools. Here, we introduce offloading score, a measure of reliance that quantifies the fraction of cognitive effort offloaded to an AI tool. Offloading Score is simulation-based -- we construct a counterfactual workflow by estimating how the user would have completed the task without the tool, and then computing the fraction of steps saved by using the tool. We validate offloading score through intrinsic evaluations of metric validity, and a controlled user study ($n=40$) with developers performing programming tasks using AI tools. We vary time pressure to test whether reliance measures capture the known increase in reliance under time pressure. We show that offloading score detects significantly higher reliance in time-constrained settings ($+43\%$, $p=0.018$), while usage-based and self-reported baseline measures of reliance do not distinguish the conditions. We complement this with descriptive insights showing that higher reliance manifests as greater delegation of subtasks to the tool and more direct reuse of AI outputs. Finally, we demonstrate an approach of using offloading score in combination with target outcomes of a task (e.g., code understanding) to identify when reliance may be (in)appropriate. Our framework offers two contributions: an instrument users can apply to measure and reflect on their own reliance, and a quantitative signal that agent designers can utilize to mitigate overreliance.
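The core quantity in this abstract, the fraction of steps saved relative to a counterfactual workflow, can be written as a one-line ratio. The sketch below assumes both workflows have already been reduced to step counts and clamps negative values to zero; the clamping and the function name are assumptions, and the paper's simulation procedure for estimating the counterfactual workflow is not reproduced here.

```python
def offloading_score(counterfactual_steps, actual_user_steps):
    """Fraction of the counterfactual (no-tool) steps the user did not perform.

    counterfactual_steps: estimated steps to finish the task without the tool.
    actual_user_steps: steps the user performed with the tool available.
    """
    if counterfactual_steps <= 0:
        raise ValueError("counterfactual workflow must contain at least one step")
    saved = counterfactual_steps - actual_user_steps
    # Assumption: a user who took more steps than the counterfactual scores 0.
    return max(0.0, saved / counterfactual_steps)

print(offloading_score(20, 5))
```

A score near 1 means almost all of the estimated effort was delegated to the tool, and a score near 0 means the tool saved the user little work.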
121. 【2605.29384】Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies
Link: https://arxiv.org/abs/2605.29384
Authors: Benjamin Clavié, Sean Lee, Aamir Shakir, Makoto P. Kato
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: retrieval-ready sparse features, propose Latent Terms, learn representations, Latent Terms, trivially be decomposed
Comments:
Abstract:We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms is able to match or outperform single-vector scoring methods from its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, but that other methods can nonetheless be leveraged.
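To make the "BM25 over a latent vocabulary" idea concrete, the sketch below scores documents represented as bags of integer latent IDs with a standard BM25 formula. It assumes the sparse-autoencoder step has already produced those IDs; how activations are turned into term counts is not described in the abstract, so the integer-ID representation, the parameter defaults, and the function name are all illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of latent-term IDs) against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each latent term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

# Hypothetical latent IDs standing in for active SAE features.
docs = [[3, 17, 17, 42], [5, 8, 42], [1, 2, 9]]
scores = bm25_scores([17, 42], docs)
```

BM25's inverse-document-frequency weighting relies on a skewed term distribution, which is presumably why the approximately Zipfian statistics reported in the abstract matter for this kind of scoring.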
122. 【2605.29379】BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base
Link: https://arxiv.org/abs/2605.29379
Authors: Rohan Shravan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: byte-level BPE tokenizer, byte-level BPE, Brahmic Unicode blocks, Brahmic compression gap, class while preserving
Comments: 24 pages, 15 tables, 3 code listings. Tokenizer artifact, verification scripts, and reproduction code at [this https URL](https://huggingface.co/theschoolofai/BrahmicTokenizer-131K) and [this https URL](https://github.com/theschoolofai/BrahmicTokenizer-131K)
Abstract:We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at this https URL.
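The relationship between the token savings and compression ratios quoted in this abstract follows from simple arithmetic, sketched below. The helper names are illustrative; the numbers are taken from the abstract, and fertility is computed here as tokens per word, matching the units the abstract uses.

```python
def compression_ratio(token_savings):
    """Baseline tokens divided by new tokens, given the fraction of tokens saved."""
    return 1.0 / (1.0 - token_savings)

def fertility(num_tokens, num_words):
    """Average number of tokens produced per word."""
    return num_tokens / num_words

odia_ratio = compression_ratio(0.7679)    # 76.79% fewer tokens on Odia
tamil_ratio = compression_ratio(0.1579)   # 15.79% fewer tokens on Tamil

# Tokens removed by the script-prune crop from o200k_base.
removed_by_crop = 200_019 - 131_072

# Relative gap between the two quoted English fertility values.
english_gap = (1.235 - 1.232) / 1.232

print(f"Odia: {odia_ratio:.2f}x, Tamil: {tamil_ratio:.2f}x")
print(f"cropped tokens: {removed_by_crop}, English gap: {english_gap:.2%}")
```

A 76.79% reduction in tokens corresponds to the quoted 4.31x ratio because the new tokenizer emits only 23.21% as many tokens as the baseline.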
123. 【2605.29368】SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow
Link: https://arxiv.org/abs/2605.29368
Authors: Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: support collaborative decision-making, extensive patient records, entire perioperative workflow, modern surgical care, surgical care necessitates
Comments: preprint
Click to view abstract
Abstract:The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.
124. 【2605.29367】Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification
Link: https://arxiv.org/abs/2605.29367
Authors: Joy Bose
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Keywords: workers lose jobs, AI-driven restructuring, Twitter, workers lose, lose jobs
Comments: 18 pages, 3 figures, 9 tables
Click to view abstract
Abstract:When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.
125. 【2605.29365】Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset
Link: https://arxiv.org/abs/2605.29365
Authors: Hyojeong Yu, Hyukhun Koh, Minsung Kim, Kyomin Jung
Subjects: Computation and Language (cs.CL)
Keywords: symmetric bidirectional task, Formality transfer, Formality
Comments: HEAL@CHI 2026 Workshop Paper
Click to view abstract
Abstract:Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human rewrites encode relative stylistic shifts rather than absolute human notions of formality. Consequently, models learn to generate pseudo-formal outputs that satisfy benchmark labels while failing to produce genuinely formal language. We quantify this misalignment by re-evaluating benchmark formal labels under a human-aligned definition of formality, revealing substantial discrepancies that propagate to consistent informal-to-formal failures across model families. To address this issue, we reconceptualize formality transfer as a graded dimension rather than a binary attribute. We introduce a three-level spectrum: informal, casual, and formal, where casual serves as an explicit intermediate state that clarifies supervision signals. Based on this framework, we introduce 3LF, a dataset providing parallel supervision across all three levels. Training on 3LF substantially reduces informal-to-formal failures and improves alignment with human perception. For example, GPT-4.1-nano improves from 0.06 to 0.88 F1 in the informal-to-formal direction despite 3LF being significantly smaller than GYAFC. We further demonstrate that these gains cannot be reproduced through in-context learning alone and provide qualitative analyses of ambiguity-driven errors and meaning distortions. Overall, our findings demonstrate how supervision design shapes stylistic alignment and highlight the importance of alignment-aware benchmark construction in controllable text generation.
126. 【2605.29343】Draft-OPD: On-Policy Distillation for Speculative Draft Models
Link: https://arxiv.org/abs/2605.29343
Authors: Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng
Subjects: Computation and Language (cs.CL)
Keywords: accelerates large language, decoding accelerates large, large language model, language model inference, lightweight draft model
Comments:
Click to view abstract
Abstract:Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models such as EAGLE-3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over $5\times$ lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
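To make "acceptance length" concrete, here is a minimal sketch of one speculative decoding step with greedy verification. It is not the Draft-OPD training procedure; `draft_next` and `target_next` are hypothetical callables that return the next token for a given context, and a real system would run the target's verification in a single parallel forward pass rather than a Python loop.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[List[int]], int]


def accepted_length(proposal: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that agree with the target's greedy choices."""
    n = 0
    for d, t in zip(proposal, verified):
        if d != t:
            break
        n += 1
    return n


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    block_size: int,
) -> Tuple[List[int], int]:
    """Return (tokens emitted this step, number of accepted draft tokens)."""
    # The drafter proposes a block autoregressively under its own policy.
    ctx = list(prefix)
    proposal: List[int] = []
    for _ in range(block_size):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # The target's greedy choice at every position of the proposed block,
    # plus one extra position used as a bonus token if everything is accepted.
    verified = [
        target_next(list(prefix) + proposal[:i]) for i in range(block_size + 1)
    ]

    n = accepted_length(proposal, verified)
    # Accepted draft tokens, then the target's own token at the first
    # mismatch (or the bonus token). The output equals greedy target decoding.
    return proposal[:n] + [verified[n]], n
```

In this sketch, index `n` is the first position where the drafter disagrees with the target, which is the kind of verification-exposed error position the abstract describes replaying from.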
127. 【2605.29341】WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Link: https://arxiv.org/abs/2605.29341
Authors: Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: large language models, Multimodal large language, decision time, large language, language models
Comments: 25 pages, 8 figures
Click to view abstract
Abstract:Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
128. 【2605.29340】A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities
Link: https://arxiv.org/abs/2605.29340
Authors: Kenji Imamura, Masao Ideuchi, Atsushi Fujita
Subjects: Computation and Language (cs.CL)
Keywords: LLM safety evaluation, discuss question-answer dataset, dataset for LLM, LLM safety, safety evaluation
Comments: 10 pages, 1 figure
Click to view abstract
Abstract:In this paper, we discuss a question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of a manual analysis of AnswerCarefully, we introduce several pieces of additional information, methods for creating question-answer examples, and a rubric for evaluating LLM-generated responses. The outcomes of this study are intended to be shared with the "JAI-Trust" project.
129. 【2605.29336】Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding
Link: https://arxiv.org/abs/2605.29336
Authors: Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe
Subjects: Computation and Language (cs.CL)
Keywords: Improving the quality, remains a challenge, quality of model-generated, Minimum Bayes Risk, source content
Comments: Accepted to ACL 2026 Findings
Click to view abstract
Abstract:Improving the quality of model-generated summaries, especially factuality, the accuracy of a summary with respect to its source content, remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM that reranks candidate summaries by considering two factors: consistency with the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while consistency is ensured by employing factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at this https URL.
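The idea of combining consensus (MBR over candidates) with consistency (agreement with the source) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the token-overlap F1 utility, the source-precision proxy for factuality, and the linear `weight` are placeholders, not the metrics or the combination rule used by ConSUM.

```python
from collections import Counter
from typing import List


def overlap_f1(a: str, b: str) -> float:
    """Bag-of-words F1 between two texts, used as a toy MBR utility."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    common = sum((ta & tb).values())
    if common == 0:
        return 0.0
    precision = common / sum(ta.values())
    recall = common / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def source_precision(summary: str, source: str) -> float:
    """Fraction of summary tokens that also occur in the source (crude proxy)."""
    tokens = summary.lower().split()
    source_vocab = set(source.lower().split())
    if not tokens:
        return 0.0
    return sum(t in source_vocab for t in tokens) / len(tokens)


def rerank(candidates: List[str], source: str, weight: float = 0.5) -> str:
    """Pick the candidate with the best mix of consensus and consistency."""
    scores = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        # MBR-style consensus: expected utility against the other candidates.
        consensus = (
            sum(overlap_f1(cand, o) for o in others) / len(others) if others else 0.0
        )
        consistency = source_precision(cand, source)
        scores.append(weight * consensus + (1.0 - weight) * consistency)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

With `weight=1.0` this reduces to plain MBR selection, and with `weight=0.0` to source-only reranking, which is the baseline the abstract argues is insufficient on its own.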
130. 【2605.29327】Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization
Link: https://arxiv.org/abs/2605.29327
Authors: Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: compresses large language, large language models, structured pruning parameters, tuning lightweight modules, Reasoning-preserved Efficient Distillation
Comments:
Click to view abstract
Abstract:Efficient Distillation (EDistill) compresses large language models (LLMs) by structurally pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
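The abstract does not spell out its eRank formula. A common entropy-based definition of effective rank, assumed here, is the exponential of the Shannon entropy of the normalized singular values. The sketch below takes singular values as input (computing them from a matrix is left out to keep it dependency-free) and shows how an uneven spectrum lowers the effective rank.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(entropy) of the singular values normalized to sum to one.

    Assumes the entropy-based definition of effective rank; the paper
    summarized above may use a different variant.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A flat spectrum uses all four directions equally.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
# One dominant singular value: representations are close to rank one.
print(effective_rank([10.0, 0.1, 0.1, 0.1]))
```

A channel-selection matrix (a subset of identity columns) has all nonzero singular values equal to one, so by itself it leaves this quantity at its maximum for the reduced width.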
131. 【2605.29324】STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments
Link: https://arxiv.org/abs/2605.29324
Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: GUI agents excel, Mobile GUI agents, Mobile GUI, GUI agents, GUI
Comments: 24 pages, 4 figures, 21 tables
Click to view abstract
Abstract:Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.
132. 【2605.29319】Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective
Link: https://arxiv.org/abs/2605.29319
Authors: Shenghao Ye, Yuxiang Wang, Yu Guo, Dong Jin, Shuangwu Chen, Jian Yang
Subjects: Computation and Language (cs.CL)
Keywords: Large Reasoning Models, incur substantial inference, substantial inference cost, inference cost due, long reasoning traces
Comments: 17 pages, 15 figures, submitted to EMNLP 2026
Click to view abstract
Abstract:Large Reasoning Models (LRMs) achieve strong performance on table reasoning tasks but incur substantial inference cost due to long reasoning traces. Stepwise model routing mitigates this issue by dynamically assigning reasoning steps to smaller or larger models. However, stepwise model routing for table reasoning remains underexplored. Through empirical analysis, we find that reasoning steps involving tables contain two types of tokens with distinct uncertainty distributions: table tokens grounded in table structure, such as cell values and headers, and text tokens representing surrounding natural-language reasoning. The uncertainty of both token types is correlated with the risk that the model makes an error in the next reasoning step. However, existing methods fail to model them separately, leading to suboptimal routing decisions. To address this, we propose EcoTab, a table-aware stepwise routing framework for efficient table reasoning. At each reasoning step, EcoTab separately estimates the uncertainties of table tokens and text tokens, maps them to next-step failure risks for the small model, and combines the two risks for routing. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency.
133. 【2605.29317】FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning
Link: https://arxiv.org/abs/2605.29317
Authors: Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee
Subjects: Computation and Language (cs.CL)
Keywords: Parameter-efficient fine-tuning, reducing trainable parameters, accuracy-oriented variants, leaving the original, received comparatively little attention
Comments: EMNLP 2026
Click to view abstract
Abstract:Parameter-efficient fine-tuning (PEFT) has largely focused on LoRA and its accuracy-oriented variants, while the original goal of reducing trainable parameters has received comparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.
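Layer selection by a diagonal Fisher score can be sketched as follows. The diagonal empirical Fisher is the expected squared gradient per parameter; how FoRA aggregates it per layer is not stated in the abstract, so averaging over parameters and examples is an assumption here, and the gradient values are made-up numbers standing in for a single pass over task data. The Stiefel-manifold training step is not covered.

```python
from typing import Dict, List


def layer_fisher_scores(grad_samples: Dict[str, List[List[float]]]) -> Dict[str, float]:
    """Mean squared gradient per layer.

    grad_samples maps a layer name to a list of per-example gradient
    vectors (flattened). Averaging over parameters is an assumption.
    """
    scores = {}
    for name, grads in grad_samples.items():
        squared = [g * g for vec in grads for g in vec]
        scores[name] = sum(squared) / len(squared)
    return scores


def select_layers(scores: Dict[str, float], k: int) -> List[str]:
    """Names of the k layers with the highest Fisher score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical gradients for three layers, two examples each.
grads = {
    "layer0": [[0.1, -0.1], [0.0, 0.2]],
    "layer1": [[1.0, -2.0], [0.5, 0.5]],
    "layer2": [[0.3, 0.3], [0.3, -0.3]],
}
print(select_layers(layer_fisher_scores(grads), k=2))
```

Adapters would then be attached only to the selected layers, which is how the parameter budget shrinks without lowering the adapter rank.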
134. 【2605.29313】PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration
Link: https://arxiv.org/abs/2605.29313
Authors: Shuyu Zhang, Yaqi Shi, Lu Wang
Subjects: Computation and Language (cs.CL)
Keywords: LLM multi-agent systems, making intermediate state, LLM multi-agent, intermediate state difficult, structured shared memory
Comments:
Click to view abstract
Abstract:LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collaboration architecture that replaces inter-agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role-specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.
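The pattern of validating a mutation before committing it can be sketched as below. This is not PatchBoard's kernel: the function, the exception, the flat `schema` mapping of paths to Python types, and the prefix-based `write_contracts` are all hypothetical simplifications. Only the `add` and `replace` operations of JSON Patch are handled, array indices are not supported, and JSON Pointer escaping (`~0`, `~1`) is ignored.

```python
import copy
from typing import Any, Dict, List


class PatchRejected(Exception):
    """Raised when a proposed mutation violates a contract or the schema."""


def apply_patch(
    state: Dict[str, Any],
    patch: List[Dict[str, Any]],
    role: str,
    write_contracts: Dict[str, List[str]],
    schema: Dict[str, type],
) -> Dict[str, Any]:
    """Validate every op, then return the new state; the input is never modified."""
    draft = copy.deepcopy(state)  # all-or-nothing: failures leave `state` untouched
    for op in patch:
        kind, path = op["op"], op["path"]
        if kind not in ("add", "replace"):
            raise PatchRejected(f"unsupported op: {kind}")
        # Role-specific write contract: the path must fall under an allowed prefix.
        if not any(path.startswith(p) for p in write_contracts.get(role, ())):
            raise PatchRejected(f"role {role!r} may not write {path}")
        # Schema constraint: the path must be declared and the value well-typed.
        expected_type = schema.get(path)
        if expected_type is None or not isinstance(op["value"], expected_type):
            raise PatchRejected(f"value for {path} violates the schema")
        *parents, leaf = path.strip("/").split("/")
        node = draft
        for key in parents:
            if not isinstance(node, dict) or key not in node:
                raise PatchRejected(f"missing parent for {path}")
            node = node[key]
        if not isinstance(node, dict):
            raise PatchRejected(f"parent of {path} is not an object")
        if kind == "replace" and leaf not in node:
            raise PatchRejected(f"cannot replace missing key {path}")
        node[leaf] = op["value"]
    return draft
```

Because every check is deterministic and each rejection names the role and path, a log of accepted and rejected patches gives the kind of attributable audit trail the abstract describes.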
135. 【2605.29310】Rubric-Guided Process Reward for Stepwise Model Routing
Link: https://arxiv.org/abs/2605.29310
Authors: Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Reasoning Models, efficiency of Large, Stepwise model routing, Large Reasoning, model routing improves
Comments: 17 pages, 9 figures, submitted to EMNLP 2026
Click to view abstract
Abstract:Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
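Why adding a process reward matters under GRPO can be seen from a small sketch of group-relative advantages. The additive combination and `process_weight` are assumptions (the abstract only says the two rewards are combined), and the reward values are made up. The normalization follows the usual GRPO recipe of subtracting the group mean and dividing by the group standard deviation; implementations differ on details such as population versus sample standard deviation.

```python
import statistics
from typing import List


def combined_rewards(
    outcome: List[float], process: List[float], process_weight: float = 0.5
) -> List[float]:
    """Outcome reward plus a weighted process reward (weighting is an assumption)."""
    return [o + process_weight * p for o, p in zip(outcome, process)]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of trajectories sampled for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four routing trajectories for one query: two reach the right answer.
outcome = [1.0, 1.0, 0.0, 0.0]
process = [0.8, 0.2, 0.6, 0.0]  # hypothetical judge scores under a rubric

print(group_relative_advantages(outcome))
print(group_relative_advantages(combined_rewards(outcome, process)))
```

With outcome rewards alone, the two correct trajectories receive identical advantages, so the router gets no signal about which sequence of routing decisions was better; the process term breaks that tie.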
136. 【2605.29307】GrepSeek: Training Search Agents for Direct Corpus Interaction
Link: https://arxiv.org/abs/2605.29307
Authors: Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Model, knowledge-intensive language tasks, shown strong promise, Language Model, knowledge-intensive language
Comments:
Click to view abstract
Abstract:Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
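The property that sharded execution returns exactly what sequential execution would can be illustrated for the simplest case, a line-local filter like `grep`. This is a sketch of the general idea, not GrepSeek's execution engine: the function names and the contiguous sharding scheme are assumptions, and it covers only commands whose output for each line depends on that line alone. Commands with cross-line state (for example `head`, `sort`, or `uniq -c`) need extra merge logic that is not shown.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List


def grep_sequential(lines: List[str], pattern: str) -> List[str]:
    """Return the lines matching the regular expression, in input order."""
    rx = re.compile(pattern)
    return [ln for ln in lines if rx.search(ln)]


def grep_sharded(lines: List[str], pattern: str, n_shards: int = 4) -> List[str]:
    """Filter contiguous shards in parallel and concatenate them in shard order."""
    size = max(1, -(-len(lines) // n_shards))  # ceiling division, at least 1
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        # Executor.map yields results in the order of the inputs, so
        # concatenating them reproduces the sequential output exactly.
        parts = list(pool.map(lambda shard: grep_sequential(shard, pattern), shards))
    return [ln for part in parts for ln in part]


corpus = [f"doc {i}: {'alpha' if i % 3 == 0 else 'beta'}" for i in range(10)]
print(grep_sharded(corpus, r"alpha", n_shards=3) == grep_sequential(corpus, r"alpha"))
```

Threads are used here only to keep the example short; they illustrate the ordering argument rather than the speedup, which in practice would come from running separate processes over on-disk shards.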
137. 【2605.29300】MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs
Link: https://arxiv.org/abs/2605.29300
Authors: Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Keywords: Recent Large Audio-Language, Recent Large, Large Audio-Language Models, demonstrated promising abilities, Large Audio-Language
Comments:
Click to view abstract
Abstract:Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.
138. 【2605.29278】Accommodation Goes Both Ways: Studying Linguistic Convergence Between Humans and Language Models
Link: https://arxiv.org/abs/2605.29278
Authors: Terra Blevins
Subjects: Computation and Language (cs.CL)
Keywords: daily life, open question, increasingly integrated, integrated into daily, presence will shape
Comments:
Click to view abstract
Abstract:As LLMs become increasingly integrated into daily life, understanding how their presence will shape human linguistic behavior is an open question. We present a large-scale study of linguistic convergence in human-LLM dialogue, examining how humans and LLMs accommodate each other's linguistic style during multi-turn conversations. Using an asymmetric convergence metric on WildChat, a corpus of real-world ChatGPT transcripts, we find that while LLMs significantly overconverge toward their users on both function word and open-class features across eight languages, human convergence rates in this setting are broadly consistent with human-human baselines. These findings suggest that accommodation in human-LLM dialogue is asymmetric: while LLMs dramatically overfit to their users' style, humans linguistically accommodate LLMs no differently than they would another person.
139. 【2605.29275】Prompt-Level Reward Specifications for Open-Ended Post-Training
Link: https://arxiv.org/abs/2605.29275
Authors: Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang
Subjects: Computation and Language (cs.CL)
Keywords: make prompt-specific success, prompt-specific success conditions, success conditions explicit, Open-ended post-training benefits, post-hoc scalar scores
Comments: 39 pages, 4 figures, 16 tables
Click to view abstract
Abstract:Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
140. 【2605.29274】Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization
Link: https://arxiv.org/abs/2605.29274
Authors: Yun Wang, Xin Xia, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu
Subjects: Computation and Language (cs.CL)
Keywords: approaches near-human performance, tasks remains bottlenecked, per-item human configuration, automated scoring approaches, scoring approaches near-human
Comments: 12 pages, 5 figures
Abstract:LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.
141. 【2605.29268】Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits
Link: https://arxiv.org/abs/2605.29268
Authors: Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Keywords: LLM-guided evolutionary search, existing systems report, Evolve systems, distribution undocumented, LLM-guided evolutionary
Comments:
Abstract:LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.
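As a rough illustration of allocating a fixed budget of LLM calls across parallel trajectories with a multi-armed bandit, here is a minimal sketch. The abstract does not say which bandit policy BaSE uses, so the classic UCB1 rule is swapped in as a stand-in; `allocate_calls_ucb1`, `make_trajectory`, and the simulated fitness rewards are hypothetical names and values, not the authors' code.

```python
import math
import random


def allocate_calls_ucb1(step_fns, budget, c=1.0):
    """Spend `budget` calls across trajectories using UCB1 (an assumed policy).

    step_fns[i]() stands for one hypothetical LLM-guided evolution step on
    trajectory i and returns a fitness reward in [0, 1].
    Returns how many calls each trajectory received.
    """
    n = len(step_fns)
    pulls = [0] * n
    total = [0.0] * n
    for t in range(budget):
        if t < n:
            arm = t  # try every trajectory once before comparing them
        else:
            # mean reward so far plus an exploration bonus that shrinks with pulls
            arm = max(
                range(n),
                key=lambda i: total[i] / pulls[i]
                + c * math.sqrt(2 * math.log(t) / pulls[i]),
            )
        total[arm] += step_fns[arm]()
        pulls[arm] += 1
    return pulls


rng = random.Random(0)


def make_trajectory(mean):
    """Simulated trajectory: noisy reward around `mean`, clipped to [0, 1]."""
    return lambda: min(1.0, max(0.0, rng.gauss(mean, 0.1)))


pulls = allocate_calls_ucb1(
    [make_trajectory(m) for m in (0.2, 0.5, 0.8)], budget=300
)
print(pulls)  # the trajectory with the highest mean reward gets the most calls
```

The point of the sketch is only that allocation changes while the model, prompt, and evaluator stay fixed; a real system would replace the simulated reward with an actual fitness evaluation.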
142. 【2605.29256】DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents
Link: https://arxiv.org/abs/2605.29256
Authors: Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: extended multi-turn conversations, large language models, sustain character identity, large language, identity and interaction
Comments:
Abstract:Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.
143. 【2605.29250】OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
Link: https://arxiv.org/abs/2605.29250
Authors: Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Real-world information, property graphs, access to structurally, structurally diverse knowledge, Real-world
Comments:
Abstract:Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.
144. 【2605.29247】DenseSteer: Steering Small Language Models towards Dense Math Reasoning
Link: https://arxiv.org/abs/2605.29247
Authors: Yang Ouyang, Shuhang Lin, Jung-Eun Kim
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, multi-step reasoning tasks, Large language, demonstrate strong, significantly underperform
Comments: ICML 2026
Abstract:Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (≤ 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.
145. 【2605.29245】Implicit Identity Technologies for LLMs: Fingerprinting and Watermarking across Datasets, Models, and Generated Content
Link: https://arxiv.org/abs/2605.29245
Authors: Bing Liu, Shunping Wang, Yufan Zhu, Xinyi Yu, Jing Huang, Linkang Du, Hongbin Pei, Wei Luo
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: paper presents, identity, studying LLM identity, LLM identity technologies, LLM fingerprinting
Comments: Accepted by IJCAI-ECAI 2026. 11 pages, 1 figure. Survey and taxonomy of LLM fingerprinting and watermarking for identity, provenance, generated-content attribution, and asset protection
Abstract:This paper presents a survey and taxonomy of LLM fingerprinting and watermarking for identity, ownership verification, provenance, and generated-content attribution. Large language models (LLMs) require substantial investments in data, computation, and expertise, and are increasingly deployed in high-stakes settings, making it critical to protect LLM-related assets and trace their origins. Existing work has rapidly expanded across dataset provenance, model ownership, and generated-content detection, but the field remains fragmented: fingerprinting and watermarking are often used inconsistently, and methods are typically studied within isolated asset-specific settings. To address this gap, we introduce implicit identity as a unifying abstraction for verifiable but not directly observable identity signals in LLM systems. We distinguish fingerprinting as non-intrusive identity derived from intrinsic characteristics, and watermarking as intrusive identity deliberately embedded into data, models, or generated content. We then propose a lifecycle-based taxonomy that organises techniques across datasets, models, and generated content, and further separates them by verification semantics: similarity-based attribution and keyed verification. Finally, we establish an evaluation framework centred on identifiability, robustness, and deployability, summarising representative metrics under realistic access and transformation regimes. By unifying terminology, lifecycle stages, and evaluation objectives, this survey provides a structured foundation for studying LLM identity technologies and for developing more reliable mechanisms for asset protection and provenance.
146. 【2605.29243】Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment
Link: https://arxiv.org/abs/2605.29243
Authors: Laerdon Kim, Vivian Nguyen, Cristian Danescu-Niculescu-Mizil
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: personal attacks, Forecasting conversational derailment, eventually derail, derail into personal, conversation unfolds
Comments: To appear in the Proceedings of ACL 2026
Abstract:Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into personal attacks. Since forecasting models operate in an online fashion, they must decide whether to "trigger" an alert after each utterance--for example, to notify participants or a moderator that the conversation is at risk of derailing. Existing approaches make this decision solely based on the estimated likelihood of derailment given the preceding utterances, implicitly assuming that the conversation's future trajectory is fixed. As a result, they ignore the possibility of future recovery and incur an unnecessarily high rate of false positives. In this work we propose a method for decoupling the decision to trigger from derailment likelihood estimation. Our approach is inspired by the first human baseline on this task, which shows that humans achieve dramatically lower false positive rates by selectively deferring their decision to trigger when they anticipate that tension is likely to subside. We operationalize this insight with a deferral mechanism that uses forward-looking simulations to assess whether a tense moment admits plausible paths to recovery. Incorporating this mechanism into a state-of-the-art forecasting model substantially reduces false positives without sacrificing forecasting accuracy. More broadly, this work highlights the value of treating decision-making as a first-class component of forecasting systems.
147. 【2605.29240】Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI
Link: https://arxiv.org/abs/2605.29240
Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: AI-augmented classrooms generate, classrooms generate rich, timely instructional decisions, generate rich teacher, AI-augmented classrooms
Comments: Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning
Abstract:AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $\rho=0.80$) and student-reported topic difficulty ($\rho=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
148. 【2605.29224】Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
Link: https://arxiv.org/abs/2605.29224
Authors: Aditya Nawal, Manit Baser, Mohan Gurusamy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: augment large language, large language models, agents augment large, augment large, large language
Comments:
Abstract:AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.
149. 【2605.29218】GTA: Generating Long-Horizon Tasks for Web Agents at Scale
Link: https://arxiv.org/abs/2605.29218
Authors: Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: open web assistants, couple language models, web assistants, open web, tool-use capabilities
Comments: Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics
Abstract:Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.
150. 【2605.29192】ReasonOps: Operator Segmentation for LLM Reasoning Traces
Link: https://arxiv.org/abs/2605.29192
Authors: Daniel Lee, Owen Queen, James Zou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: thousands of tokens, span tens, tens of thousands, lack a vocabulary, vocabulary for describing
Comments:
Abstract:Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.
151. 【2605.29190】When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer
Link: https://arxiv.org/abs/2605.29190
Authors: Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: improves LLM reasoning, improves LLM, Reinforcement learning, remain under-explored, LLM reasoning
Comments: Preprint
Abstract:Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a $+7$pp \texttt{pass@32} gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further $+6$pp. However, this RL stage also suppresses exploratory primitives such as \textit{hypothesize} and \textit{backtrack}. To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further $+7$pp \texttt{pass@32} relative to vanilla GSPO. Finally, the end-to-end recipe raises the hard-math capability ceiling from $16.0\%$ at the OLMo3-7B-Instruct-SFT base to $36.0\%$, without adding any mathematics problems during the SFT or RL stages.
152. 【2605.29188】Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches
Link: https://arxiv.org/abs/2605.29188
Authors: Ting Gong, Shangquan Sun
Subjects: Computation and Language (cs.CL)
Keywords: CSS and management, entrepreneurial spirit, topic models, management research, corporate speeches
Comments: 15 pages, 2 figures, 7 tables
Abstract:Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement diagnostic for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method's per-document indices vary with leader identity holding firm constant. LDA fails (Cohen d=0.20, 95% CI [-0.72, 1.20]); a dictionary scorer reaches d=0.81 and a Chinese sentence encoder d=0.65 on doc-vector distances of order 10^-3. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast d to 1.09 (exact permutation p1=0.034). We downgrade three claims accordingly: gold F1 measures consistency with the LLM's own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM's d to 0.43 (p1=0.22), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades Delta for variance with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.
153. 【2605.29170】UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
Link: https://arxiv.org/abs/2605.29170
Authors: Volodymyr Ovcharov
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: leaving failure modes, overwhelmingly English-centric, Unified State Register, Legal NLP benchmarks, Legal NLP
Comments: 13 pages, 5 figures, 4 tables. Data: [this https URL](https://huggingface.co/datasets/overthelex/ua-legal-bench)
Abstract:Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
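The abstract's point that accuracy can mislead on imbalanced tasks can be checked with a small worked example. The sketch below uses made-up labels (not the benchmark's data) and two hypothetical predictors: one that always outputs the majority class, and one that spreads its predictions across classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced label set, invented for illustration only.
y_true = ["rejected"] * 6 + ["granted"] * 2 + ["partial"] * 2

# Predictor 1: always outputs the majority class.
majority = ["rejected"] * 10

# Predictor 2: makes more mistakes overall but covers every class.
mixed = (
    ["rejected", "rejected", "granted", "granted", "partial", "partial"]
    + ["granted", "partial"]
    + ["partial", "partial"]
)

for name, pred in [("majority", majority), ("mixed", mixed)]:
    print(name, round(accuracy(y_true, pred), 3), round(macro_f1(y_true, pred), 3))
```

On this toy data the majority predictor wins on accuracy (0.6 vs. 0.5) but loses clearly on macro-F1 (0.25 vs. roughly 0.49), which is the same pattern the abstract reports for case-outcome prediction.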
154. 【2605.29157】Parallax: Parameterized Local Linear Attention for Language Modeling
Link: https://arxiv.org/abs/2605.29157
Authors: Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, remained structurally unchanged, Local Linear Attention
Comments:
Abstract:Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention into a more compute bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.
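To make the "local constant vs. local linear" distinction concrete, here is a scalar toy example from nonparametric regression. It is not Parallax or LLA itself: the keys and values are 1-D numbers, the kernel is a Gaussian with a hypothetical `bandwidth`, and all function names are invented for illustration. The local constant estimate is a softmax-weighted average of the values (the form softmax attention takes), while the local linear estimate fits a weighted line around the query and reads off its intercept.

```python
import math


def kernel_weights(xs, q, bandwidth):
    """Softmax over negative squared distances, i.e. normalized Gaussian kernel weights."""
    logits = [-((x - q) ** 2) / (2 * bandwidth ** 2) for x in xs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def local_constant(xs, ys, q, bandwidth=1.0):
    """Kernel-weighted average of the values (Nadaraya-Watson estimate)."""
    w = kernel_weights(xs, q, bandwidth)
    return sum(wi * yi for wi, yi in zip(w, ys))


def local_linear(xs, ys, q, bandwidth=1.0):
    """Weighted least-squares fit y ~ a + b * (x - q); returns the intercept a."""
    w = kernel_weights(xs, q, bandwidth)
    s1 = sum(wi * (x - q) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - q) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - q) * y for wi, x, y in zip(w, xs, ys))
    det = s2 - s1 * s1  # weights sum to 1, so the s0 term equals 1
    if det < 1e-12:
        return t0  # degenerate case: fall back to the local constant estimate
    return (s2 * t0 - s1 * t1) / det


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # exactly linear data

# Query at the boundary, where the weighted average is pulled toward the interior.
print(local_constant(xs, ys, 4.0))
print(local_linear(xs, ys, 4.0))
```

On exactly linear data the local linear estimate recovers the true value at the boundary query (9 here, up to floating-point error), whereas the weighted average is biased toward interior points. This is one simple instance of the bias reduction the abstract refers to.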
155. 【2605.29156】RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
Link: https://arxiv.org/abs/2605.29156
Authors: Haoxiang Jiang, Zihan Dong, Tianci Liu, Wanying Wang, Ran Xu, Tony Yu, Linjun Zhang, Haoyu Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: modeling offers critical, offers critical signals, non-verifiable settings, reward modeling offers, modeling offers
Comments:
Abstract:Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
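A small sketch shows why hard Boolean aggregation produces ties and how a probability-based score can break them. The abstract does not give RUBRIC-ARROW's actual scoring rule, so the averaging rule below, the 0.5 threshold, and the per-criterion probabilities are all assumptions chosen for illustration.

```python
def boolean_score(criterion_probs, threshold=0.5):
    """Hard aggregation: each rubric criterion is pass/fail, score is the pass rate."""
    return sum(p >= threshold for p in criterion_probs) / len(criterion_probs)


def probability_score(criterion_probs):
    """Soft aggregation (assumed rule): average the judge's pass probabilities."""
    return sum(criterion_probs) / len(criterion_probs)


# Hypothetical judge probabilities that each response satisfies three criteria.
resp_a = [0.95, 0.90, 0.55]
resp_b = [0.60, 0.55, 0.51]

print(boolean_score(resp_a), boolean_score(resp_b))          # both pass every criterion
print(probability_score(resp_a), probability_score(resp_b))  # soft scores differ
```

Both responses clear every threshold, so the Boolean score ties them at 1.0, while the soft score still ranks the first response higher.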
156. 【2605.29146】SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation
Link: https://arxiv.org/abs/2605.29146
Authors: Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: key challenges, face two key, Medication recommendation predicts, recommendation predicts medications, Medication
Comments:
Abstract:Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.
157. 【2605.29123】The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
Link: https://arxiv.org/abs/2605.29123
Authors: Dueun Kim, Albert No
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Masked diffusion language, uniquely support any-order, diffusion language models, standard inference policy, facto standard inference
Comments:
Abstract:Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.
158. 【2605.29084】Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
Link: https://arxiv.org/abs/2605.29084
Authors: Yubo Li,Rema Padman,Ramayya Krishnan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: multi-author institutional corpus, paradigm cannot diagnose, mode the dominant, corpus can give, failure mode
Comments:
Click to view abstract
Abstract:A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.
159. 【2605.29076】Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text
Link: https://arxiv.org/abs/2605.29076
Authors: Tianyang Zhou,Wenbo Chen,Pierre Jinghong Liang,Leman Akoglu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: broader model transparency, offers human-readable instructions, optimization offers human-readable, prompt optimization offers, lacks broader model
Comments:
Click to view abstract
Abstract:LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.
160. 【2605.29068】Robust and Efficient Guardrails with Latent Reasoning
Link: https://arxiv.org/abs/2605.29068
Authors: Siddharth Sai,Xiaofei Wen,Muhao Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: large language models, real-world applications, large language, increasingly deployed, deployed in real-world
Comments:
Click to view abstract
Abstract:Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9X speedup and 22.4X reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.
161. 【2605.29064】Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
Link: https://arxiv.org/abs/2605.29064
Authors: Neemias da Silva,Myriam Delgado,Rodrigo Minetto,Daniel Silver,Thiago H Silva
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Keywords: prompting shapes language, shapes language generated, multimodal large language, large language models, urban perception setting
Comments: 10 pages, 6 figures
Click to view abstract
Abstract:We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas. Results indicate strong convergence in captions for different personas, whereas justifications display systematic variation associated with socioeconomic and political attributes, while perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.
162. 【2605.29062】Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies
Link: https://arxiv.org/abs/2605.29062
Authors: Abhilekh Borah
Subjects: Computation and Language (cs.CL)
Keywords: Communities can sustainably, finding of Ostrom, Ostrom theory, sustainably manage shared, cooperative norms
Comments: Paper under review
Click to view abstract
Abstract:Communities can sustainably manage shared resources (commons) through self-governance and cooperative norms, a central finding of Ostrom's theory of self-governance. However, real-world commons (e.g., fisheries, forests, and irrigation systems) are often governed under asymmetric power structures, where certain individuals or institutions possess disproportionate control over resource extraction and collective outcomes. As Large Language Models (LLMs) are increasingly explored as agents in synthetic governance simulations, understanding how LLM societies behave under asymmetric power structures is becoming increasingly important, yet existing evaluations largely ignore such asymmetries. We introduce Sovereignty over the Commons Simulation (SovSim), a generative multi-agent simulation framework that incorporates an agent with asymmetric power (boss or king) into a society of symmetric agents (workers or peasants), where all agents extract from a shared resource, collectively determining its sustainability over time. Across eleven state-of-the-art models, we find that introducing asymmetric power leads to severe breakdowns in cooperation and sustainability, with up to an 87.3% degradation in survival rate relative to symmetric settings.
163. 【2605.29054】Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence
Link: https://arxiv.org/abs/2605.29054
Authors: Linxin Song,Jiefeng Chen,Yue Huang,Bhavana Dalvi Mishra,Chi Wang,Jieyu Zhao,Jinsung Yoon,Tomas Pfister
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: Coding agents increasingly, agents increasingly act, local validation routines, satisfy surface checks, Coding agents
Comments:
Click to view abstract
Abstract:Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark for codebase conversion that reformulates conversion as transfer under a fixed equivalence contract. A fixed verifier then compares source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only 26.7--28.9% overall pass rate despite Spec pass rates up to 91.1%; a 4.7x token-budget spread yields only a 2.2x pass-rate spread; and all systems overestimate success by 66.6--97.8 points relative to the fixed evaluator. This suggests that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.
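The claim that two implementations can match on a single forward loss while diverging in gradients can be shown with a toy case. The sketch below is not the T2J-Bench verifier; the two loss functions, the chosen point, and the finite-difference helper are all assumptions made for illustration.

```python
def squared_loss(w: float, x: float, y: float) -> float:
    """Reference implementation: squared error of a one-parameter linear model."""
    return (w * x - y) ** 2


def absolute_loss(w: float, x: float, y: float) -> float:
    """'Converted' implementation that silently uses absolute error instead."""
    return abs(w * x - y)


def numeric_grad(loss, w: float, x: float, y: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)


w, x, y = 2.0, 1.0, 1.0  # residual w*x - y equals 1 at this point

print(squared_loss(w, x, y), absolute_loss(w, x, y))   # identical forward losses
print(numeric_grad(squared_loss, w, x, y))              # close to 2
print(numeric_grad(absolute_loss, w, x, y))             # close to 1
```

A check on the forward loss alone would accept the second implementation, whereas a gradient comparison at the same point would flag it, which is why a staged check that goes beyond the forward loss is stricter than a single outcome match.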
164. 【2605.29048】LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English
Link: https://arxiv.org/abs/2605.29048
Authors: Lauren Levine,Amir Zeldes
Subjects: Computation and Language (cs.CL)
Keywords: bridging resolution, bridging resolution evaluation, referential bridging resolution, Resolution Evaluation Setting, bridging resolution pipeline
Comments:
Click to view abstract
Abstract:In this paper, we introduce LLMBridge, a new LLM based system for the task of end-to-end referential bridging resolution in English. Our bridging resolution pipeline combines heuristic pre/post-processing with the natural language inference ability that comes from LLMs. We evaluate our bridging resolution pipeline on 3 datasets which have been used for referential bridging resolution evaluation in English: ISNotes, BASHI, and GUMBridge. Comparison to previous bridging resolution systems shows that the performance of LLMBridge surpasses previous state-of-the-art (SoTA) systems for all 3 datasets in the challenging End-to-end Evaluation Setting, as well as the Basic Bridging Resolution Evaluation Setting (gold bridging anaphor given). We also conduct a thorough error analysis of the LLMBridge performance, examining what varieties of bridging remain difficult for LLM based systems to identify. With this paper, we release the code for the LLMBridge pipeline.
165. 【2605.29027】Mind Your Tone: Does Tone Alter LLM Performance?
Link: https://arxiv.org/abs/2605.29027
Authors: Om Dobariya,Akhil Kumar
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: Large Language Models, Large Language, Language Models, observed to vary, vary based
Comments: 10 pages, 6 tables, 1 figure. Accepted as a full paper at the Thirty-second Americas Conference on Information Systems (AMCIS 2026), Reno. Follow-up to [arXiv:2510.04950](https://arxiv.org/abs/2510.04950)
Click to view abstract
Abstract:The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.
166. 【2605.29018】Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild
Link: https://arxiv.org/abs/2605.29018
Authors: Rebecca M. M. Hicke,Kiran Tomlinson
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Microsoft Bing Copilot, sampled Microsoft Bing, Bing Copilot users, largely static, growing body
Comments:
Click to view abstract
Abstract:Although a growing body of research has begun to describe user--LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of $\sim$12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.
167. 【2605.29007】Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation
Link: https://arxiv.org/abs/2605.29007
Authors: Xinming Yang,Jun Li
Subjects: Computation and Language (cs.CL)
Keywords: IRB constraints make, constraints make labelled, Personalized tutoring, privacy and IRB, IRB constraints
Comments:
Click to view abstract
Abstract:Personalized tutoring, teacher training, and education research need access to \emph{targeted} synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate synthetic errors at scale, but producing an arbitrary wrong answer is easy for a modern LLM while producing one that matches a specified cognitive failure mode is much harder. We present a framework that generates errors targeted to a five-class taxonomy adapted from the revised Bloom's taxonomy, evaluated on questions from the TheoremQA dataset. A Generation Agent (GA) drafts a candidate erroneous solution conditioned on a target class, and an Examination Agent (EA) judges whether the draft is incorrect and class-consistent. The framework yields a reusable recipe for building class-stratified synthetic error datasets where authentic student corpora are unavailable. As a secondary diagnostic, targeted error generation is substantially harder than free-form incorrect-answer generation, and answer-grounding contributes more than expanded examples or external textbook content.
168. 【2605.29000】xt-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction
Link: https://arxiv.org/abs/2605.29000
Authors: Yuchun Zou,Junhong Tong,Jun Li
Subjects: Computation and Language (cs.CL)
Keywords: Traditional lossless text, realistic operating regimes, Traditional lossless, lossless text compression, text compression preserves
Comments:
Click to view abstract
Abstract:Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study \emph{lossy semantic text compression}, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{\text{keep}} \in [0.1, 0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
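A frequency-guided deletion encoder of the kind the abstract calls WordFreq might look roughly like the sketch below. Several details are assumptions rather than the paper's method: frequencies are counted from the input text itself instead of a static lookup table, the most frequent words are deleted first, and deleted positions are marked with a `_` placeholder. The function name `wordfreq_delete` is hypothetical.

```python
from collections import Counter


def wordfreq_delete(text: str, r_keep: float, placeholder: str = "_") -> str:
    """Keep roughly r_keep of the words, deleting the most frequent ones first."""
    words = text.split()
    counts = Counter(word.lower() for word in words)
    n_keep = round(r_keep * len(words))
    # Rank positions from rarest to most frequent word; ties keep text order.
    ranked = sorted(range(len(words)), key=lambda i: (counts[words[i].lower()], i))
    kept_positions = set(ranked[:n_keep])
    return " ".join(
        word if i in kept_positions else placeholder
        for i, word in enumerate(words)
    )


sentence = "the cat sat on the mat and the dog sat"
print(wordfreq_delete(sentence, r_keep=0.5))
# _ cat _ on _ mat and _ dog _
```

The retained skeleton would then be handed to an LLM decoder for reconstruction; that step is outside the scope of this sketch.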
169. 【2605.28999】Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening
Link: https://arxiv.org/abs/2605.28999
Authors: Mohan Zhang,Yuqi Jia,Zhen Tan,Steven Jiang,Neil Zhenqiang Gong,Tianlong Chen,Dawn Song
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: LLMs are vulnerable, prompt injection, injection, prompt, real-world
Comments: Published in USENIX Security Symposium 2026; Code and artifacts are available at [this https URL](https://github.com/UNITES-Lab/resume-injection-measurement)
Click to view abstract
Abstract:LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.
170. 【2605.28969】Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization
Link: https://arxiv.org/abs/2605.28969
Authors: Aarik Gulaya
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: agent makes decisions, person behalf, Specification, decisions must align, representational accuracy
Comments: 134 pages, 4 figures. Code, data, judge prompts, and reproduction instructions: [this http URL](http://github.com/agulaya24/beyond-recall)
Click to view abstract
Abstract:If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.
ACM classes: I.2.7; I.2.0
171. 【2605.28966】The Trust Paradox: How CS Researchers Engage LLM Leaderboards
Link: https://arxiv.org/abs/2605.28966
Authors: Pouya Sadeghi,Anamaria Crisan,Jimmy Lin
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: Large language model, Large language, reliability and robustness, highly visible, LLM
Comments:
Click to view abstract
Abstract:Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researchers' actual practice remains empirically uncharted. We address this gap through semi-structured interviews with eight researchers across four computer science subfields, analyzed using reflexive thematic analysis. We find a near-universal paradox of pragmatic skepticism: while participants expressed deep distrust of leaderboard rankings, they continued to use them as rough decision-making aids. Peer networks, not leaderboards, emerged as the primary model selection mechanism, and arena-based (human-voting) leaderboards were consistently preferred over static benchmark leaderboards. Leaderboard influence varied sharply across subfields, revealing that disciplinary culture, not individual attitudes, mediates engagement; for instance, NLP researchers faced state-of-the-art comparison pressure while HCI and Systems/Privacy researchers reported none. Across these differences, however, participants converged on cost transparency as the most demanded missing feature (seven of eight). We translate these findings into concrete design recommendations that align evaluation infrastructure with how researchers actually use it, such as task-specific score breakdowns, cost integration, and voter-demographic disclosure.
172. 【2605.28919】CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models
Link: https://arxiv.org/abs/2605.28919
Authors: Venkat Akhil Lakkapragada
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large language models, Large language, strong reasoning capabilities, achieved strong reasoning, massive parameter counts
Comments: 17 pages, 4 figures. Exploratory study of adaptive reasoning depth in compact autoregressive language models. Code available at [this https URL](https://github.com/MistyozAI/CosmicFish-HRM)
Click to view abstract
Abstract:Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish-HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high-level and low-level reasoning cycles and learns when to halt based on input complexity. CosmicFish-HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non-uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.
173. 【2605.28913】Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models
Link: https://arxiv.org/abs/2605.28913
Authors: Xinyuan Cheng,Beiduo Chen,Philipp Mondorf,Barbara Plank
Subjects: Computation and Language (cs.CL)
Keywords: Large reasoning models, Large reasoning, producing a final, Large, reasoning
Comments: 20 pages, 17 figures
Click to view abstract
Abstract:Large reasoning models (LRMs) often generate extensive chain-of-thought (CoT) traces before producing a final answer. As explicit textual artifacts, these traces can be passed to other models to solve the same task, enabling cross-model reasoning transfer. Yet successful transfer alone does not reveal how the provided CoT contributes to another model's answer. We study this question with a controlled provider--receiver framework, where a provider generates a reasoning trace and a receiver solves the same problem from increasingly longer trace prefixes. We compare force-answer, where the receiver answers directly from the prefix, with free-generation, where it may continue reasoning before answering. Across models and benchmarks, full traces often transfer successfully, but prefix trajectories reveal distinct mechanisms. In force-answer mode, AIME transfer is largely driven by explicit answer availability. MMLU-Pro instead reflects a larger role for receiver competence, while ZebraLogic depends on partial structured-answer information rather than complete-answer leakage alone. In free-generation mode, partial CoTs improve performance across benchmarks, indicating that prefixes can guide continued reasoning. Finally, answer agreement among receivers provides a gold-free signal for stopping provider reasoning early. Overall, cross-model CoT transfer is not a single phenomenon: it can reflect answer extraction, reasoning scaffolding, or receiver-dependent competence.
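The idea of using answer agreement among receivers as a gold-free stopping signal can be sketched as follows. The abstract does not specify the rule, so the agreement threshold, the "stop at the first prefix that reaches it" policy, the function names, and the toy answers are all assumptions made here for illustration.

```python
from collections import Counter


def agreement(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of receivers giving it."""
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)


def first_stable_prefix(answers_by_prefix: list[list[str]], threshold: float = 0.75):
    """Index of the first trace prefix where receivers agree enough, plus the answer."""
    for prefix_index, answers in enumerate(answers_by_prefix):
        top_answer, fraction = agreement(answers)
        if fraction >= threshold:
            return prefix_index, top_answer
    return None  # receivers never agreed enough; use the full trace


# Toy data: four receivers answering from increasingly long prefixes of one trace.
toy_answers = [
    ["12", "7", "3", "40"],
    ["12", "12", "7", "40"],
    ["12", "12", "12", "40"],
    ["12", "12", "12", "12"],
]
print(first_stable_prefix(toy_answers))  # (2, '12')
```

No reference answer is consulted at any point, which is what makes the signal gold-free; whether agreement coincides with correctness is an empirical question the sketch does not address.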
174. 【2605.28910】Hallucination Detection-Guided Preference Optimization for Clinical Summarization
Link: https://arxiv.org/abs/2605.28910
Authors: Shamanth Kuthpadi Seethakantha,Dung Ngoc Thai,Vara Prasad Gudi,Simran Tiwari,Rami Matar,Avijit Mitra,Wenlong Zhao,Wael Salloum,Andrew McCallum
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: specialized healthcare applications, Large language models, Large language, healthcare applications, shown promise
Comments:
Click to view abstract
Abstract:Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce \itermodelfull (\itermodel), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV. For example, \itermodel reduces 24% and \model reduces 48% hallucinations in Llama-3.1-8B-Instruct. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.
175. 【2605.28882】GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
Link: https://arxiv.org/abs/2605.28882
Authors: Yihang Lin,Yunze Gao,Zeyang Lin,Dongbo Li,Kun Peng,Chenglong Song,Yue Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Keywords: large language models, increasingly important, rapid advancement, advancement of large, large language
Comments:
Click to view abstract
Abstract:With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-like is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. With minimal human seed annotations as the first mover, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution, expanded through new seeds when the evaluation target moves. Applied to human-likeness evaluation in open-ended conversation, the generated rubrics not only substantially outperform existing methods in alignment with human judgments, but also uncover issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.
176. 【2605.28874】From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization
Link: https://arxiv.org/abs/2605.28874
Authors: Yutong Qu,Wei Zhang
Subjects: Computation and Language (cs.CL)
Keywords: conveying numerical data, numerical data insights, play a critical, critical role, role in conveying
Comments: 22 pages, 9 figures
Click to view abstract
Abstract:Charts play a critical role in conveying numerical data insights through structured visual representations. However, semantic visual understanding and numerical reasoning requirements hinder the accurate description of charts, interpreting a challenging task in chart summarization. Despite recent advancements in visual language models (VLMs), approaches lack robust mechanisms for verifying statistical fact correctness and are computationally heavy. To address this gap, this paper explores a strategy of using zero-shot learning to motivate the lightweight VLMs to perform computational reasoning, via Python programs as intermediaries to derive valid summary statistics for chart understanding. Specifically, we introduce a novel chart-to-dictionary auxiliary task, offering a more flexible representation compared to traditional chart-to-table methods, making it particularly well-suited for integration with the Program-of-Thought (PoT) strategy. Experimental results demonstrate our strategy performs on par with existing chart summarization methods across semantic and factual metrics. Code is available on this https URL.
177. 【2605.28864】The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling
Link: https://arxiv.org/abs/2605.28864
Authors: Al Kari
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Cognitive Categorical Transformer, cognitively grounded components, grounded components derived, cognitive science, Categorical Transformer
Comments:
Abstract:The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that holds GT-Full simplicial message passing bypassed across the entire seven-phase activation schedule reaches 23.72 PPL, localizing 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103. Published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x more parameters than GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark. Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization) and the joint structural-prior result for GT-Full and PrecisionWeightedPP together support an empirical pattern termed the *structure/consistency distinction*, in which categorical priors that add new topology improve language modeling and those that enforce a consistency identity do not.
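The percentages in this abstract follow directly from the three perplexities it quotes. The short check below redoes that arithmetic; the variable names are illustrative and not taken from the paper.

```python
# Validation perplexities on WikiText-103, as quoted in the abstract.
baseline_ppl = 24.19     # identically fine-tuned GPT-2 Small
cct_ppl = 21.27          # full CCT
no_gt_full_ppl = 23.72   # ablation with GT-Full message passing bypassed

total_gain = baseline_ppl - cct_ppl        # improvement attributed to the architecture
relative_gain = total_gain / baseline_ppl  # as a fraction of the baseline perplexity
gt_full_gain = no_gt_full_ppl - cct_ppl    # perplexity lost when GT-Full is bypassed
gt_full_share = gt_full_gain / total_gain  # share of the improvement tied to GT-Full

print(f"total gain: {total_gain:.2f} PPL ({relative_gain:.0%} relative)")
print(f"GT-Full: {gt_full_gain:.2f} of {total_gain:.2f} PPL ({gt_full_share:.0%})")
```

The results round to 2.92 PPL (12%) and 2.45 PPL (84%), matching the figures in the abstract.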
178. 【2605.28860】Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
Link: https://arxiv.org/abs/2605.28860
Authors: Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: large language models, Fine-tuning large language, frequently induces catastrophic, prior capabilities, language models
Comments:
Abstract:Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy (Shenfeld et al., 2025). We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: this https URL.
179. 【2605.28854】Large language models reorganize representational geometry during in-context learning
Link: https://arxiv.org/abs/2605.28854
Authors: Hua-Dong Xiong, Li Ji-An, Robert C. Wilson, Kwonjoon Lee, Xue-Xin Wei
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Keywords: exhibit remarkable flexibility, Large language models, Large language, ICL, exhibit remarkable
Comments:
Abstract:Large language models (LLMs) exhibit remarkable flexibility: they can adapt to novel tasks from in-context examples without any parameter updates, a capability known as in-context learning (ICL). Prior work on synthetic tasks has shown that ICL can implement specific algorithms, demonstrating architectural competence, and mechanistic analyses have identified key circuits that support this behavior. However, because in-context computation -- regardless of its algorithmic form -- relies on transformations in high-dimensional representation space, it remains unclear how the geometry of that space shapes ICL effectiveness. Motivated by the neuroscience view of classification as the untangling of neural representations, we hypothesize that ICL depends on the successful online untangling of task-relevant representations. To test this idea, we study how LLMs classify in-context examples whose labels are defined by the model's own internal representations with known structure. We show that ICL performance correlates systematically with the representational structure of the underlying classification task and that successful ICL is accompanied by geometric reorganization that increases online separability. We further find that LLM behavior is well described by a prototype-like algorithm that integrates evidence while reshaping representations to support classification. These findings offer a geometric account of ICL in pretrained LLMs, establish representational geometry as a mechanistic constraint on ICL, and quantify the gap between what pretrained representations afford and what in-context learning can exploit.
180. 【2605.28848】GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models
Link: https://arxiv.org/abs/2605.28848
Authors: Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Roy George
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Deployed language models, Deployed language, retrieval layers, safety systems, non-stationary environment
Comments:
Abstract:Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.
181. 【2605.28842】Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning
Link: https://arxiv.org/abs/2605.28842
Authors: Dong Liu, Yanxuan Yu, Ying Nian Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: diverse NLP tasks, diverse NLP, reasoning chain, reasoning chain optimization, aligning model behavior
Comments:
Abstract:The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at this https URL.
182. 【2605.28840】How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
Link: https://arxiv.org/abs/2605.28840
Authors: Abel Yagubyan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Keywords: Large language model, question remains under-explored, fundamental reliability question, reliability question remains, Large language
Comments: 16 pages, 6 figures
Abstract:Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
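One way to make "same tools, same order, same arguments" concrete is to compare tool-call traces across repeated runs. The sketch below is an illustration only: the trace format, the function names, and the "agreement with the most common trace" score are assumptions, not the metrics defined in the paper.

```python
from collections import Counter

def trace_signature(trace):
    """Hashable, order-preserving signature of one run (tools plus arguments).

    Assumes each call is a dict with a "tool" name and an "args" dict
    whose values are hashable (hypothetical trace format).
    """
    return tuple((call["tool"], tuple(sorted(call["args"].items()))) for call in trace)

def exact_match_consistency(runs):
    """Fraction of runs whose full trace equals the most common trace."""
    signatures = [trace_signature(run) for run in runs]
    _, count = Counter(signatures).most_common(1)[0]
    return count / len(signatures)

def tool_order_consistency(runs):
    """Same idea, but ignoring arguments: only tool names and their order."""
    sequences = [tuple(call["tool"] for call in run) for run in runs]
    _, count = Counter(sequences).most_common(1)[0]
    return count / len(sequences)

# Three repeated invocations of the same (made-up) task.
run_a = [
    {"tool": "search_orders", "args": {"customer_id": 42}},
    {"tool": "issue_refund", "args": {"order_id": 7, "amount": 19.99}},
]
run_b = [
    {"tool": "search_orders", "args": {"customer_id": 42}},
    {"tool": "issue_refund", "args": {"order_id": 7, "amount": 20.0}},
]
runs = [run_a, run_a, run_b]

print(exact_match_consistency(runs))  # two of three runs agree on everything
print(tool_order_consistency(runs))   # all runs agree on tool order
```

Separating the two scores shows whether disagreement comes from tool selection or only from argument values, which matters when a call has side effects.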
183. 【2605.28838】Specialty-Specific Medical Language Model for Immune-Mediated Diseases
Link: https://arxiv.org/abs/2605.28838
Authors: Veysel Kocaman, Gursev Pirge, Yigit Gul, Ace Vo, Zhenya Nargizyan, David Talby
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: free-text medical narratives, medical narratives remains, Natural Language Processing, Extracting detailed clinical, general-purpose Natural Language
Comments: 15 pages, 5 figures. Funded in part by NIAID/NIH under contract 75N93024C00010
Abstract:Extracting detailed clinical information from free-text medical narratives remains a practical challenge for researchers and healthcare systems. Terminology for immune-mediated and infectious diseases is especially inconsistent across sources, which often limits the ability of general-purpose Natural Language Processing (NLP) systems to capture the relevant biomedical concepts with sufficient granularity. We developed a domain-specific Named Entity Recognition (NER) model tailored to identify disease-related entities occurring in immunology and infectious disease contexts. We assembled and manually annotated a dataset of 371 case reports in collaboration with two clinical specialists, defining twelve entity classes covering immune-mediated and infectious conditions as well as related symptoms and clinical descriptors. We evaluated several modeling strategies, including the MedicalNER architecture with multiple healthcare-specific embeddings, a BERT-based token classification model, and zero-shot NER systems. The strongest performance was obtained with a transformer-based model trained on clinical-domain embeddings, which reached an F1 score of 0.89, consistently outperforming baseline and zero-shot approaches. The combination of specialized embeddings and expert annotation proved particularly valuable for capturing nuanced disease terminology and improving generalization across heterogeneous biomedical text. The prompted LLM baseline achieved substantially lower performance under the same evaluation protocol, reflecting difficulties in producing span-consistent outputs for fine-grained entity boundaries despite detailed prompting. The resulting model provides a structured way to analyze case reports and can support downstream tasks such as cohort identification, disease monitoring, and clinical decision support.
184. 【2605.28837】SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2605.28837
Authors: Gyumin Kim, Juhwan Park, Jaeha Kim, Seunggyun Han, Kyungrak Son, Ikbeom Jang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: demonstrated remarkable capabilities, Large Language Models, Large Language, remarkable capabilities, demonstrated remarkable
Comments: 15 pages, 2 figures, 6 tables. To appear in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026). Code available at [this https URL](https://github.com/labhai/SERC)
Abstract:While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self-correction methods attempt to address this, but often fail due to self-bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC-inspired semantic error correction for retrieval-augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise-corrupted codewords. Inspired by low-density parity-check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low-density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama-3-8B and Qwen2.5-14B. Experimental results demonstrate that SERC outperforms both intrinsic self-correction methods and strong retrieval-augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training-free, model-agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade-off between cost and fidelity in resource-constrained environments.
185. 【2605.28836】No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand
Link: https://arxiv.org/abs/2605.28836
Authors: Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: United States requires, Plain Writing Act, States requires government, Writing Act, United States
Comments:
Abstract:The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.
186. 【2605.28835】GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling
Link: https://arxiv.org/abs/2605.28835
Authors: Hao-Xiang Xu, Chong Deng, Jiaqing Liu, Wen Wang, Qian Chen, Lujia Bao, Xiangang Li, Zhen-Hua Ling
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, broad coverage, Large
Comments: Accepted by ACL 2026 Main
Abstract:Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenarios. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these issues, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.
187. 【2605.28834】Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning
Link: https://arxiv.org/abs/2605.28834
Authors: Gus Lathouwers, Wieke Harmsen, Catia Cucchiarini, Helmer Strik
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Syllabification, Dutch syllabification, Dutch orthographic syllabification, Dutch syllabification algorithms, describes the task
Comments: Published in CLIN Journal
Abstract:Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been done. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning-based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The newly developed deep-learning methods led to increased performance compared to the best found in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep-learning frameworks can be applied to languages other than Dutch.
188. 【2605.28833】Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions
Link: https://arxiv.org/abs/2605.28833
Authors: Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Automatic speech recognition, generating automatic transcriptions, child speech, generating automatic, substantially reduce manual
Comments:
Abstract:Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR-models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% [for JASMIN] and 18.1% [for DART] of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates on an utterance level (precisions of 98.3% and higher) and reducing the need for manual verification.
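For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR output, divided by the reference length. The sketch below computes it and adds a toy version of prompt-based selection. The exact-match rule and the normalization in `select_reliable` are simplifying assumptions for illustration and may differ from the selection method used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

def normalize(text):
    """Lowercase and drop punctuation so formatting differences are ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).split()

def select_reliable(prompts, asr_outputs):
    """Indices of utterances whose ASR output matches the read prompt exactly
    after normalization (an assumed, deliberately strict criterion)."""
    return [
        i for i, (prompt, output) in enumerate(zip(prompts, asr_outputs))
        if normalize(prompt) == normalize(output)
    ]

prompts = ["De kat zit op de mat.", "Het regent vandaag"]
asr_outputs = ["de kat zit op de mat", "het regen vandaag"]
print(word_error_rate("de kat zit op de mat", "de kat zit op mat"))  # one deletion over six words
print(select_reliable(prompts, asr_outputs))
```

A strict match trades coverage for precision, which is in the spirit of the abstract: only a subset of utterances is selected, but that subset has very few errors.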
189. 【2605.28832】A comparative study of transformer-based embeddings for topic coherence
Link: https://arxiv.org/abs/2605.28832
Authors: Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Latent Dirichlet Allocation, Natural Language Processing, Dirichlet Allocation, Latent Dirichlet, word co-occurrence patterns
Comments:
Abstract:Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that model size (in terms of the number of parameters) has a significant impact on the performance of language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve comparable performance to larger models.
190. 【2605.28831】S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
Link: https://arxiv.org/abs/2605.28831
Authors: Encheng Su, Jinouwen Zhang, Jianyu Wu, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Yizhou Wang, Xinzhu Ma, Shixiang Tang, Aoran Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: earlier events reliably, accumulate large trajectory, large trajectory histories, events reliably, accumulate large
Comments:
Abstract:Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld). Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.
191. 【2605.28830】Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation
Link: https://arxiv.org/abs/2605.28830
Authors: Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Keywords: Large Language Models, Large Language, robust content moderation, moderation becomes essential, Language Models
Comments:
Abstract:As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
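The abstract's emphasis on recall can be illustrated with a small example. The sketch below computes recall and precision for a binary guard model on made-up labels; the data and function names are hypothetical and are not drawn from the benchmark.

```python
def recall(labels, predictions):
    """Share of truly unsafe samples that the guard flagged."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    false_neg = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return true_pos / (true_pos + false_neg)

def precision(labels, predictions):
    """Share of flagged samples that were truly unsafe."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)
    return true_pos / (true_pos + false_pos)

# 1 = unsafe, 0 = safe. A conservative guard flags only the most obvious case.
labels      = [1, 1, 1, 1, 0, 0]
predictions = [1, 0, 0, 0, 0, 0]

print(recall(labels, predictions))     # 0.25: three of four unsafe samples are missed
print(precision(labels, predictions))  # 1.0: no false alarms
```

A guard like this looks perfect on precision while missing 75% of unsafe content, which is why a recall-focused evaluation can rank models very differently from an accuracy- or precision-focused one.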
192. 【2605.28829】Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
Link: https://arxiv.org/abs/2605.28829
Authors: Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: require multi-step symbolic, deep conceptual understanding, NEET require multi-step, precise numerical computation, Competitive STEM examinations
Comments:
Abstract:Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).
193. 【2605.28828】Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models
Link: https://arxiv.org/abs/2605.28828
Authors: Yujie Feng, Jian Li, Zhihan Zhou, Pengfei Xu, Yujia Zhang, Xiaoyu Li, Xiaohui Zhou, Alan Zhao, Xi Chen, Xiao-Ming Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, achieve impressive performance, redundant retrieved contexts, amplify factual errors, chains amplify factual
Comments:
Abstract:Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity - external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure key information stays close to the outputs. We propose Micro-Macro Retrieval (M2R), a novel retrieve-while-generate framework to fill this gap. At the macro level, M2R retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. M2R is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of M2R, especially in lengthy-context settings.
194. 【2605.28827】RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment
Link: https://arxiv.org/abs/2605.28827
Authors: Jaber Jaber, Osama Jaber
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Arabic large language, language models split, large language models, Open Arabic large, Arabic-specialized decoder LLM
Comments: 12 pages, 7 tables, 4 figures, 1 algorithm. Weights: [this https URL](https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo)
Abstract:Open Arabic large language models split into two classes: sub-1B multilingual models that treat Arabic as an afterthought (Qwen2.5-0.5B, Falcon-H1-0.5B), and 7B-70B Arabic-specialized models that require a server to run (Jais, AceGPT, ALLaM, SILMA). The one published attempt at a sub-2B Arabic-specialized model, Kuwain-1.5B, never released its weights. We present RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized decoder LLM built on Qwen2.5-0.5B. The pipeline adds 27,032 Arabic tokens via mean-subtoken initialization, continues pretraining on 504M Arabic tokens on 8xH100 with FSDP, FlashAttention varlen packing, and Liger fused kernels, then applies supervised fine-tuning on 129,116 Arabic instruction pairs with response-only loss masking, direct preference optimization on 6,750 Arabic preference pairs, and weight soup merging across three checkpoints. On three lm-evaluation-harness Arabic benchmarks (COPA-ar, Arabic HellaSwag, ArabicMMLU) the merged model reaches 35.9% mean accuracy, beats every same-class open model, ties Falcon-H1-1.5B on COPA-ar (58.4%) at one-third the size, and recovers 67% of SILMA-9B's mean at 1/18 the parameters. The edge build quantizes to 398 MB (q4_k_m) and delivers 635 tokens/s at batch size 1 on a single H100 via this http URL. All code (5,555 lines across 25 scripts), weights (bf16, int8, and four GGUF quantizations), and benchmark scripts are released at this https URL.
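The "mean-subtoken initialization" step named in the abstract can be pictured with a small sketch. It assumes the term means that each newly added token starts from the average of the embeddings of the pieces the original tokenizer would split it into; the abstract does not spell this out. The toy vocabulary, the greedy tokenizer, and every name below are hypothetical and are not taken from the released code.

```python
from typing import Callable, Dict, List

# Hypothetical toy embedding table: existing subtoken -> 2-d embedding.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "ab": [1.0, 2.0],
    "c": [3.0, 4.0],
}


def toy_tokenize(word: str) -> List[str]:
    """Greedy longest-match split of `word` into known subtokens (toy stand-in)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_EMBEDDINGS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise KeyError(f"no subtoken covers {word[i:]!r}")
    return pieces


def mean_subtoken_init(
    new_token: str,
    tokenize: Callable[[str], List[str]],
    embeddings: Dict[str, List[float]],
) -> List[float]:
    """Initialize a new token's embedding as the mean of its subtoken embeddings."""
    pieces = tokenize(new_token)
    if not pieces:
        raise ValueError("new token produced no subtokens")
    dim = len(embeddings[pieces[0]])
    return [sum(embeddings[p][d] for p in pieces) / len(pieces) for d in range(dim)]


# "abc" splits into ["ab", "c"], so the new row is the average of those two rows.
new_row = mean_subtoken_init("abc", toy_tokenize, TOY_EMBEDDINGS)
```

Starting a new row from the mean of existing rows keeps it inside the region of embedding space the model already uses, which is the usual motivation for this kind of initialization over random values.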
195. 【2605.28826】From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale
Link: https://arxiv.org/abs/2605.28826
Authors: Rohan Mahapatra
Categories: Computation and Language (cs.CL)
Keywords: linguistic features function, features function, linguistic features, training alignment objectives, linguistic
Comments: 26 pages, 13 tables, 2 figures. Planning to submit to NeurIPS 2026
Click to view abstract
Abstract:In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. Language models trained with contemporary pipelines exhibit severe reshaping of linguistic features, leading to extreme language re-distribution. While previous stylometric analyses explored linguistic differences between AI-generated and human texts, we focus on the reshaping plaguing the LLM training pipeline itself. We analyze 17 models (410M-100B+ parameters) across 24 linguistically-motivated probes, documenting that instruction-tuned systems systematically collapse language entropy along discourse and structural dimensions (mean amplification: 1,949-16,853%, peaks: 5,181-209,675%), while selectively suppressing complex punctuation to 3.2-23.2% of baseline frequencies. These effects do not worsen under RLHF, as divergence patterns are statistically indistinguishable (p > 0.25) across matched base and instruction-tuned model pairs. Weak intervention (lambda=1.0) exacerbates collapse by 240%, while strong control (lambda=5.0) achieves 40.5% improvement and outperforms frontier models by 96.7-98.2% despite 200-1000x scale disadvantage. Additionally, lambda=5.0 delivers 15% higher distinct-4, 27% higher vocabulary diversity, and 78% lower repetition than moderate regularization, establishing that alignment requires sufficient control strength, not merely distributional smoothing. Our findings underscore how modern LLMs reallocate stylistic probability mass, despite RLHF and scale. More broadly, our work reveals a structural limitation of current alignment pipelines: preference optimization reshapes language distributions invisible to standard quality metrics yet detectable through distributional probes, with implications for AI detection, training data contamination, and long-term linguistic evolution.
196. 【2605.28825】MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models
Link: https://arxiv.org/abs/2605.28825
Authors: Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee
Categories: Computation and Language (cs.CL)
Keywords: Large language models, frequently encode factual, Large language, Contrastive Consistency Search, latent knowledge
Comments:
Click to view abstract
Abstract:Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs -- a phenomenon known as latent knowledge. Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to understand model behavior rather than to extract hidden knowledge. We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) Locate -- using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify -- employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit -- applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model's surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.
197. 【2605.28824】A Modular Architecture for Typologically Controlled Lexicon Generation
Link: https://arxiv.org/abs/2605.28824
Authors: Sankalp Tattwadarshi Swain, Dhruv Kumar
Categories: Computation and Language (cs.CL)
Keywords: Constructing artificial lexicons, semantically structured remains, Constructing artificial, typologically plausible, computational linguistics
Comments:
Click to view abstract
Abstract:Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines. We propose a modular framework that samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt), and assigns meanings via a Swadesh--Leipzig--Jakarta ontology with explicit form--meaning alignment. Evaluation on character $n$-gram perplexity, log-likelihood, and KL divergence against PHOIBLE across lexicon sizes of 100-5,000 forms shows that probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
198. 【2605.28823】What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs
Link: https://arxiv.org/abs/2605.28823
Authors: Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali, Jonathan Rose
Categories: Computation and Language (cs.CL)
Keywords: imperative to gain, gain insight, LLMs expands, probes, LLM
Comments:
Click to view abstract
Abstract:As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM - which is what we might say a model is "thinking" about. Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation. In this paper, we take the first steps towards developing the capability of creating many such probes by defining and executing examples of the key tasks needed: first, the careful delineation of a concept through the creation of a dataset with the concept both present and then absent. Then, the training and testing of a set of linear probes to detect the concept on any layer of an LLM, including an exploration of the complexity of the probe needed. Finally, we show that such probes can track concepts across larger contexts. This is done with four separate concepts and three different LLMs. When this process is scaled to many more concepts, it will create the ability to easily monitor new models.
199. 【2605.28822】Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment
Link: https://arxiv.org/abs/2605.28822
Authors: Tao Wang, Lipeng Zhu, Jiayong Li, Feng Gao, Siwen Liang
Categories: Computation and Language (cs.CL)
Keywords: power transmission equipment, electric energy transmission, transmission equipment, power transmission, energy transmission
Comments: 9 pages, 6 figures
Click to view abstract
Abstract:Defect grading of power transmission equipment (DGPTE) is crucial to the stability of electric energy transmission. Although existing machine learning methods exhibit strong capabilities in defect detection, they struggle to integrate expert experience and to handle class imbalance in the finer-grained task of defect grading. To address this issue, this paper introduces a novel defect grading framework based on a multimodal large language model (MLLM). Specifically, this approach maximizes the potential of commercial MLLMs on DGPTE through in-context learning and obtains the state-of-the-art (SOTA) model. By sending a secondary request to this model, a small number of chain-of-thought-based question-answer pairs (Q&As) are generated, which effectively reduces the cost of manual annotation. These high-quality, interpretable Q&As are then used to train Qwen3-VL-8B via Low-Rank Adaptation-based supervised fine-tuning (SFT). Experimental results on three DGPTE tasks demonstrate that fine-tuning only the language model layer yields the SOTA performance. Furthermore, multi-task joint fine-tuning verifies the feasibility of handling multiple grading tasks within a single lightweight MLLM.
200. 【2605.29859】MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables
Link: https://arxiv.org/abs/2605.29859
Authors: Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Keywords: Recent speech language, Recent speech, language models rely, optimized separately, Recent
Comments:
Click to view abstract
Abstract:Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.
Information Retrieval
1. 【2605.30237】GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases
Link: https://arxiv.org/abs/2605.30237
Authors: Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Semi-structured knowledge bases, academic paper search, embed textual documents, Semi-structured knowledge, product search
Comments:
Click to view abstract
Abstract:Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine inquiries. Existing hybrid retrieval systems on SKBs either use the graph only for query expansion, mix textual and structural branches under a global weighting, or rely on fine-tuned graph-traversal generators. We present GRASP, a three-stage SKB retrieval framework unifying plan-based graph retrieval, plan-conditioned fusion with a dense retriever, and a fine-tuned reranker over the fused candidates. GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9. Ablation and sensitivity studies further confirm the effectiveness and robustness of GRASP.
2. 【2605.30205】LexPath: A domain-oriented multi-path framework for legal article retrieval
Link: https://arxiv.org/abs/2605.30205
Authors: Weixuan Liu, Qingfeng Zhuge, Xuyang Chen
Categories: Information Retrieval (cs.IR)
Keywords: critical for building, building traceable, traceable and reliable, grounded in specific, specific legal articles
Comments:
Click to view abstract
Abstract:Legal article retrieval is critical for building traceable and reliable legal AI systems, where conclusions must be grounded in specific legal articles. However, existing open-domain retrieval methods rely heavily on surface-level lexical or semantic similarity, making it difficult for them to distinguish legally relevant articles from those that are textually similar but legally inapplicable or misaligned with the user's underlying intent. To bridge this gap, we propose LexPath, a domain-oriented multi-path framework comprising a multi-path retrieval module and an intent-aware reranking module. The retrieval module combines two complementary legal-specific paths to collect candidate articles: an IRAC-guided sparse path that expands queries with legally informative keywords, and a structure-guided dense path trained with hard negatives derived from legal hierarchy and citation relations. Then, the reranking module further refines the candidate ranking by incorporating the intent consistency score between queries and legal articles. We evaluate LexPath on two publicly available benchmarks focusing on general-public queries and a self-constructed benchmark targeting domain-professional scenarios. Experimental results demonstrate that LexPath consistently outperforms lexical, dense, hybrid, and adaptive retrieval-augmented generation (RAG) baselines. Ablation studies further verify the effectiveness of each component.
3. 【2605.30120】No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval
Link: https://arxiv.org/abs/2605.30120
Authors: Lixuan Guo, Yifei Wang, Tiansheng Wen, Aosong Feng, Stefanie Jegelka, Chenyu You
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: fine-grained token-level interactions, preserving fine-grained token-level, exemplified by ColBERT, token-level interactions, Multi-vector retrieval
Comments: Accepted by ICML 2026
Click to view abstract
Abstract:Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grained token-level interactions. However, this granularity imposes prohibitive storage and retrieval efficiency bottlenecks: to manage the immense memory footprint and computational overhead of billion-scale token vectors, state-of-the-art systems are forced to rely on aggressive dimension reduction and complex clustering (e.g., K-means). This compromise introduces two critical limitations: excessive indexing latency from clustering large-scale corpora and semantic information loss inherent to compression. In this paper, we propose Single-stage Sparse Retrieval (SSR), a paradigm shift that replaces expensive clustering with efficient sparse coding. Instead of compressing features into low-dimensional dense vectors, we utilize a Sparse Autoencoder (SAE) to project token embeddings into a high-dimensional but highly sparse representation. This transformation enables us to bypass vector clustering entirely and leverage inverted indexing for precise, high-throughput retrieval. Extensive experiments on the BEIR benchmark demonstrate that SSR achieves a "trifecta" of improvements: it reduces indexing time by 15x compared to ColBERTv2, halves retrieval latency, and simultaneously improves retrieval performance over leading baselines.
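To make the link between sparse codes and inverted indexing concrete, here is a minimal sketch. It is not the SSR implementation: the sparse codes are hand-written rather than produced by a Sparse Autoencoder, each document gets a single code instead of one per token, and all names are hypothetical. It only shows why a representation with few non-zero dimensions can be searched through posting lists without any clustering.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SparseCode = Dict[int, float]  # active dimension -> weight


def build_inverted_index(doc_codes: Dict[str, SparseCode]) -> Dict[int, List[Tuple[str, float]]]:
    """Map every active dimension to the documents (and weights) that use it."""
    index: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, code in doc_codes.items():
        for dim, weight in code.items():
            index[dim].append((doc_id, weight))
    return index


def search(index, query_code: SparseCode, top_k: int = 2) -> List[Tuple[str, float]]:
    """Score documents by sparse dot product, touching only shared dimensions."""
    scores: Dict[str, float] = defaultdict(float)
    for dim, q_weight in query_code.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical hand-written sparse codes (a real system would obtain them from an encoder).
docs = {
    "d1": {3: 1.0, 17: 0.5},
    "d2": {17: 2.0, 42: 1.0},
    "d3": {99: 1.0},
}
index = build_inverted_index(docs)
results = search(index, {17: 1.0, 42: 0.5})
```

Document `d3` shares no active dimension with the query, so it is never visited. That skipping is where the efficiency of an inverted index comes from.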
4. 【2605.30027】DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
Link: https://arxiv.org/abs/2605.30027
Authors: Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Multimodal documents, complicate retrieval tasks, retrieval tasks, Multimodal, complicate retrieval
Comments: Accepted at KDD 2026 Research Track
Click to view abstract
Abstract:Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.
5. 【2605.29956】Uncertainty Quantification for Multimodal Retrieval Augmented Generation
Link: https://arxiv.org/abs/2605.29956
Authors: Simon Binz, Heydar Soudani, Faegheh Hasibi
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Retrieval Augmented Generation, question answering capabilities, incorporating external knowledge
Comments:
Click to view abstract
Abstract:Retrieval Augmented Generation (RAG) improves the question answering capabilities of Large Language Models (LLMs) by incorporating external knowledge and has recently been extended to multimodal settings through Vision-Language Models (VLMs) that integrate visual and textual information. Despite these advances, generated answers can still be incorrect or misleading. Uncertainty Quantification (UQ) methods aim to estimate the reliability of model outputs, but most existing approaches are designed for text-only models and perform poorly in multimodal RAG scenarios. A key challenge is capturing uncertainty arising from multiple stages of the pipeline, including retrieval, visual understanding, and generation. In this work, we show that modeling uncertainty using multimodal and retrieval-aware probability signals improves estimation in multimodal RAG systems. We introduce LeMUQ, a Learnable Multimodal UQ method that analyzes token probabilities under input modifications, such as removing modalities or retrieved context. By encoding these signals as probability tokens and processing them with a finetuned model, our approach captures interactions between modalities and retrieval. Experiments across datasets, retrievers, and VLMs show consistent improvements over baseline and finetuned UQ methods. Our proposed LeMUQ increases the AUROC metric by 3.8% on average. Additionally, our method shows strong generalization performance across different retrieval setups and datasets with mixed results when transferring across different VLMs. Our findings highlight the importance of modeling multimodal uncertainty and provide a step toward more reliable and safer multimodal RAG systems. Code is available on GitHub.
6. 【2605.29755】Rec-Distill: An Industrial Distillation Pipeline for Large-Scale Recommendation Models
Link: https://arxiv.org/abs/2605.29755
Authors: Haoran Ding, Wenlin Zhao, Yuchen Jiang, Juren Li, Jie Zhu, Xinchun Li, Yishujie Zhao, Yi Zhang, Ao Qiao, Jianhui Dong, Cheng Chen, Ziyan Gong, Deping Xie, Peng Xu, Zikai Wang, Yuwei Wang, Huizhi Yang, Zhe Chen, Yuchao Zheng
Categories: Information Retrieval (cs.IR)
Keywords: Large recommendation models, strict serving efficiency, demonstrated substantial potential, Large recommendation, latency guarantees
Comments:
Click to view abstract
Abstract:Large recommendation models have demonstrated substantial potential gains under scaling laws, yet these gains are difficult to realize in industrial recommendation systems because real-world deployment requires lightweight models with strict serving efficiency and latency guarantees. This creates a fundamental gap between offline model scaling and online deployment. In this work, we present Rec-Distill, an industrial distillation pipeline that transfers the performance gains of large-scale recommendation modeling to efficient serving models. Rec-Distill combines large-teacher scaling with student-side transfer optimization through decoupled training, black-box distillation, debiasing mechanism, and a hybrid batch-streaming pipeline for dynamic recommendation environments. Across multiple recommendation and advertising scenarios on real-world platforms, our framework scales teacher models up to 24B dense parameters and 20K behavior sequence length, while enabling lightweight students to recover a substantial portion of teacher gains, with distillation transferability exceeding 60% in the best setting. Extensive offline and online experiments further show that these transferred gains consistently translate into measurable business improvements under industrial constraints. These results demonstrate that Rec-Distill provides a practical framework for distilling large-scale recommendation models into deployable, cost-efficient serving systems, while also establishing a reliable path toward scaling recommendation models to even larger regimes in the future.
7. 【2605.29675】From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration
Link: https://arxiv.org/abs/2605.29675
Authors: Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: leaving implicit, shaped the process, collaboration, Generative, short prompt
Comments:
Click to view abstract
Abstract:Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management. This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced.
8. 【2605.29630】Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory
Link: https://arxiv.org/abs/2605.29630
Authors: Youwang Deng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: confounding lexical leakage, distractor entity overlap, agent-memory benchmarks report, uncontrolled query, single hit
Comments: 48 pages with appendix; 6-page body, mandatory Limitations, References, and 7 appendices. Code, benchmarks, and 37 reproduce scripts: [this https URL](https://github.com/youwangd/engram) (see paper/REPRODUCIBILITY.md). Apache 2.0
Click to view abstract
Abstract:End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
9. 【2605.29606】HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
Link: https://arxiv.org/abs/2605.29606
Authors: Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Open-domain Question Answering, document-based Open-domain Question, Question Answering, Open-domain Question, integrating scattered information
Comments: Accepted to ACL 2026 Main
Click to view abstract
Abstract:Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existing approaches relying on flat text chunks or page-level images inherently struggle to (i) precisely pinpoint the target document among thousands of candidates and (ii) organically connect multimodal evidence, such as tables and figures, within a limited token budget. To address these challenges, we propose HiKEY, a hierarchical tree-based multimodal retrieval framework that elevates document hierarchy to a first-class retrieval signal. Instead of simple chunking, HiKEY reconstructs a logical heterogeneous graph via Document Hierarchical Parsing (DHP), explicitly encoding parent-child relationships. Adopting a hierarchical coarse-to-fine strategy, the framework (1) performs global routing to rapidly prune the search space using hierarchical indexing, and (2) conducts fine-grained retrieval to rank sections by employing a multimodal fusion strategy that captures the most discriminative evidence. Finally, HiKEY assembles a token-efficient evidence subgraph via a hybrid structural-semantic packing strategy. Experiments on ODQA benchmarks demonstrate that HiKEY significantly outperforms page- and chunk-based baselines, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8%.
10. 【2605.29543】SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring
Link: https://arxiv.org/abs/2605.29543
Authors: Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: Air Traffic Control, Traffic Control, Pilot readback, voice instructions, primary safeguard
Comments:
Click to view abstract
Abstract:Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.
11. 【2605.29517】FLASH-MAXSIM: IO-Aware Fused Kernels for Late-Interaction Scoring
Link: https://arxiv.org/abs/2605.29517
Authors: Roi Pony, Adi Raz Goldfarb, Idan Friedman, Daniel Ezer, Udi Barzelay
Categories: Information Retrieval (cs.IR)
Keywords: document tokens
Comments:
Click to view abstract
Abstract:Late-interaction retrieval (ColBERT, ColPali) scores a query against a document with the MaxSim operator: for every query token, the maximum similarity over the document tokens, summed over query tokens. The standard implementation materializes the full query-token x document-token similarity tensor in GPU memory; for visual ColPali at 10K documents this tensor alone is 21 GB in FP16, created only to be reduced to one score per document and discarded. It exhausts a 40 GB GPU and bounds the achievable batch size in both inference and training. We present Flash-MaxSim, an IO-aware fused GPU kernel that computes exactly the same scores without ever materializing the tensor, by streaming query and document tiles through on-chip SRAM and folding the row-maximum reduction into the same pass. We extend the IO-aware principle through the training backward pass, an inverse-grid CSR construction that reuses the forward argmax for an atomic-free, destination-owned gradient reduction, and through INT8xINT8 quantization and variable-length (padding-free) scoring. Flash-MaxSim is up to 3.9x faster on an A100 (4.7x on an H100) than naive PyTorch at matched precision, uses up to 16x less inference memory and ~28x less training memory, unlocks corpus and batch sizes that exhaust PyTorch entirely, preserves the exact ranking (100% top-20 agreement with an FP32 reference)
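The MaxSim operator described in the abstract can be written down directly. The sketch below, in plain Python with toy vectors, contrasts a naive version that builds the full query-token x document-token similarity matrix with a streaming version that walks the document in tiles and keeps only one running maximum per query token. It illustrates the idea of folding the row-maximum into the same pass. It is not the fused GPU kernel, and the function names and the `tile` parameter are hypothetical. Both functions assume a non-empty document.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query: List[Vector], doc: List[Vector]) -> float:
    """Materialize the full similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]  # |query| x |doc| entries
    return sum(max(row) for row in sim)


def maxsim_streaming(query: List[Vector], doc: List[Vector], tile: int = 2) -> float:
    """Same score, but only one running maximum per query token is stored."""
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]  # one tile of document tokens
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)


# Toy embeddings: 2 query tokens, 3 document tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
score_naive = maxsim_naive(query, doc)
score_streaming = maxsim_streaming(query, doc)
```

For the toy input, the first query token's best match is the second document token (similarity 1.0) and the second query token's best match is the third (similarity 2.0), so both functions return 3.0. The streaming version never holds more than `len(query)` partial results at a time.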
12. 【2605.29507】Xetrieval: Mechanistically Explaining Dense Retrieval
Link: https://arxiv.org/abs/2605.29507
Authors: Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia, Zilong Zheng, Wenge Rong
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: assign high relevance, high relevance scores, relevance scores remains, scores remains challenging, Xetrieval
Comments: Code: [this https URL](https://github.com/Hihiczx/Xetrieval); Project page: [this https URL](https://hihiczx.github.io/Xetrieval)
Click to view abstract
Abstract:Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at this https URL.
13. 【2605.29440】SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents
Link: https://arxiv.org/abs/2605.29440
Authors: Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: reusable textual principles, guide decision making, Retrieval-augmented LLM agents, Retrieval-augmented LLM, agents increasingly rely
Comments: 16 pages. Preprint. Under review
Click to view abstract
Abstract:Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate the skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.
14. 【2605.29384】Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies
Link: https://arxiv.org/abs/2605.29384
Authors: Benjamin Clavié, Sean Lee, Aamir Shakir, Makoto P. Kato
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: retrieval-ready sparse features, propose Latent Terms, learn representations, Latent Terms, trivially be decomposed
Comments:
Click to view abstract
Abstract:We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms is able to match or outperform single-vector scoring methods from its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, but that other methods can nonetheless be leveraged.
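To make the idea of BM25 scoring over a latent vocabulary more concrete, the sketch below treats each active sparse feature ID as a "term" and scores documents with standard Okapi BM25 (Lucene-style IDF, with the common defaults `k1=1.2` and `b=0.75`). The abstract does not say how sparse autoencoder activations become term frequencies, so the integer counts, feature IDs, and the `bm25_scores` helper here are purely hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(
    query_terms: List[int],
    docs: List[Dict[int, int]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score each document (feature ID -> count) against a list of query feature IDs."""
    n_docs = len(docs)
    avg_len = sum(sum(d.values()) for d in docs) / n_docs
    # Document frequency: in how many documents each feature ID is active.
    df = Counter(term for d in docs for term in d)
    scores = []
    for d in docs:
        doc_len = sum(d.values())
        score = 0.0
        for term in query_terms:
            tf = d.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf + k1 * (1 - b + b * doc_len / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical latent-term "documents": feature ID -> made-up activation count.
latent_docs = [{7: 3, 42: 1}, {42: 2}, {99: 4}]
print(bm25_scores([7, 42], latent_docs))
```

Because the scoring function only needs term IDs and counts, any inverted index that supports BM25 could in principle serve such features. Whether that matches the paper's actual pipeline is not stated in the abstract.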
15. 【2605.29322】ACE: Anisotropy-Controllable Embedding for LLM-enhanced Sequential Recommendation
Link: https://arxiv.org/abs/2605.29322
Authors: Dongcheol Lee, Hye-young Kim, Jongwuk Lee
Categories: Information Retrieval (cs.IR)
Keywords: paradigm leverage large, leverage large language, transfer semantically rich, Recent advances, large language models
Comments: Accepted by SIGIR 2026. 5 pages
Click to view abstract
Abstract:Recent advances in the LLM-as-Extractor paradigm leverage large language models (LLMs) to transfer semantically rich item embeddings into sequential recommendation (SR) backbones. However, LLM-generated embeddings often suffer from strong anisotropy. Most vectors are concentrated in similar directions, resulting in a geometric imbalance that makes it difficult to adapt to collaborative signals during fine-tuning. To address this challenge, we propose Anisotropy-Controllable Embedding (ACE), which explicitly controls the anisotropy of LLM-generated embeddings. Specifically, ACE utilizes a linear autoencoder (LAE) to reshape the embedding distribution while preserving its semantic structure. In this process, the L2-regularization term mitigates the anisotropy by controlling the dispersion of embedding dimensions, while the reconstruction loss maintains semantic relationships among items. That is, ACE balances geometric uniformity and semantic embedding preservation for more stable learning. Extensive experiments demonstrate that ACE consistently outperforms existing LLM-enhanced SR models, yielding improvements of up to 12.4% and 11.8% in Recall@20 and NDCG@20, respectively.
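One common proxy for the anisotropy mentioned in this abstract is the mean pairwise cosine similarity of a set of embeddings: values near 1 mean most vectors point in similar directions. The sketch below computes that proxy on two toy sets. It is an assumption that this measure is appropriate here; the abstract does not state which anisotropy measure the paper uses, and the vectors are invented.

```python
import math
from itertools import combinations
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(vectors: List[Vector]) -> float:
    """Average cosine similarity over all unordered pairs (needs at least two vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: one tightly clustered set, one spread over different directions.
concentrated = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_cosine(concentrated))  # close to 1: strongly anisotropic
print(mean_pairwise_cosine(spread))        # much lower: directions are dispersed
```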
16. 【2605.29307】GrepSeek: Training Search Agents for Direct Corpus Interaction
Link: https://arxiv.org/abs/2605.29307
Authors: Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Model, knowledge-intensive language tasks, shown strong promise, Language Model, knowledge-intensive language
Comments:
Click to view abstract
Abstract:Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
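The sharded-parallel execution idea in this abstract can be illustrated with a toy in-memory search: split the corpus into contiguous shards, search them concurrently, and concatenate the hits in shard order so the result is identical to a sequential scan. This is only a sketch under simplifying assumptions (substring matching on a list of lines, a thread pool instead of shell processes); `grep_lines` and `sharded_grep` are hypothetical names, not the paper's engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def grep_lines(lines: List[str], pattern: str) -> List[str]:
    """Sequential reference: keep every line containing the pattern, in order."""
    return [line for line in lines if pattern in line]


def sharded_grep(lines: List[str], pattern: str, n_shards: int = 3) -> List[str]:
    """Search contiguous shards in parallel and merge the hits in shard order."""
    size = max(1, -(-len(lines) // n_shards))  # ceiling division, at least 1
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        # Executor.map yields results in the order of its inputs, so
        # concatenating them reproduces the sequential output exactly.
        per_shard = pool.map(lambda shard: grep_lines(shard, pattern), shards)
        return [hit for hits in per_shard for hit in hits]


corpus = ["alpha beta", "gamma", "beta delta", "epsilon", "beta"]
assert sharded_grep(corpus, "beta") == grep_lines(corpus, "beta")
print(sharded_grep(corpus, "beta"))
```

Line-wise filters like this one shard cleanly because each line is judged independently. Commands whose output depends on global state (for example counts or sorting across the whole corpus) would need an explicit merge step, which this sketch does not cover.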
17. 【2605.29287】UniNote: A Unified Embedding Model for Multimodal Representation and Ranking
Link: https://arxiv.org/abs/2605.29287
Authors: Jinghan Zhao, Wenwei Jin, Anqi Li, Jintao Tong, Luya Mo, Jiawei Li, Bin Li, Yao Hu
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: supporting critical industrial, critical industrial workflows, modern content platforms, supporting critical, fundamental part
Comments: Accepted by KDD Ads Track 2026
Click to view abstract
Abstract:Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose UniNote, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
18. 【2605.29286】CrossAlpha: An Annual-Report Benchmark for Cross-Market Factor Research
Link: https://arxiv.org/abs/2605.29286
Authors: Qian Wang, Zhongyi Tong, Nuo Chen, Zhaomin Wu, Bingsheng He
Categories: Information Retrieval (cs.IR)
Keywords: factor research studies, Cross-market factor research, existing public benchmarks, studies whether firm-level, predict returns
Comments:
Click to view abstract
Abstract:Cross-market factor research studies whether firm-level signals from one or more markets can predict returns in a target market, but existing public benchmarks do not support cross-market disclosure-to-return evaluation. Building such a benchmark is challenging because filings differ across languages and regulatory systems, disclosure-derived similarity can be biased by common reporting components, and cross-market signals must be evaluated under feasible trading-time alignment. We introduce CrossAlpha, a public annual-report benchmark for cross-market factor research. CrossAlpha addresses these challenges through three corresponding components: Disclosure Distillation, which standardises heterogeneous filings into ten-category English business descriptions; Residual Schema Graph Construction, which builds PCA-whitened cross-market firm-pair scores from schema-level disclosures; and Timing-Aligned Evaluation, which pairs the graph with 11 years of daily OHLCV data to construct forward-return labels under feasible cross-market execution protocols. CrossAlpha covers about 3,600 firms and 10,700 firm-year reports from the United States, Japan, Taiwan, South Korea, and Hong Kong, and releases about 19M directed firm-pair scores. In experiments, disclosure-derived cross-market peers outperform domestic text, industry-code, and return-correlation peers in the US-to-Japan setting (ICIR 0.39 versus 0.07--0.18), and cross-market sources beat the domestic text baseline in most target markets. CrossAlpha offers an open-sourced, reusable, return-grounded benchmark for cross-market financial NLP.
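For readers unfamiliar with the ICIR figure quoted in this abstract, the sketch below uses the usual textbook definitions: the information coefficient (IC) for one period is the rank correlation between a signal and subsequent returns, and ICIR is the mean of the per-period ICs divided by their standard deviation. The helper names and the toy numbers are assumptions; the benchmark's exact evaluation protocol is not given in the abstract.

```python
import math
import statistics
from typing import List


def ranks(values: List[float]) -> List[int]:
    """Rank positions (0 = smallest). Ties are not averaged in this simplified sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def information_coefficient(signal: List[float], fwd_returns: List[float]) -> float:
    """Spearman-style IC: Pearson correlation computed on ranks."""
    return pearson(ranks(signal), ranks(fwd_returns))


def icir(ics: List[float]) -> float:
    """Mean IC divided by the sample standard deviation of the ICs."""
    return statistics.mean(ics) / statistics.stdev(ics)


# Made-up per-period ICs, for illustration only.
print(icir([0.1, 0.2, 0.3]))
```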
19. 【2605.29280】LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation
Link: https://arxiv.org/abs/2605.29280
Authors: Shali Jiang, Hua Zheng, Boyang Liu, Laming Chen, Kenny Lov, Chuanqi Xu, Lisang Ding, Qinghai Zhou, Can Cui, Xiaolong Liu, Xiaoyi Liu, Yasmine Badr, Xin Xu, Jiyan Yang, Ellie Dingqiao Wen, Gerard Jonathan Mugisha Akkerhuis, Chenxiao Guan, Rong Jin, Ruichao Qiu, Xian Chen, Shifu Xu, Zhehui Zhou, Ping Chen, Rui Yang, Haicheng Chen, Xiangge Meng, Song Zhou, Dharak Kharod, Shuyu Xu, Qiang Jin, Qiao Yang, Wankun Zhu, Qin Huang, Yuzhen Huang, Darren Liu, Parish Aggarwal, Hui Zhou, Erzhuo Wang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Huayu Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: single scalar prediction, large foundation model, compact vertical models, single scalar, larger FMs learn
Comments: Shali Jiang, Hua Zheng, Boyang Liu contributed equally to this work
Click to view abstract
Abstract:Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffering from diminishing transfer ratio -- the fraction of FM improvement captured by the VM -- as a single scalar cannot convey the rich intermediate knowledge that larger FMs learn. To address this bottleneck, we propose LoopFM (Learning frOm HistOrical RePresentations of FM), a framework that opens a high-bandwidth transfer channel by structuring FM intermediate embeddings as input features (e.g., user history sequence) for downstream VMs, without requiring real-time FM inference at serving and architectural coupling between FM and VM. We provide a theoretical framework for LoopFM with a gain decomposition and transfer-ratio analysis. On three public benchmarks, LoopFM demonstrates strong AUC improvements (e.g., 6%+ on TaobaoAd) and complementary knowledge transfer capability with KD. On industrial-scale systems (billions of examples, trillion-parameter FMs), LoopFM approximately doubles the knowledge transfer ratio on top of KD, delivering a +0.5% conversion improvement in Y1H1, and a +1.03% and +1.22% conversion improvement from two individual launches respectively in Y1H2.
20. 【2605.29271】CoHyDE: Iterative Co-Training of LLM Rewriter and Dense Encoder for Tool Retrieval
Link: https://arxiv.org/abs/2605.29271
Authors: Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: technical API vocabulary, large API catalogs, user queries arrive, technical API, API vocabulary
Comments:
Click to view abstract
Abstract:Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
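The encoder-side objective named in this abstract, InfoNCE, can be sketched for a single query as the negative log-probability of the positive item under a softmax over similarity scores. The snippet below is a minimal illustration that assumes dot-product similarity on pre-normalized vectors and a temperature of 0.1; these choices, the function name, and the toy vectors are assumptions, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    query: Vector,
    positive: Vector,
    negatives: List[Vector],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss for one query: -log softmax probability of the positive."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy unit vectors: the first call has a well-aligned positive, the second does not.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned, misaligned)
```

The loss is near zero when the positive scores far above every negative and grows when a negative outranks it, which is the signal a contrastive encoder update would follow.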
21. 【2605.29250】OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
Link: https://arxiv.org/abs/2605.29250
Authors: Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Real-world information, property graphs, access to structurally, structurally diverse knowledge, Real-world
Comments:
Click to view abstract
Abstract:Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.
22. 【2605.29240】Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI
Link: https://arxiv.org/abs/2605.29240
Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: AI-augmented classrooms generate, classrooms generate rich, timely instructional decisions, generate rich teacher, AI-augmented classrooms
Comments: Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning
Click to view abstract
Abstract:AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $\rho=0.80$) and student-reported topic difficulty ($\rho=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
23. 【2605.29234】Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth
Link: https://arxiv.org/abs/2605.29234
Authors: Gaurav Sahu, Laurent Charlin, Christopher Pal
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: large-scale literature search, study large-scale literature, Deep Research pipeline, human reference list, improving the retrieval
Comments:
Click to view abstract
Abstract:We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86--88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
24. 【2605.29232】On the Practice of Scaling Search Conversion Rate Prediction
Link: https://arxiv.org/abs/2605.29232
Authors: James Pak, Jyun-Yu Jiang, Fan Zhang, Sen Wang, Taekmin Kim, Henry Tsai, Vijay Rajaram, Juexin Lin, Mohitdeep Singh, Alessandro Magnani, Johnny Chen, Qian Zhao, Rao Fu, Zhirong Liang, Jordan Gilliland, Winter Jiao
Categories: Information Retrieval (cs.IR)
Keywords: search CVR prediction, Search Conversion Rate, search CVR models, CVR prediction models, search CVR
Comments:
Click to view abstract
Abstract:Scaling a Search Conversion Rate (CVR) prediction model, especially in high-traffic environments, presents a challenge: superior model quality needs to be balanced with strict constraints on training cost and serving latency. This paper details an effective approach for scaling modern search CVR prediction models. We begin with an empirical study to understand the scaling performance of search CVR models, analyzing how quality improves as we scale three key factors of model backbone computation, the size of embedding parameters, and the volume of training data. We use a large-scale production dataset, comprising over a year of customer interaction logs from a high-traffic e-commerce platform, to evaluate the scalability of several state-of-the-art architectures and their ensembles. Our key findings are: (1) selecting the right backbone and scaling factors is crucial; (2) the impact of scaling backbone, embedding, and data is largely independent and additive, which has implications for more efficient scaling exploration; (3) a streamlined warmstart strategy can accelerate training iterations while simplifying new updates; (4) inference optimization strategies such as decoupled graph execution and dynamic batching can enable low-latency GPU serving even for high-capacity models. Compared to a baseline of a pre-scaling production model, we ultimately deployed a model trained on 2.5x larger training data with 8x more inference compute while having minimal latency impact. Online A/B tests also demonstrate that our launches achieved a combined +2.6% gain in a key metric of search conversion rate.
25. 【2605.29158】PROTOCOL: Late Interaction Retrieval for Protein Homolog Search
Link: https://arxiv.org/abs/2605.29158
Authors: Gabrielle Cohn, Rohan Gumaste, Minh Hoang, Vihan Lakshman
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Biomolecules (q-bio.BM)
Keywords: underlies function annotation, global sequence similarity, classical alignment methods, alignment methods lose, methods lose sensitivity
Comments:
Click to view abstract
Abstract:Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models provide context-aware representations that could improve alignment sensitivity in this regime. However, prior protein embedding-based retrieval pipelines often pool these representations into a single vector, potentially obscuring local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model which represents proteins as sets of residue embeddings and uses ColBERT-style late interaction to test whether residue-level comparison improves homolog retrieval. ProtoCol encodes proteins independently, keeps candidate representations pre-computable, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.
26. 【2605.29141】Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback
Link: https://arxiv.org/abs/2605.29141
Authors: Weizhi Zhang, Wooseong Yang, Yuxin Cui, Zhaohui Guo, Hins Hu, Liangwei Yang, Henry Peng Zou, Qifei Wang, Hanqing Zeng, Jiayi Liu, Yinglong Xia, Philip S. Yu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Traditional recommender systems, Traditional recommender, primarily infer user, rich explicit contextual, infer user preferences
Comments: Published in CogMI 2025. [this https URL](https://ieeexplore.ieee.org/abstract/document/11417068)
Click to view abstract
Abstract:Traditional recommender systems (RecSys) primarily infer user preferences from implicit signals (such as clicks, watches, and purchases), often neglecting the rich explicit contextual feedback users provide through verbal text, like comments and reviews. This explicit context feedback captures the nuanced reasons behind user decisions regarding their preferences. In addition, it offers critical heterogeneous information for user preference alignment and more explainable recommendations. Overlooking such signals can lead to misaligned user preferences and further reinforce filter bubbles, as algorithms fail to understand the "semantic context" behind user choices. Recent advances in Large Language Models (LLMs) present new opportunities to harness user-generated content for more accurate and diverse recommendations, yet current LLM-based recommendations still focus on using item meta-data and underutilize this resource. In this paper, we advocate for prioritizing explicit context feedback in the next generation of LLM-based RecSys. We review the evolution of recommendation paradigms, highlight the value of context-rich feedback, call for new benchmarks and metrics, and introduce frameworks for integrating explicit user signals into scalable LLM-driven RecSys. Centering on user-preference modeling, we aim to foster more personalized, transparent, and explainable RecSys online platforms.
27. 【2605.29084】Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
Link: https://arxiv.org/abs/2605.29084
Authors: Yubo Li, Rema Padman, Ramayya Krishnan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: multi-author institutional corpus, paradigm cannot diagnose, mode the dominant, corpus can give, failure mode
Comments:
Click to view abstract
Abstract:A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.
28. 【2605.28918】When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
Link: https://arxiv.org/abs/2605.28918
Authors: Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu, Dingyan Shang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: LLM-generated reward shaping, one-shot generation, LLM-generated reward, structured reinforcement-learning tasks, reward shaping
Comments:
Abstract:For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than one-shot generation. We study PPO-trained agents using MiniGrid as core evaluation and MuJoCo as boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7% with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing the taxonomy prompt is a major mechanism and dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.
29. 【2605.28888】Generative Spatiotemporal Intent Sequence Recommendation via Implicit Reasoning in Amap
Link: https://arxiv.org/abs/2605.28888
Authors: Sicong Wang, Ruiting Dong, Yue Liu, Bowen Zheng, Jun Meng, Jie Li, Shuaijun Guo, Yu Gu, Fanyi Di, Xin Li
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: user behavior rarely, behavior rarely consists, forms intent flows, intent flows governed, Real-world user behavior
Comments: 9 pages, 1 figure
Abstract:Real-world user behavior rarely consists of isolated actions; instead, it often forms intent flows governed by spatiotemporal dependencies. To provide integrated service recommendations, we focus on the task of Generative Spatiotemporal Intent Sequence Recommendation (GSISR), which aims to generate intent sequences that are logically coherent and physically executable within complex spatiotemporal contexts. While LLMs offer strong reasoning potential for GSISR, direct industrial deployment is limited by high inference latency and context-mismatched or physically infeasible plans. To address these challenges, we propose a generative framework, GPlan, that internalizes LLM reasoning into lightweight models through two components. First, to enable reasoning under strict latency constraints, we introduce Progressive Implicit CoT Distillation, which compresses explicit reasoning processes into reserved latent tokens, allowing small models to inherit complex planning logic without generating long reasoning text. Second, to address the disconnect between general knowledge and real-world constraints, we design Spatiotemporal Counterfactual DPO. By aligning the model with counterfactual context-plan pairs, we improve sensitivity to spatiotemporal context and reduce context-mismatched plans. Offline experiments and online A/B testing demonstrate that our approach improves sequence coherence and context responsiveness. Our implementation and the anonymized GSISR dataset are available at this https URL.
Computer Vision
1. 【2605.30352】GMOS: Grounding Moving Object Segmentation in 3D Space and Time
Link: https://arxiv.org/abs/2605.30352
Authors: Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Moving Object Segmentation, aims to discover, Video Object Segmentation, move independently, Object Segmentation
Comments: Project Page: [this https URL](https://www.robots.ox.ac.uk/vgg/research/gmos/)
Abstract:Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground--background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.
2. 【2605.30351】VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Link: https://arxiv.org/abs/2605.30351
Authors: Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Long-rollout causal video, recent progress innovating, Long-rollout causal, causal video diffusion, sliding-window KV cache
Comments: Project Page: [this https URL](https://videomla.github.io/)
Abstract:Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
3. 【2605.30349】AdaState: Self-Evolving Anchors for Streaming Video Generation
Link: https://arxiv.org/abs/2605.30349
Authors: Yusuf Dalva, Pinar Yanardag
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Autoregressive video diffusion, Autoregressive video, video diffusion models, producing frames sequentially, diffusion models generate
Comments: Project page: [this https URL](https://adastate.github.io/)
Abstract:Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.
4. 【2605.30347】NeuROK: Generative 4D Neural Object Kinematics
Link: https://arxiv.org/abs/2605.30347
Authors: Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: approaches have revolutionized, enabling transformers, transformers to effectively, effectively reconstruct, reconstruct and generate
Comments: CVPR 2026
Abstract:Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics -- realistic temporal deformations of static objects under various physical conditions -- remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low-dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this neural simulation framework across diverse dynamic object types, showing clear advantages over prior works. Project page: this https URL
5. 【2605.30346】YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Link: https://arxiv.org/abs/2605.30346
Authors: You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: key question arises, video diffusion models, statistical temporal patterns, diffusion models, world models
Comments: Project page: [this https URL](https://www.youzhexie.me/papers/YoCausal/index.html)
Abstract:As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
6. 【2605.30342】Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
Link: https://arxiv.org/abs/2605.30342
Authors: Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords:
Comments: Accepted to CVPR 2026. Project page [this https URL](https://gatech-rl2.github.io/GAVIS/)
None
7. 【2605.30341】GPIC: A Giant Permissive Image Corpus for Visual Generation
Link: https://arxiv.org/abs/2605.30341
Authors: Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Studying scalable methods, Studying scalable, modeling requires large, Giant Permissive Image, Permissive Image Corpus
Comments: 25 pages; Dataset: [this https URL](https://huggingface.co/datasets/stanford-vision-lab/giant-permissive-image-corpus); Project website: [this https URL](https://gpic.stanford.edu)
Abstract:Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at this https URL. Evaluation toolkit and code are available at this https URL
8. 【2605.30339】Benchmarking Single-Factor Physical Video-to-Audio Generation
Link: https://arxiv.org/abs/2605.30339
Authors: Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: highly plausible soundtracks, produce highly plausible, models produce highly, plausible soundtracks, produce highly
Comments: CVPR 2026
Abstract:Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: this https URL
9. 【2605.30338】REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image
Link: https://arxiv.org/abs/2605.30338
Authors: Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single RGB image, RGB image enables, simulation-ready digital assets, Reconstructing physically stable, single RGB
Comments: Project page: [this https URL](https://shirleymaxx.github.io/REST3D/)
Abstract:Reconstructing physically stable 3D scenes from a single RGB image enables casual images to be converted into simulation-ready digital assets for applications such as immersive interaction and content creation. However, existing single-image reconstruction methods fall short in capturing the physical structure of a scene. As a result, they often produce geometrically plausible but physically inconsistent results, including object floating and penetration, which lead to unstable behavior in physics simulations. Image-conditioned scene generation methods improve physical plausibility but often rely on strong scene priors, yielding plausible yet inaccurate object arrangements that fail to match the input image. We propose REST3D, a single-image reconstruction framework that can reconstruct physically stable 3D scenes by integrating physical scene understanding with physics-constrained refinement. We first introduce an agentic physical scene understanding technique that constructs a scene-tree representation capturing object physical states and inter-object relationships from a gravity-support perspective, providing a structural prior for reconstruction. Leveraging this structure, we initialize the scene using image-to-3D models, followed by scene-tree-guided alignment and physics-constrained optimization to resolve physical violations while preserving visual consistency with the input image. Experiments show that our method significantly reduces physical errors and improves simulation stability on both synthetic and real-world datasets while maintaining strong reconstruction quality. We further demonstrate the reconstructed scenes in VR-based human-object interaction, showing their potential for immersive applications.
10. 【2605.30332】Colored Noise Diffusion Sampling
Link: https://arxiv.org/abs/2605.30332
Authors: Hadar Davidson, Noam Issachar, Sagie Benaim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: resolving low-frequency global, generative trajectories fundamentally, trajectories fundamentally exhibiting, low-frequency global structures, global structures early
Comments:
Abstract:Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget. In this work, we establish a mathematical framework that reconsiders SDE inference as a targeted, frequency-decoupled energy transfer. Leveraging this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS utilizes a dynamic, timestep- and frequency-dependent schedule that more efficiently allocates injected energy toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments demonstrate that CNS significantly outperforms standard ODE and SDE baselines as a strictly plug-and-play, inference-time sampler substitution across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS achieves substantial unguided FID reductions, improving from 8.26 to 6.27 on SiT-XL/2, 32.39 to 26.69 on JiT-B/16, and 11.88 to 8.31 on JiT-H/16, while yielding consistent relative FID improvements with Classifier-Free Guidance. Project page is available at this https URL.
11. 【2605.30328】Supercharging Thermal Gaussian Splatting with Depth Estimation
Link: https://arxiv.org/abs/2605.30328
Authors: Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Efficient and robust, scene representation, autonomous driving, representation is crucial, crucial in autonomous
Comments: 8 pages, 4 figures. Accepted and will be published in ISPRS proceedings (ISPRS Congress 2026)
Abstract:Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities like thermal or depth can enable additional information on the environment. Lately, novel view synthesis methods like 3D Gaussian Splatting have started using multiple modalities to further boost their performance. But fusing or combining multimodal data can make the process slower and can bring in additional challenges. Therefore, our project aims to use single modality based on thermal infrared domain, by removing the reliance on visible light as much as possible. This single modality can be expected to be faster as it does not rely on multimodal data. We propose a method, Thermal-to-Depth Gaussian Splatting (TDg), that uses only thermal images and depth estimation in its architecture to derive the radiance fields. Our TDg method outperforms the MSMG (Multiple Single-Modal Gaussians) baseline in most cases on our test datasets, RGBT-Scenes and ThermalMix. On average, the rendering quality metrics such as learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) of TDg are 1.12%, 0.034%, and 0.01% better than the baseline MSMG values. It also reduces the training time significantly, by 12 mins 47 secs (55% improvement). Overall, our method is successful in deriving these thermal radiance fields, which can ultimately have several applications, such as identifying heat sources critical in surveillance, search or rescue operations, and industrial inspections where temperature is widely used to monitor machines.
12. 【2605.30325】Veda: Scalable Video Diffusion via Distilled Sparse Attention
Link: https://arxiv.org/abs/2605.30325
Authors: Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Scaling Diffusion Transformers, Scaling Diffusion, attention methods degrade, Diffusion Transformers, Transformers to generate
Comments: Accepted to ICML 2026
Abstract:Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention. Based on this insight, we propose Veda, a distilled sparse attention framework that formulates tile selection as an explicit reconstruction problem from full attention. Veda integrates statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch, enabling aggressive sparsity. A hardware-efficient tile-skipping kernel converts theoretical sparsity into practical wall-clock speedups. Experiments on large video diffusion models, including Waver and Wan2.1, demonstrate substantial acceleration with no noticeable degradation in generation quality. To generate 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1$\times$ end-to-end speedup and a 10.5$\times$ self-attention speedup, reducing attention overhead from 92% to 50%. Notably, the gains increase with sequence length, indicating that Veda scales favorably with spatiotemporal resolution across models.
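The speedup figures quoted for Waver-T2V-12B can be related to each other with a back-of-the-envelope Amdahl's-law calculation. The following is a minimal sketch, assuming that self-attention is the only component that gets faster and that the 92% figure is attention's share of the original runtime; the function names are hypothetical and do not come from the paper.

```python
def amdahl_speedup(accelerated_fraction, component_speedup):
    """End-to-end speedup when only one component of the runtime is accelerated."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / component_speedup
    return 1.0 / remaining


def share_after_speedup(accelerated_fraction, component_speedup):
    """Share of the new runtime still spent in the accelerated component."""
    accelerated_time = accelerated_fraction / component_speedup
    return accelerated_time / ((1.0 - accelerated_fraction) + accelerated_time)


# Figures taken from the abstract; the modeling assumption is ours.
attention_share = 0.92      # attention's share of the original runtime
attention_speedup = 10.5    # reported self-attention speedup

end_to_end = amdahl_speedup(attention_share, attention_speedup)
share_after = share_after_speedup(attention_share, attention_speedup)

print(f"idealized end-to-end speedup: {end_to_end:.2f}x")
print(f"idealized attention share afterwards: {share_after:.1%}")
```

Under these assumptions the idealized estimate is roughly 6x end to end, with attention at about 52% of the new runtime. That is in the same range as the reported 5.1x and 50%; the remaining gap may come from costs this simple model ignores, although the abstract does not say so.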
13. 【2605.30320】MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos
Link: https://arxiv.org/abs/2605.30320
Authors: Daniel Rho, Jun Myeong Choi, Matthew Thornton, Biswadip Dey, Roni Sengupta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: methods recover physical, physics methods recover, recover physical parameters, inverse physics methods, views resolve scale
Comments:
Abstract:Existing inverse physics methods recover physical parameters from multi-view videos, where geometric constraints across views resolve scale and 3D structure. In monocular settings, however, such constraints are absent, leading to severe scale ambiguity, inaccurate geometry, and weak coupling between appearance optimization and physical simulation. We propose MonoPhysics, a framework for monocular inverse physics estimation of deformable objects using differentiable MPM simulation and 3D Gaussian Splatting, which jointly optimizes geometry, appearance, and physical parameters from a single camera view. We address these challenges through three visual-physical bridges: global scale alignment, physics-aware geometry refinement, and a differentiable position map, which together enable accurate optimization from monocular observations alone. We evaluate on Vid2Sim and our new dataset of elastic and plastic objects, showing that MonoPhysics outperforms existing baselines in monocular settings and achieves performance comparable to multi-view baselines using only a single camera. Our project page is available at this https URL
14. 【2605.30318】Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes
Link: https://arxiv.org/abs/2605.30318
Authors: Ruixiang Jiang, Chang Wen Chen
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shutter opens, subject pose, largely decided, camera configuration, Photographic Scene Graph
Comments:
Abstract:Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: this https URL
15. 【2605.30317】VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
Link: https://arxiv.org/abs/2605.30317
Authors: Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: inference time, making them vulnerable, generated prefix, generators are trained, trained with teacher-forced
Comments:
Abstract:Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
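The contrast-and-extrapolate step described in this abstract can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it assumes a classifier-free-guidance-style linear rule, `clean + scale * (clean - corrupted)`, which the abstract does not spell out, and the function names, the `scale` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_guided_logits(clean_logits, corrupted_logits, scale):
    """Push logits away from the corrupted-prefix prediction.

    Assumed rule (not confirmed by the abstract):
    guided = clean + scale * (clean - corrupted).
    """
    return [c + scale * (c - k) for c, k in zip(clean_logits, corrupted_logits)]


# Toy next-token logits over a 3-entry vocabulary (illustrative values only).
clean = [2.0, 1.0, 0.5]        # model output given the generated prefix
corrupted = [1.0, 1.0, 1.0]    # model output given a corrupted prefix

guided = prefix_guided_logits(clean, corrupted, scale=1.5)
probs = softmax(guided)
print(guided)
print(probs)
```

In this toy setting, candidates whose logit rises when the intact prefix is visible gain probability mass, and setting `scale` to 0 recovers the unguided prediction.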
16. 【2605.30311】Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Link: https://arxiv.org/abs/2605.30311
Authors: Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: including text, immersive interaction, visual content, remains an open, unified multimodal model
Comments: Accepted to CVPR 2026. Project Page: [this https URL](https://zju3dv.github.io/archon/)
Abstract:Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: this https URL.
17. 【2605.30310】City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images
Link: https://arxiv.org/abs/2605.30310
Authors: Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Keywords: poses highly challenging, challenging problems due, highly challenging problems, Gaussian Splatting, challenging problems
Comments: Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: [this https URL](https://citymesh3r.github.io/)
Abstract:City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar representations often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by a distributed 3D dense reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. The partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As demonstrated by our qualitative and quantitative results, our proposed method yields high-fidelity watertight 3D meshes with regular geometry, capturing fine surface details, and is suitable for scaling to arbitrarily large scenes owing to the end-to-end processing in a distributed setting.
18. 【2605.30307】Grounded 3D-Aware Spatial Vision-Language Modeling
Link: https://arxiv.org/abs/2605.30307
Authors: An-Chieh Cheng,Yang Fu,Yatai Ji,Ligeng Zhu,Guanqi Zhan,Zhuoyang Zhang,Zhaojing Yang,Song Han,Yao Lu,Pavlo Molchanov,Vidya Nariyambut Murali,Jan Kautz,Xiaolong Wang,Hongxu Yin,Sifei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: vision language model, language model equipped, spatial vision language, single framework, vision language
Comments: CVPR 2026 [this https URL](https://www.anjiecheng.me/gr3d)
Click to view abstract
Abstract:We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.
19. 【2605.30269】Boosting Image Quality Assessment Performance: Unsupervised Score Fusion by Deep Maximum a Posteriori Estimation
Link: https://arxiv.org/abs/2605.30269
Authors: Zhongling Wang,Raymond Zhou,Shahrukh Athar,Wenbo Yang,Zhou Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Image Quality Assessment, numerous Image Quality, Quality Assessment, Image Quality, perceptual quality
Comments: 2024 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
Click to view abstract
Abstract:Over the past decades, numerous Image Quality Assessment (IQA) models have emerged, aiming to predict the perceptual quality of images. However, individual models are often biased toward certain types of image content or distortions, depending on the design principle and process. An intuitive idea is to harness the strengths and mitigate the weaknesses of each IQA model, by fusing the scores of multiple models into a stronger one. Here we make one of the first attempts to seek an optimal solution for the idea and propose a general framework for unsupervised IQA score fusion using deep Maximum a Posteriori (MAP) estimation. The proposed model conducts fine-grained uncertainty estimation at the score level to increase the accuracy and reduce the uncertainty in fused predictions. Comprehensive experiments demonstrate the superiority of the proposed model over individual IQA models and other fusion methods. It also exhibits an interesting capability of rejecting "bad" models in the fusion process.
20. 【2605.30268】PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions
Link: https://arxiv.org/abs/2605.30268
Authors: Omer Benishu,Gal Fiebelman,Sagie Benaim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: generating physically accurate, visually faithful, address the task, task of generating, accurate and visually
Comments:
Click to view abstract
Abstract:We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: this https URL
21. 【2605.30265】LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
Link: https://arxiv.org/abs/2605.30265
Authors: Feng Han,Zhixiong Zhang,Zheming Liang,Yibin Wang,Jiaqi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: achieved substantial progress, image-text training aimed, large-scale image-text training, driven by large-scale, achieved substantial
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
22. 【2605.30263】minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
Link: https://arxiv.org/abs/2605.30263
Authors: Min Zhao,Hongzhou Zhu,Bokai Yan,Zihan Zhou,Yimin Chen,Wenqiang Sun,Kaiwen Zheng,Guande He,Xiao Yang,Chongxuan Li,Fan Bao,Jun Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interactive video world, video world models, achieved remarkable progress, Recent video diffusion, real-time interactive video
Comments:
Click to view abstract
Abstract:Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: this https URL
23. 【2605.30260】How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
Link: https://arxiv.org/abs/2605.30260
Authors: Ziwen Xu,Haiwen Hong,Linsong Yu,Benglei Cui,Longtao Huang,Hui Xue,Ningyu Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, dynamic real-world environments, real-world environments
Comments: Ongoing work
Click to view abstract
Abstract:Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction $\Delta L$ to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of $p > 0.5$ constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at this https URL.
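The threshold claim in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the threshold is read as p > 0.5 and uses a made-up four-token distribution; the helper names are hypothetical and this is not the authors' code. Because next-token probabilities sum to 1, a token with probability above 0.5 leaves less than 0.5 for all other tokens combined, so greedy decoding must pick it.

```python
def greedy_token(probs):
    """Return the index of the highest-probability token (greedy decoding)."""
    return max(range(len(probs)), key=probs.__getitem__)


def above_threshold(probs, target, threshold=0.5):
    """True when the target token's probability strictly exceeds the threshold."""
    return probs[target] > threshold


# Hypothetical next-token distribution over a 4-token vocabulary.
confident = [0.10, 0.55, 0.20, 0.15]
# The condition is sufficient, not necessary: token 0 still wins here at 0.4.
plurality = [0.40, 0.30, 0.30]

print(above_threshold(confident, 1), greedy_token(confident))
print(above_threshold(plurality, 0), greedy_token(plurality))
```

The second distribution shows why the abstract says "sufficient": a token below the threshold can still be the greedy choice, but nothing guarantees it.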
24. 【2605.30257】Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning
Link: https://arxiv.org/abs/2605.30257
Authors: Ciara Rowles,Reshinth Adithyan,Nikhil Pinnaparaju,Vikram Voleti,Mark Boss
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reinforcement learning framework, pretrained layer decomposition, reinforcement learning, learning framework, framework that eliminates
Comments: 25 pages, 8 figures, 4 tables. Project page: [this https URL](https://stability-ai.github.io/stable-layers.github.io/)
Click to view abstract
Abstract:We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
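To make the reward-variance problem concrete, here is a minimal sketch of group-relative advantages. It assumes the commonly used mean/standard-deviation normalization within a group of samples; the abstract does not state the exact formula used in Stable-Layers, and the function name and scores are illustrative only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical VLM scores for four candidate decompositions of one image.
spread_scores = [2.0, 4.0, 6.0, 8.0]   # judge separates the candidates
flat_scores = [7.0, 7.0, 7.0, 7.0]     # judge compresses everything to one value

print(group_relative_advantages(spread_scores))
print(group_relative_advantages(flat_scores))
```

When every candidate in a group receives the same score, all advantages are zero and the policy gets no learning signal from that group, which is the failure mode the side-by-side calibration step is meant to avoid.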
25. 【2605.30256】VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Link: https://arxiv.org/abs/2605.30256
Authors: Amrita Mazumdar,Seonwook Park,Rajarshi Roy,Nikhil Srihari,Shengze Wang,Yuhao Zhou,Julia Wang,Koki Nagano,Shalini De Mello
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: people simultaneously speak, Natural human conversation, people simultaneously, producing nonverbal cues, simultaneously speak
Comments: Project page: [this https URL](https://research.nvidia.com/labs/amri/projects/video-fdb/)
Click to view abstract
Abstract:Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.
26. 【2605.30250】Ambient-robust Inverse Rendering using Active RGB-NIR Imaging
Link: https://arxiv.org/abs/2605.30250
Authors: Hoon-Gyu Chung,Jinnyeong Kim,Hyunwoo Kang,Seung-Hwan Baek
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Inverse rendering, NIR, Inverse rendering aims, inverse rendering method, ambient illumination conditions
Comments: 11 pages
Click to view abstract
Abstract:Inverse rendering aims to reconstruct geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we introduce an ambient-robust inverse rendering method enabled by active RGB-NIR imaging. Our key insight is to leverage near-infrared (NIR) flash illumination, which is imperceptible to human observers, to obtain stable point-light shading that is largely invariant to ambient illumination. By using multi-view RGB images illuminated by ambient light and NIR images acquired with active NIR flash illumination, we reconstruct accurate geometry and reflectance by exploiting the complementary benefits of RGB and NIR images via a three-stage inverse rendering method. To enable dense multi-view acquisition, we develop an active imaging system equipped with an RGB-NIR camera and an NIR flash mounted on a mobile base. Using this system, we collect the first multi-view RGB-NIR inverse rendering dataset captured under multiple ambient illumination conditions. Experiments demonstrate that our method outperforms prior approaches, achieving accurate geometry and reflectance estimation across multiple ambient lighting scenarios.
27. 【2605.30248】GenClaw: Code-Driven Agentic Image Generation
Link: https://arxiv.org/abs/2605.30248
Authors: Junyan Ye,Jun He,Zilong Huang,Dongzhi Jiang,Xuan Yang,Rui Chen,Weijia Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: tool invocation capabilities, multimodal agents endowed, invocation capabilities, evolved from text-conditioned, comprehension and tool
Comments: 21 pages, 7 figures
Click to view abstract
Abstract:Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, this http URL) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
28. 【2605.30244】Reinforcement Learning with Robust Rubric Rewards
Link: https://arxiv.org/abs/2605.30244
Authors: Ya-Qi Yu,Hao Wang,Fangyu Hong,Xiangyang Qu,Gaojie Wu,Qiaoyu Luo,Nuo Xu,Huixin Wang,Wuheng Xu,Yongxin Liao,Zihao Chen,Haonan Li,Ziming Li,Dezhi Peng,Minghui Liao,Jihao Wu,Haoyu Ren,Dandan Tu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: demanding multi-criteria supervision, deterministically checkable tasks, Reinforcement Learning, Verifiable Rewards, RLR
Comments:
Click to view abstract
Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.
29. 【2605.30239】SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World
Link: https://arxiv.org/abs/2605.30239
Authors: Xin Dong,Weijian Deng,Lihan Zhang,Tianru Dai,Wenfeng Deng,Yansong Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: work addresses, addresses the problem, complete object geometry, scene, object geometry
Comments: 23 pages, 11 figures
Click to view abstract
Abstract:This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: this https URL
30. 【2605.30235】BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval
Link: https://arxiv.org/abs/2605.30235
Authors: Marco Peer,Anna-Scius Bertrand,Patricia Scheurer,Andreas Fischer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Heinrich Bullinger, correspondence of Heinrich, large-scale benchmark dataset, document analysis based, High German
Comments: Accepted for presentation at ICDAR2026. Dataset available via zenodo
Click to view abstract
Abstract:We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, featuring stylistic variation, multilingual content (mostly Latin and Early New High German) as well as meta-information such as writer identity and time. We evaluate BullingerDB on text recognition and writer retrieval. TrOCR, the best performing model, achieves a CER of 9.1%. For writer retrieval, we introduce a temporal nDCG metric to assess time-aware retrieval. While temporally coherent retrieval is achievable, mAP (78.3%) scores indicate challenges due to long-term stylistic variation. With BullingerDB, we aim to establish a new benchmark for multilingual historical text recognition and temporally-aware writer analysis.
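For readers unfamiliar with the CER figure quoted above, the following is a small sketch of the standard definition: character-level edit distance divided by the reference length. This is the usual formulation, assumed here rather than taken from the paper's evaluation code, and the sample strings are invented.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Invented example: one wrong character in a four-character line.
print(cer("abcd", "abxd"))
```

Under this definition, a CER of 9.1% corresponds to roughly nine character edits per hundred reference characters.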
31. 【2605.30231】Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Link: https://arxiv.org/abs/2605.30231
Authors: Chun-Hsiao Yeh,Shengyi Qian,Manchen Wang,Yi Ma,Joseph Tighe,Fanyi Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Vision-Language Models, struggle with robust, Models, fundamental geometric priors, spatial
Comments: CVPR 2026. Project page: [this https URL](https://danielchyeh.github.io/GASP/)
Click to view abstract
Abstract:Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
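As an illustration of what a contrastive loss on ground-truth point correspondences can look like, here is a minimal InfoNCE-style sketch in plain Python. InfoNCE is a common choice and is an assumption here; the abstract does not specify the exact loss, and the function names, temperature, and two-dimensional features are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(feats_a, feats_b, temperature=0.1):
    """Mean cross-entropy where feats_a[i] should match feats_b[i] and no other."""
    losses = []
    for i, fa in enumerate(feats_a):
        logits = [dot(fa, fb) / temperature for fb in feats_b]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy point features from two views; index i in each list is the same 3D point.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]
swapped_b = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce(view_a, aligned_b))
print(info_nce(view_a, swapped_b))
```

The loss is small when corresponding points have similar features across views and large when they are confused with other points, which is the view-invariance pressure the abstract describes.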
32. 【2605.30230】IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
Link: https://arxiv.org/abs/2605.30230
Authors: Hao Wu,Xiangyang Luo,Hao Wang,Jiawei Zhang,Yi Zhang,Jinwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made remarkable progress, talking face generation, remarkable progress, talking face, face generation
Comments:
Click to view abstract
Abstract:With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.
33. 【2605.30215】Déjà View: Looping Transformers for Multi-View 3D Reconstruction
Link: https://arxiv.org/abs/2605.30215
Authors: Alessandro Burzio,Tobias Fischer,Sven Elflein,Qunjie Zhou,Riccardo de Lutio,Jiawei Ren,Jiahui Huang,Shengyu Huang,Marc Pollefeys,Laura Leal-Taixé,Zan Gojcic,Haithem Turki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision, Recent feed-forward, broader trend, trend of increasing, Recent
Comments:
Click to view abstract
Abstract:Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
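The idea of reusing one block for K refinement steps can be sketched without any deep-learning library. The toy block below is only a stand-in for a transformer block (it pulls each per-view value toward the cross-view mean); the names and the update rule are hypothetical and are not the DéjàView architecture. What the sketch does show is that the same function, with the same parameters, is applied K times, so K can be changed at inference without adding parameters.

```python
def shared_block(features, step=0.5):
    """Toy stand-in for one block: move each view's value toward the cross-view mean."""
    center = sum(features) / len(features)
    return [f + step * (center - f) for f in features]


def looped_refine(features, block, k):
    """Apply the same block k times; k is chosen at inference time."""
    for _ in range(k):
        features = block(features)
    return features


# Hypothetical scalar features for three views of the same scene.
views = [0.0, 4.0, 8.0]

for k in (0, 1, 2):
    print(k, looped_refine(views, shared_block, k))
```

A variant with independent per-step parameters would instead hold K different blocks, which is the baseline the abstract compares against.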
34. 【2605.30211】Cycle Consistency in Video Object-Centric Learning
Link: https://arxiv.org/abs/2605.30211
Authors: Rongzhen Zhao, Zhiyuan Li, Ruonan Wei, Juho Kannala, Joni Pajarinen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: self-supervised Multi-Object Tracking, video Object-Centric Learning, Self-supervised video Object-Centric, Object-Centric Learning, Multi-Object Tracking
Comments: 14 pages
Abstract:Self-supervised video Object-Centric Learning (OCL) aims to discover distinct objects and associate them across time, whereas self-supervised Multi-Object Tracking (MOT) focuses on associating pre-defined object detections or segmentations. Although well-established in MOT, Cycle Consistency (CC) cannot naively or explicitly apply to the latent slot space of OCL. Unlike the deterministic and ideal object representations in MOT, OCL slots are inherently stochastic and ambiguous due to non-unique scene decompositions. Enforcing explicit cycle consistency (ECC) on slots imposes rigid mean seeking. This severely penalizes the model for exploring alternative but equally valid decompositions, thereby driving towards feature collapse. To resolve this dilemma, we propose \textit{Implicit Cycle Consistency (ICC)}, which shifts the cycle-consistency constraint from the restrictive slot space to the continuous reconstruction manifold, encouraging slots to reach a soft consensus on collectively interpreting the visual scene rather than forcing rigid point-to-point feature alignment. Extensive experiments on complex video OCL benchmarks demonstrate that ICC avoids feature collapse and outperforms ECC baselines. Our source code, model checkpoints and training logs are provided on this https URL.
35. 【2605.30174】LiveSVG: Zero-Shot SVG Animation via Video Generation
Link: https://arxiv.org/abs/2605.30174
Authors: Matan Levy, Ran Margolin, Bar Cavia, Dvir Samuel, Yael Pritch, Shmuel Peleg, Alex Rav Acha, Ariel Shamir, Dani Lischinski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Scalable Vector Graphics, generating Scalable Vector, Score Distillation Sampling, generating Scalable, Vector Graphics
Comments: Project Page: [this https URL](https://levymsn.github.io/LiveSVG)
Abstract:We introduce LiveSVG, a zero-shot approach for generating Scalable Vector Graphics (SVG) animations using video diffusion models. Current SVG animation methods struggle with complex motions: LLM-based code synthesis fails to express fine, non-rigid Bézier deformations, while Score Distillation Sampling (SDS) provides noisy gradients and often requires category-specific priors like skeletons. In contrast, LiveSVG fits vector geometry directly to an explicitly generated target video. Given an input SVG image and a motion prompt, we generate a previewable target video using a frozen image-to-video model, then fit the original SVG to this video via differentiable rendering. Our fitting stage is skeleton-free, utilizing a dual-level motion representation that combines per-group homographies for coarse articulation with per-path Bézier control-point offsets for local deformations. To resolve color-induced correspondence ambiguities during pixel-wise fitting, we introduce a novel sphere-packing recolorization strategy. We also present ChallengeSVG, a benchmark of complex, multi-object scenes that exposes the limitations of prior work. Evaluations demonstrate that LiveSVG significantly outperforms existing methods on both AniClipart and ChallengeSVG, establishing direct reference-video fitting as a practical, robust route to prompt-aligned and fully editable vector animation.
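The dual-level motion representation (a per-group homography for coarse articulation plus per-path control-point offsets for local deformation) can be sketched as follows. The function names and the order of composition (homography first, then offsets) are assumptions made for illustration; the abstract does not specify them, and the differentiable-rendering fit itself is not shown.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Matrix3 = List[List[float]]


def apply_homography(h: Matrix3, p: Point) -> Point:
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = p
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / w, yh / w)


def deform_path(control_points: List[Point], group_h: Matrix3, offsets: List[Point]) -> List[Point]:
    """Coarse group-level homography followed by per-control-point local offsets."""
    deformed = []
    for p, (dx, dy) in zip(control_points, offsets):
        x, y = apply_homography(group_h, p)
        deformed.append((x + dx, y + dy))
    return deformed


# Hypothetical frame parameters: translate the group by (2, 3), nudge one point.
translate = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
frame_points = deform_path([(0.0, 0.0), (1.0, 1.0)], translate, [(0.0, 0.0), (0.5, -0.5)])
# frame_points == [(2.0, 3.0), (3.5, 3.5)]
```

In a fitting setup, the homography entries and the offsets would be the per-frame variables optimized against the target video.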
36. 【2605.30170】Unveiling the Visual Counting Bottleneck in Vision-Language Models
Link: https://arxiv.org/abs/2605.30170
Authors: Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large Vision-Language Models, Large Vision-Language, suffer catastrophic failures, excel at interpolation, systematic generalization
Comments: ICML 2026
Abstract:While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.
37. 【2605.30168】OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics
Link: https://arxiv.org/abs/2605.30168
Authors: Chenhao Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: traditional methods struggle, remote sensing, disaster assessment, diverse scenarios, vital for applications
Comments:
Abstract:Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.
38. 【2605.30161】Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
Link: https://arxiv.org/abs/2605.30161
Authors: Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieve strong performance, Vision-language models, achieve strong, reflects structured, understanding or reliance
Comments:
Abstract:Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: this https URL.
39. 【2605.30140】AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection
Link: https://arxiv.org/abs/2605.30140
Authors: Yi Zhang, Jiawen Zhu, Lele Fu, Guansong Pang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Benefiting from generalizability, achieved impressive detection, approaches have achieved, anomaly detection, generalizability of vision-language
Comments:
Abstract:Benefiting from the generalizability of vision-language models (VLMs) such as CLIP, many zero-/few-shot anomaly detection (AD) approaches have achieved impressive detection performance across various datasets. Nevertheless, they require substantial training on large auxiliary datasets to adapt VLMs to anomaly detection, and their inference largely relies on visual-text embedding similarity-based anomaly scores, lacking reasoning abilities to detect complex anomalies that require in-depth contextual understanding. To address this limitation, we propose \textbf{AnomalyAgent}, a novel training-free, agentic framework that leverages the advanced reasoning and generalization capabilities of multimodal large language models (MLLMs) for anomaly detection. The key ingredients include \textbf{1)} a comprehensive anomaly-centric toolset that enables adaptive MLLM-driven, agentic anomaly reasoning in zero-shot settings, and \textbf{2)} a customized memory module that grounds anomaly reasoning with few-shot, in-context reference examples. We extend evaluation beyond the detection of simple anomalies (e.g., surface defects like cracks and dents and clear lesions) in widely used benchmarks to more diverse types of anomalies such as logical/contextual anomalies in logistics and manufacturing settings. Extensive experimental results demonstrate that our AnomalyAgent achieves substantially better performance compared to training-free VLM-based AD and generic agentic methods, highlighting its superior generalization capability in both zero-shot and few-shot anomaly detection settings. The code implementation can be found at this address.
40. 【2605.30131】CCS: Clinical Consensus Selection for Radiology Report Generation
Link: https://arxiv.org/abs/2605.30131
Authors: Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language model, single-path generation task, multimodal large language, generation task, produces one decoded
Comments: 17 pages, 6 figures
Abstract:Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
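Selecting the candidate with the highest consensus across a sampled pool can be sketched in a few lines. This is only a minimal illustration under stated assumptions: `jaccard` is a stand-in text utility chosen here for simplicity, and the image-grounded utility from the multimodal embedder described in the abstract is not modeled. All function names are hypothetical.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two reports (a stand-in utility, not the paper's)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_by_consensus(candidates: List[str],
                        utility: Callable[[str, str], float] = jaccard) -> str:
    """Return the candidate with the highest mean utility against all other candidates."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    if len(candidates) == 1:
        return candidates[0]

    def consensus(i: int) -> float:
        scores = [utility(candidates[i], c) for j, c in enumerate(candidates) if j != i]
        return sum(scores) / len(scores)

    best_index = max(range(len(candidates)), key=consensus)
    return candidates[best_index]


# Made-up example pool; these strings are illustrative, not real model outputs.
pool = [
    "no acute cardiopulmonary abnormality",
    "no acute cardiopulmonary process",
    "large right pleural effusion",
    "no acute abnormality",
]
chosen = select_by_consensus(pool)
```

Passing a different `utility` (for example, a similarity computed from embeddings) changes the selection axis without changing the selection rule.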
41. 【2605.30126】PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
Link: https://arxiv.org/abs/2605.30126
Authors: Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Vision-Language Models, quadratic computational bottleneck, map visual inputs, dense token sequences, Large Vision-Language
Comments: 33 pages, 4 figures
Abstract:Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
42. 【2605.30116】SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation
Link: https://arxiv.org/abs/2605.30116
Authors: Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Distribution Matching Distillation, few-step video diffusion, Distribution Matching, Gradient Matching Distillation, Matching Distillation
Comments: ICML 2026
Abstract:Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose \textbf{Score Gradient Matching Distillation (SGMD)}. SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at this https URL.
43. 【2605.30115】Large Depth Completion Model from Sparse Observations
Link: https://arxiv.org/abs/2605.30115
Authors: Zhu Yu, Zhengyi Zhao, Runmin Zhang, Lingteng Qiu, Kejie Qiu, Yisheng He, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Depth Completion, presents the Large, Large Depth, sparse observations, work presents
Comments: ICLR 2026. Project webpage: [this https URL](https://pkqbajng.github.io/ldcm/)
Abstract:This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps using a transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) leveraging existing monocular foundation models to improve the quality of sparse depth inputs, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization strategy is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, providing a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity levels in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.
44. 【2605.30111】xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR
Link: https://arxiv.org/abs/2605.30111
Authors: Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Point cloud segmentation, Point, Point cloud, point clouds, fundamental task
Comments: 3 figures and 5 tables
Abstract:Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.
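A contrastive objective that pulls corresponding 2D and 3D features together can be sketched with an InfoNCE-style loss. The abstract does not name the exact loss, so InfoNCE, the cosine similarity, and the temperature value are all assumptions here, and `info_nce` is a hypothetical helper rather than the authors' code.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(feats_3d: List[Vector], feats_2d: List[Vector], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss; feats_3d[i] and feats_2d[i] are assumed to be a matching pair."""
    if len(feats_3d) != len(feats_2d):
        raise ValueError("both lists must contain the same number of features")
    total = 0.0
    for i, anchor in enumerate(feats_3d):
        logits = [cosine(anchor, f) / temperature for f in feats_2d]
        # Numerically stable log-sum-exp over all candidate 2D features.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(feats_3d)


# Toy features: aligned pairs should give a much lower loss than swapped pairs.
points = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(points, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(points, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss rewards per-point features that are close to the image features at the corresponding pixels and far from the rest.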
45. 【2605.30099】Evaluation of Conversational Agents: Understanding Culture, Context and Environment in Emotion Detection
Link: https://arxiv.org/abs/2605.30099
Authors: Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: social media photo, media photo tagging, highly prioritized analysis, human robots interactions, Valuable decisions
Comments: IEEE paper on arXiv
Abstract:Valuable decisions and highly prioritized analysis now depend on applications such as facial biometrics, social media photo tagging, and human-robot interactions. However, the ability to successfully deploy such applications depends on their efficiency on tested use cases, taking possible edge cases into consideration. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location or cultural difference have not been fully explored despite their relevance in resolving ethical issues and improving conversational AI (Artificial Intelligence). In this paper, we seek to address the potential challenges in the usage of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines both speech and image data to detect the seven basic emotions with a focus on also identifying sarcasm. It uses three layers of a Convolutional Neural Network in addition to a new Audio-Frame Mean Expression (AFME) algorithm and focuses on model pre-processing and post-processing stages. In the end, our proposed solution contributes to maintaining the credibility of an emotion recognition system in conversational AIs.
46. 【2605.30093】Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence
Link: https://arxiv.org/abs/2605.30093
Authors: Artur Jesslen, Olaf Dünkel, Adam Kortylewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: self-supervised vision models, Foundation features, self-supervised vision, proven effective, Stable Diffusion features
Comments: 9 pages (main paper), 21 pages (total), 4 figures
Abstract:Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over the prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.
47. 【2605.30090】DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation
Link: https://arxiv.org/abs/2605.30090
Authors: Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Long-form video generation, Long-form video, video generation, cinematic control, moving from short
Comments:
Abstract:Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.
48. 【2605.30083】Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation
Link: https://arxiv.org/abs/2605.30083
Authors: Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: previously generated tokens, generated conditioned, previously generated, future, Future Forcing
Comments:
Abstract:Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. Nevertheless, its growth with generation length introduces increasing memory and error accumulation, limiting the scalability of AR models to even longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. However, most existing methods assess token importance using short-horizon signals derived from the current or historical generation context, making these methods prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving up to 1.49 improvement in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.
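The first two steps described above (building a future query proxy from historical statistics, then scoring cached tokens under that proxy) can be sketched as follows. This is a simplified illustration under stated assumptions: the proxy is taken to be the mean of past pre-RoPE queries, the score is a plain dot product with each key, and the subspace-based merging step is omitted. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_cache_tokens(past_queries: List[Vector], keys: List[Vector], budget: int) -> List[int]:
    """Keep the `budget` cached tokens that score highest under the future-query proxy."""
    proxy = mean_vector(past_queries)          # assumed proxy: mean of historical queries
    scores = [dot(proxy, k) for k in keys]     # assumed importance: query-key dot product
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])             # return kept indices in cache order


# Toy cache with four tokens and a budget of two.
history = [[1.0, 0.0], [0.8, 0.2]]
cached_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
kept = select_cache_tokens(history, cached_keys, budget=2)
# kept == [0, 2]
```

The point of the sketch is that the ranking uses an estimate of future queries rather than only the current step's attention, which is what makes the policy future-aware.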
49. 【2605.30073】Native Audio-Visual Alignment for Generation
Link: https://arxiv.org/abs/2605.30073
Authors: Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: coherent visual-acoustic content, synthesize temporally synchronized, semantically coherent visual-acoustic, visual-acoustic content, aims to synthesize
Comments: Project page: [this https URL](https://ernie-research.github.io/NAVA/)
Abstract:Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
50. 【2605.30065】Boosting Zero-Shot 3D Style Transfer with 2D Pre-trained Priors
Link: https://arxiv.org/abs/2605.30065
Authors: Xin Dong, Yunzhi Teng, Wenfeng Deng, Yansong Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generate multi-view consistent, multi-view consistent stylized, consistent stylized views, arbitrary style image, style transfer
Comments: Accepted by IEEE IVMSP2026
Abstract:In this work, we focus on zero-shot 3D style transfer that can generate multi-view consistent stylized views of the 3D scene given an arbitrary style image. We primarily tackle the issue of data scarcity in 3D style transfer, which arises when each model is trained on only a single scene, thereby limiting the number of available content images. This scarcity significantly hampers stylization performance, as model optimization relies on a sufficient number of content-style image pairs to provide supervisory signals. Our core idea is to integrate a decoder pre-trained on large-scale 2D image datasets into the 3D style transfer pipeline, thereby leveraging the prior knowledge encoded in the decoder from learning over numerous content-style image pairs. Our method combines feature Gaussian splatting and deferred stylization, enabling high-quality stylization with the data-sufficient decoder network while ensuring view consistency by unifying view-dependent operations into a view-invariant process. Experiments demonstrate that our Data-Sufficient StyleGaussian (DS-StyleGaussian) model outperforms existing zero-shot 3D style transfer methods in terms of visual quality across various datasets. This work also suggests that 2D pre-training can serve as a strong enhancement for 3D tasks, bridging the data gap between 2D and 3D.
51. 【2605.30062】FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection
链接:https://arxiv.org/abs/2605.30062
作者:Leqi Zhu,Junyan Ye,Kaiqing Lin,Zhiyuan Yan,Conghui He,Weijia Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generative artificial intelligence, artificial intelligence technologies, Large Multimodal Models, unprecedented level, development of generative
备注:
点击查看摘要
Abstract:The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.
52. 【2605.30060】Towards Consistent Video Geometry Estimation
链接:https://arxiv.org/abs/2605.30060
作者:Zhu Yu,Jingnan Gao,Runmin Zhang,Lingteng Qiu,Zhengyi Zhao,Rui Peng,Yichao Yan,Kejie Qiu,Siyu Zhu,Si-Yuan Cao,Hui-Liang Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:feed-forward foundation model, recovering spatially dense, temporally consistent geometry, work presents ViGeo, work presents
备注: Project webpage: [this https URL](https://pkqbajng.github.io/ViGeo/)
点击查看摘要
Abstract:This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
53. 【2605.30045】GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver
链接:https://arxiv.org/abs/2605.30045
作者:Yuqing Chen,Lin Liu,Haisu Wu,Xiaopeng Zhang,Yaowei Wang,Yujiu Yang,Qi Tian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:simultaneously eliminate target, eliminate target objects, complex spatiotemporal ambiguities, removal frequently struggles, object removal frequently
备注:
点击查看摘要
Abstract:Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., $2.16$ dB and $1.44$ dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios.
54. 【2605.30038】Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models
链接:https://arxiv.org/abs/2605.30038
作者:Jaa-Yeon Lee,Yeobin Hong,Taesung Kwon,Jong Chul Ye
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:generate highly realistic, highly realistic images, models generate highly, generate highly, highly realistic
备注: ICML 2026, Project page: [this https URL](https://jaayeon.github.io/AGSM)
点击查看摘要
Abstract:Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: [this https URL](https://jaayeon.github.io/AGSM)
55. 【2605.30027】DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
链接:https://arxiv.org/abs/2605.30027
作者:Ruofan Hu,Menghui Zhu,Jieming Zhu,Bo Chen,Shengyang Xu,Minjie Hong,Xiaoda Yang,Sashuai Zhou,Li Tang,Tao Jin,Zhou Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Multimodal documents, complicate retrieval tasks, retrieval tasks, Multimodal, complicate retrieval
备注: Accepted at KDD 2026 Research Track
点击查看摘要
Abstract:Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.
56. 【2605.30011】VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
链接:https://arxiv.org/abs/2605.30011
作者:Mingjian Gao,Wenqiao Zhang,Yuqian Yuan,Yang Dai,Binhe Yu,Zheqi Lv,Haoyu Zheng,Jiaqi Zhu,Zhiqi Ge,Zixuan Wan,Siliang Tang,Yueting Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:explicit intermediate reasoning, Recent work, begun to equip, intermediate reasoning, work has begun
备注:
点击查看摘要
Abstract:Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. Besides, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8x speedup.
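The speedup quoted in this abstract follows directly from the two reported per-step latencies. A minimal check, assuming only the two figures given above (the function name `speedup` is introduced here for illustration and is not from the paper):

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new method is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("latency must be positive")
    return baseline_seconds / new_seconds


# Per-step latencies on BridgeData V2 as reported in the abstract (seconds).
ecot_latency = 8.377
visualthink_latency = 0.367

ratio = speedup(ecot_latency, visualthink_latency)
print(f"{ratio:.1f}x")  # prints "22.8x"
```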
57. 【2605.30010】EarlyTom: Early Token Compression Completes Fast Video Understanding
链接:https://arxiv.org/abs/2605.30010
作者:Hesong Wang,Xin Jin,Lu Lu,Chenhaowen Li,Jian Chen,Qiang Liu,Huan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video understanding tasks, Video large language, demonstrated strong capabilities, video understanding, understanding tasks
备注: Accepted by CVPR 2026. 16 pages, 8 figures, 8 tables. Project page: [this https URL](https://viridisgreen.github.io/EarlyTom)
点击查看摘要
Abstract:Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
58. 【2605.29997】FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views
链接:https://arxiv.org/abs/2605.29997
作者:Yihang Tao,Yu Guo,Zhengru Fang,Haonan An,Yuguang Fang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian splatting framework, Gaussian splatting, present FRUC, splatting framework, Gaussian
备注:
点击查看摘要
Abstract:We present FRUC, a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi-agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per-scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi-vehicle network as a spatio-temporally unstructured ego-centric multi-camera system, where the core challenge lies in enhancing ego-centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one-shot, calibration-free inference from a flexible number of multi-vehicle views. To achieve non-destructive geometric supplementation under uncalibrated cross-agent misalignment, FRUC first introduces an ego-centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent-wise spatio-temporal correlations. Guided by these occlusion priors, it further formulates cross-agent integration as a deterministic residual denoising process via zero-initialized injection, turning challenging cross-agent fusion into bounded residual learning for robust collaborative blind-spot completion. Through extensive evaluations on the real-world V2XReal and UrbanIng-V2X datasets, FRUC is shown to be a new state-of-the-art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.
59. 【2605.29983】Improving Adversarial Robustness of Attribution via Implicit Regularization
链接:https://arxiv.org/abs/2605.29983
作者:Amir Mehrpanah,Matteo Gamba,Hossein Azizpour
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:expensive explicit regularization, existing approaches typically, approaches typically rely, computationally expensive explicit, explicit regularization
备注: 39 pages, 22 figures, to be published in International Conference on Machine Learning 2026
点击查看摘要
Abstract:The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.
60. 【2605.29980】Genetically Aligned Patient Representations Improve Hematological Diagnosis
链接:https://arxiv.org/abs/2605.29980
作者:Muhammed Furkan Dasdelen,Fatih Ozlugedik,Ilaria Looser,Rao Muhammad Umer,Christian Pohlkamp,Carsten Marr
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:significantly improve performance, transcriptomic and genomic, shown to significantly, performance in downstream, downstream diagnostic tasks
备注: Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026
点击查看摘要
Abstract:Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at this https URL.
61. 【2605.29977】EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation
链接:https://arxiv.org/abs/2605.29977
作者:Dang Hong Nguyen,Nhi Ngoc-Yen Nguyen,Huy-Hieu Pham
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:High-fidelity ECG interpretation, extreme computational demands, edge-care remains hindered, High-fidelity ECG, clinical edge-care remains
备注: Accepted at the SD4H Workshop at ICML 2026. 11 pages, 3 figures
点击查看摘要
Abstract:High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.
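The abstract does not spell out its optimal-transport formulation, so the following is only a generic sketch of how a soft matching between mismatched token sets can be computed: entropic optimal transport solved with Sinkhorn iterations. The uniform marginals, the squared-distance cost, the toy 1-D "features", and the name `sinkhorn_plan` are all assumptions made for illustration, not details from EVL-ECG.

```python
import math


def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropic OT between n source and m target tokens with uniform marginals.

    cost[i][j] is the matching cost between source token i and target token j.
    Returns an n x m transport plan as a list of lists.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 3 teacher tokens vs. 2 student tokens, scalar features.
teacher = [0.0, 1.0, 2.0]
student = [0.1, 1.9]
cost = [[(t - s) ** 2 for s in student] for t in teacher]
plan = sinkhorn_plan(cost)
```

The plan has a different number of rows and columns, which is the point: mass from three teacher tokens is distributed over two student tokens without requiring a one-to-one token correspondence.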
62. 【2605.29954】SwInception -- Local Attention Meets Convolutions
链接:https://arxiv.org/abs/2605.29954
作者:David Hagerman,Roman Naeem,Jakob Lindqvist,Carl Lindström,Fredrik Kahl,Lennart Svensson
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:prominent choice, gained popularity, popularity as efficient, efficient encoders, Swin emerging
备注: International Conference on Pattern Recognition and Artificial Intelligence, 2024
点击查看摘要
Abstract:Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. The introduction of these multi-branch convolutions enables more direct reasoning over local, multi-scale features within the transformer block. We have also modified the decoder layers in order to capture finer details using fewer parameters. We demonstrate a performance improvement on eleven different medical datasets through extensive experimentation. We specifically showcase advancements over the previous state-of-the-art backbones on benchmark challenges like the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at this https URL.
63. 【2605.29953】Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball
链接:https://arxiv.org/abs/2605.29953
作者:Li Yin,Qin Haobin,Tomohiro Suzuki,Calvin Yeung,Mariko Isogawa,Keisuke Fujii
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:appearance similarity caused, remains challenging due, annotated multi-view data, team sports scenarios, sports scenarios remains
备注:
点击查看摘要
Abstract:Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.
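The disjoint-set-union clustering mentioned in this abstract can be illustrated with a small sketch. This is not the MAEM implementation: the pairwise cost values, the threshold, and the names `DisjointSet` and `cluster_detections` are hypothetical, and the costs are assumed to come from some cross-view measure such as an epipolar distance.

```python
class DisjointSet:
    """Minimal union-find structure with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_detections(views, cost, threshold):
    """Group detections that likely show the same person.

    views[i] is the camera index of detection i.
    cost[(i, j)] is a hypothetical cross-view matching cost (lower is better).
    """
    dsu = DisjointSet(len(views))
    for (i, j), c in sorted(cost.items(), key=lambda item: item[1]):
        # Only link detections from different cameras that match well.
        if views[i] != views[j] and c < threshold:
            dsu.union(i, j)
    groups = {}
    for i in range(len(views)):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())


# Five detections from three cameras; costs are made-up numbers.
views = [0, 0, 1, 1, 2]
cost = {(0, 2): 0.10, (1, 3): 0.20, (0, 3): 0.90,
        (1, 2): 0.80, (2, 4): 0.15, (3, 4): 0.70}
clusters = cluster_detections(views, cost, threshold=0.5)
print(clusters)  # prints [[0, 2, 4], [1, 3]]
```

Each resulting cluster would then feed a per-joint triangulation step. Note that this plain union-find can transitively merge two detections from the same camera; a practical system would need an extra check to prevent that.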
64. 【2605.29935】CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving
链接:https://arxiv.org/abs/2605.29935
作者:Zezhong Qian,Zhao Yang,Lu Tan,Zhihao Yan,Weiyi Hong,Haizhuang Liu,Yawei Jueluo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:limited geographic regions, geographic regions, Autonomous driving systems, systems are commonly, commonly trained
备注:
点击查看摘要
Abstract:Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.
65. 【2605.29932】Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression
链接:https://arxiv.org/abs/2605.29932
作者:Danylo Boiko,Viktoriia Mishkurova
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:personalized therapeutic intervention, effective long-term planning, Parkinson disease, neurodegenerative diseases, therapeutic intervention
备注: 9 pages, 5 figures, 1 table
点击查看摘要
Abstract:Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches lose anatomical detail and blur subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.
66. 【2605.29911】Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation
链接:https://arxiv.org/abs/2605.29911
作者:Adam T. Müller,Philipp J. Teuffel,Konstantin Manassis,Nicolaj C. Stache
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:sparse experimental measurements, machine learning approach, machine learning, regression from sparse, sparse experimental
备注: Presented at the 11th European Conference for Aeronautics and Aerospace Sciences (EUCASS), 2025, DOI: [https://doi.org/10.13009/EUCASS2025-285](https://doi.org/10.13009/EUCASS2025-285)
点击查看摘要
Abstract:We propose a machine learning approach for image regression from sparse experimental measurements. We show the application of the proposed method on film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned by input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE 8%, SSIM 93%) while maintaining accuracy with a 30% reduction of measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.
67. 【2605.29894】Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning
链接:https://arxiv.org/abs/2605.29894
作者:Yaowu Fan,Tao Han,Dazhao Du,Andy J. Ma,Jia Wan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent progress, visual, progress in computer, produced a wide, wide range
备注:
点击查看摘要
Abstract:Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.
68. 【2605.29891】DVSM: Decoder-only View Synthesis Model Done Right
链接:https://arxiv.org/abs/2605.29891
作者:Cheng Sun,Jaesung Choe,Min-Hung Chen,Ryo Hachiuma,Yu-Chiang Frank Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent Large View, Recent Large, Large View Synthesis, Large View, Recent
备注: Code at [this https URL](https://github.com/NVLabs/dvsm)
点击查看摘要
Abstract:Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.
69. 【2605.29881】Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering
Link: https://arxiv.org/abs/2605.29881
Authors: Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large vision-language models, Large vision-language, Adaptive Closed-form Steering, Barrier-Regulated Adaptive Closed-form, input image
Comments:
Abstract:Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.
70. 【2605.29879】DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding
Link: https://arxiv.org/abs/2605.29879
Authors: Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Integrating open-vocabulary semantic, Integrating open-vocabulary, representations is essential, Integrating, scene
Comments: 9 pages, 6 figures
Abstract:Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at this https URL
71. 【2605.29868】Ciphera: A Decentralised Biometric Identity Framework
Link: https://arxiv.org/abs/2605.29868
Authors: Ankit Kanaiyalal Prajapati, Shahzad Memon, Mohammed Mahir Rahman, Ameer Al-Nemrat
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Keywords: systems expose users, Centralised biometric identity, irreversible biometric compromise, opaque verification processes, identity systems expose
Comments: Accepted at the CyberAI 2026 Conference, and to be indexed at IEEE-Scopus
Abstract:Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) offer stronger privacy guarantees, yet their integration with biometric authentication and distributed verification remains insufficiently explored. This paper presents Ciphera, a decentralised biometric identity framework combining privacy-preserving facial recognition, multi-node verification, IPFS-based credential metadata storage, and blockchain-anchored revocation. Evaluated across functional, performance, security, and distributed consistency dimensions, Ciphera achieved an 81% functional success rate, with stable enrolment and authentication but measurable revocation propagation delays and occasional audit-log inconsistencies. Performance testing demonstrated sub-second p95 verification latency of approximately 820ms under concurrent multi-node conditions. Security analysis confirmed strong confidentiality and integrity guarantees, though incomplete liveness detection leaves susceptibility to deepfake and replay attacks. The results demonstrate the feasibility of decentralised biometric identity while identifying key engineering challenges for production-grade deployment.
72. 【2605.29858】Masked Diffusion Vision-Language Models for Temporal Action Localization
Link: https://arxiv.org/abs/2605.29858
Authors: Fengshun Wang, Zhengbo Zhang, Zhigang Tu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: end times precisely, requires recognizing, untrimmed videos, recognizing the target, target event
Comments:
Abstract:Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-conditioned outputs, but their autoregressive decoders still generate tokens from left to right, preventing later semantic evidence from revising earlier timestamp predictions. We adapt masked diffusion vision-language models (MDVLMs) to TAL so that semantic tokens and boundary tokens remain editable throughout iterative denoising with bidirectional attention, allowing temporal boundaries and semantic content to be refined jointly. Direct adaptation, however, creates two TAL-specific mismatches: standard masked diffusion training corrupts all positions uniformly at random, but the time tokens are more reliable when enough semantic context is available; and token-level cross-entropy does not reflect temporal IoU. To address these mismatches, we introduce a Planned Training Objective that uses boundary-aware masking and step-weighted reconstruction to rehearse the late recovery of time tokens, together with a Step-Level IoU Reward that provides overlap-aware supervision during denoising. A standard sequence-level cross-entropy term provides the base reconstruction signal. Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that MDVLM-TAL improves both temporal reasoning and boundary localization over autoregressive vision-language baselines, with especially strong gains under stricter temporal IoU criteria.
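The abstract points out that token-level cross-entropy does not reflect temporal IoU. As a reminder of what that metric measures, the following is a minimal sketch of temporal IoU between a predicted and a ground-truth segment. The function name `temporal_iou` and the `(start, end)` tuple convention are illustrative assumptions and are not taken from the paper's code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # The overlap is zero when the two segments are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction of `(2.0, 6.0)` against a ground truth of `(4.0, 8.0)` overlaps for 2 seconds out of a 6-second union, giving an IoU of 1/3. Stricter evaluation thresholds simply require this ratio to be higher before a prediction counts as correct.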
73. 【2605.29856】Building and Road Recognition in Dense Urban Informal Settlements: A Dataset and Benchmark
Link: https://arxiv.org/abs/2605.29856
Authors: Hongyu Long, Jiaxuan Liu, Rui Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: present significant challenges, villages present significant, sustainable urban development, development and governance, urban villages present
Comments: 5 pages, 4 figures
Abstract:As a widespread form of informal settlement, urban villages present significant challenges for sustainable urban development and governance. Precise mapping of their infrastructure is essential; however, existing remote sensing datasets primarily focus on formal urban environments and lack fine-grained annotated data for the high-density building patterns and narrow road networks typical of urban villages. To address this gap, we introduce the DenseUIS dataset, the first high-resolution remote sensing dataset specifically designed for building and road extraction in extremely dense urban informal settlements, covering 126 urban villages across Shenzhen and Guangzhou in China. Furthermore, we conduct a comprehensive evaluation of state-of-the-art deep learning models on this dataset. Experimental results reveal the limitations of existing methods in handling the unique morphological patterns of dense informal settlements, underscoring the need for specialized approaches. DenseUIS therefore provides a robust benchmark for advancing fine-grained urban mapping in complex and high-density informal environments. The dataset is publicly available at this https URL.
74. 【2605.29852】Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring
Link: https://arxiv.org/abs/2605.29852
Authors: Youhan Huang, Jiajun Li, Yilin Fang, Shuai Wang, Chuheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: Fatty Liver Disease, Non-Alcoholic Fatty Liver, NAFLD Activity Score, diagnosing Non-Alcoholic Fatty, correlated NAFLD Activity
Comments: 6 pages, 5 figures, 2 tables. Accepted by IEEE ICME 2026. Camera-ready version
Abstract:Histological scoring is essential for diagnosing Non-Alcoholic Fatty Liver Disease (NAFLD), yet its automation remains challenging due to the high annotation cost and negative transfer among the strongly correlated NAFLD Activity Score (NAS) indicators in multi-task learning. To address this issue, we propose a subspace-decoupled multi-task Vision Transformer (ViT) that integrates lightweight task-specific Adapters with orthogonality-based constraints. This design constructs independent feature subspaces for steatosis, ballooning, and inflammation, effectively reducing task interference while retaining shared representations. We further construct a curated multi-task mouse NAFLD histology dataset with expert annotations for all NAS components. Experimental results demonstrate that the proposed method improves multi-task stability and generalization with substantially reduced computational cost compared to training separate single-task models. The code and the curated dataset have been prepared and will be made publicly available upon acceptance to support reproducibility.
75. 【2605.29827】Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging
Link: https://arxiv.org/abs/2605.29827
Authors: Milad Masroor, Cuong Nguyen, Kevin Wells, Gustavo Carneiro
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: threatening clinical safety, exhibit performance disparities, multiple demographic attributes, demographic attributes, threatening clinical
Comments: Pre-review version submitted to MICCAI 2026. 10 pages, 5 figures
Abstract:Medical image analysis models can exhibit performance disparities across patient subgroups, threatening clinical safety and fairness. Existing methods typically address this issue by optimizing accuracy and fairness metrics for visible demographic attributes (e.g., sex or age) considered in isolation. This strategy not only overlooks potentially more informative latent stratifications, which may reveal deeper sources of model error and inequity, but also fails to scale when multiple demographic attributes are considered simultaneously due to the resulting sparsity of training data within each subgroup. We address these issues by introducing the label-free hidden-cohort fairness (LHCF) training paradigm, which, instead of maximizing fairness over visible demographic attributes, optimizes fairness across latent subpopulations discovered from image appearance. By clustering images into K appearance-based cohorts and applying fairness optimization over them, LHCF uncovers underlying sources of model error and avoids the combinatorial sparsity of multi-demographic attributes, reducing disparities across both single and multiple demographic attributes. We demonstrate on our proposed fairness benchmark, HIDFairBench, that LHCF provides state-of-the-art fairness results on single and multiple demographic attributes, despite never using demographic labels for training. Our results position hidden-cohort fairness as a practical, scalable, and robust alternative to demographic-based fairness optimization for trustworthy medical image analysis.
76. 【2605.29812】Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval Using Language
Link: https://arxiv.org/abs/2605.29812
Authors: Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, Yu Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video Moment Retrieval, Moment Retrieval, Video Moment, query, Open-Set Video Moment
Comments: Published in ACM MM 2024
Abstract:Video Moment Retrieval (VMR) aims to retrieve the specific moment corresponding to a sentence query from an untrimmed video. Although recent works have made remarkable progress on this task, they are implicitly rooted in the closed-set assumption that all given queries are video-relevant. (In this paper, we treat a "video-relevant query" as an "in-distribution (ID) query" and a "video-irrelevant query" as an "out-of-distribution (OOD) query".) Given an OOD query in open-set scenarios, they still use it and return a wrong retrieval, which might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we creatively explore a brand-new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve the precise moments based on ID queries, but also reject OOD queries. In this paper, we make the first attempt to step toward OS-VMR and propose a novel model, OpenVMR, which first distinguishes ID and OOD queries using normalizing flows, and then conducts moment retrieval based on ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, and assume the ID query distribution obeys a multivariate Gaussian distribution. Then, we introduce an uncertainty score to search for the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling together ID query features. Besides, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three VMR datasets show the effectiveness of our OpenVMR.
77. 【2605.29809】Cert-LAS: Toward Certified Model Ownership Verification for Text-to-Image Diffusion Models via Layer-Adaptive Smoothing
Link: https://arxiv.org/abs/2605.29809
Authors: Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu, Dacheng Tao
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: unprecedented creative applications, intellectual property concerns, enabled unprecedented creative, increasingly critical, creative applications
Comments: This paper has been accepted to the International Conference on Machine Learning (ICML) 2026. 26 pages
Abstract:Large-scale text-to-image (T2I) diffusion models have enabled unprecedented creative applications, but their unauthorized use has raised serious intellectual property concerns, making model ownership verification (MOV) increasingly critical. We find that existing backdoor-based diffusion watermarking methods often (implicitly) assume a "faithful" verification process, namely, that the verifier can query a suspicious model and obtain the faithful watermark response to complete MOV. However, in practice, adversaries may intentionally or unintentionally damage potential watermark signals, significantly degrading verification reliability. To address this issue, we propose Cert-LAS, the first certified MOV method for T2I models based on layer-adaptive smoothing. In general, Cert-LAS embeds specified watermarks using diffusion classifiers and an LFS-guided layer-adaptive noise, and verifies ownership by examining whether the suspected model exhibits significantly stronger watermark responses compared to unwatermarked references through hypothesis testing. We further prove that, under certain conditions, our Cert-LAS can still achieve reliable verification even in the presence of malicious removal attacks. Extensive experiments validate the effectiveness of Cert-LAS and its resistance to adaptive attacks. Our code is available at this https URL.
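The verification step described above compares watermark responses of a suspected model against unwatermarked references through hypothesis testing. The abstract does not say which test is used, so the sketch below substitutes a generic exact one-sided permutation test on the difference of means. The function name and the idea of a scalar "response score" per query are assumptions for illustration, and the certified smoothing component of the method is not modelled here.

```python
from itertools import combinations


def one_sided_permutation_pvalue(suspect, reference):
    """Exact p-value for H1: mean(suspect) > mean(reference).

    Generic permutation test used as a stand-in; the paper's actual
    test statistic is not specified in the abstract.
    """
    pooled = list(suspect) + list(reference)
    n = len(suspect)
    m = len(pooled) - n
    observed = sum(suspect) / n - sum(reference) / m
    total = sum(pooled)
    hits = 0
    splits = 0
    # Enumerate every way of relabelling n pooled scores as "suspect".
    for idx in combinations(range(len(pooled)), n):
        group_sum = sum(pooled[i] for i in idx)
        diff = group_sum / n - (total - group_sum) / m
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            hits += 1
        splits += 1
    return hits / splits
```

With four clearly higher suspect scores and four reference scores, only the observed labelling reaches the observed gap, so the p-value is 1/70 (one of the 70 possible splits). Exhaustive enumeration is only practical for small samples; a sampled permutation test would be the usual replacement for larger ones.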
78. 【2605.29801】AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
Link: https://arxiv.org/abs/2605.29801
Authors: Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Modern open-world agents, exhibit powerful cross-environment, Modern open-world, OpenClaw exhibit powerful, powerful cross-environment execution
Comments: 44 pages, 12 Figures, 9 Tables
Abstract:Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.
79. 【2605.29798】Low-Magnification SEM May Suffice: Interpretable Deep Learning for Multi-Scale Fracture-Cause Classification in Zirconia-Toughened Alumina
Link: https://arxiv.org/abs/2605.29798
Authors: Julian Schmid, Pawel Astankow, Tom Vater, Julius Beck, Robert Cichon, Danny Krautz
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Image and Video Processing (eess.IV)
Keywords: scanning electron microscopy, alumina matrix composite, matrix composite hip, Reliable identification, high-magnification scanning electron
Comments:
Abstract:Reliable identification of fracture origins in alumina matrix composite hip and knee implants is critical for quality assurance and patient safety, yet current fractographic workflows are time-consuming, partly subjective, and reliant on high-magnification scanning electron microscopy (SEM). We present an interpretable vision-transformer (ViT) workflow for automated classification of fracture causes in an alumina matrix composite (BIOLOX delta, CeramTec GmbH) widely used in total joint replacements. A dataset of 8,493 SEM images (50x-10,000x) was curated from five years of in-production burst and proof tests and annotated into three defect categories defined along the manufacturing chain: green body, hard machining, and material defects. Under severe class imbalance, the fine-tuned ViT reached an accuracy of 0.907 and a macro-F1 of 0.888 in stratified five-fold cross-validation, with a two-stage perceptual-hash/SSIM leakage audit confirming negligible specimen overlap. Notably, performance at low magnification (50x) was comparable to that at high magnification (1k-10kx), indicating that macro-scale features - mirror geometry and hackle line fields - already encode sufficient diagnostic signal. Grad-CAM attributions consistently localised on canonical fractographic cues (mirrors, hackles, pores, machining marks), aligning with established fractographic criteria. Together, these results position interpretable ViTs as a complementary tool for ceramic-implant quality assurance, enabling low-magnification pre-screening and reducing reliance on time-intensive high-magnification inspection.
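The abstract reports both accuracy and macro-F1 under severe class imbalance. The sketch below shows why the two can diverge, using made-up labels for the defect categories named above. The function name and the toy data are assumptions for illustration and have no connection to the paper's dataset or results.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-F1) for two equally long label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2TP / (2TP + FP + FN); defined as 0 when empty.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro averaging weights every class equally, regardless of its size.
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy example: a classifier that always predicts the majority class.
y_true = ["green body"] * 8 + ["material"] * 2
y_pred = ["green body"] * 10
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
```

In this toy case the accuracy is 0.8, while the macro-F1 is only 4/9 (about 0.44), because the minority class contributes an F1 of zero. A macro-F1 close to the accuracy therefore suggests that minority classes are also being recognised.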
80. 【2605.29793】Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language
Link: https://arxiv.org/abs/2605.29793
Authors: Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: untrimmed video, aims to locate, video, VMR methods, VMR
Comments: Published in AAAI 2024
Abstract:Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Since the video is downsampled into fixed-length clips, some query-related frames may be filtered out, which blurs the boundary of the target moment and causes adjacent irrelevant frames to be taken as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Moreover, SpotVMR can serve as a plug-and-play module that improves the efficiency of state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. In addition, a distillation loss is used to address the optimization issues arising from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.
81. 【2605.29776】Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning
Link: https://arxiv.org/abs/2605.29776
Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Cross-Domain Few-Shot Learning, strong zero-shot generalization, demonstrate strong zero-shot, CLIP demonstrate strong, Few-Shot Learning
Comments: Accepted by ICML 2026
Abstract:Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance significantly degrades in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing away certain low-similarity image tokens, termed "tail tokens", from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under large domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief in alignment is valid only for tokens that already contain sufficient semantic information. For tail tokens, forcing the alignment leads to excessive overfitting to the scarce training data, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm into an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our code is available at this https URL.
82. 【2605.29773】Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation
Link: https://arxiv.org/abs/2605.29773
Authors: Boyuan Zhang, Huanshan Huang, Yifei Cao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: accurate dense prediction, Reliable semantic segmentation, mobile robots requires, robust uncertainty estimation, Monte Carlo Dropout
Comments: 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)
Abstract:Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at this https URL
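The fusion step described above (standardize each score with in-distribution statistics, then take a convex combination) can be sketched as follows. This is a minimal illustration under stated assumptions: the energy score uses the common form, minus temperature times log-sum-exp of the logits, the NECO-style ratio is treated as a precomputed input, and both scores are assumed to be oriented so that higher means more likely OOD. The function names, the `alpha` weight, and its default value are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, pstdev


def energy_score(logits, temperature=1.0):
    """Logit-based energy, -T * logsumexp(logits / T), for one pixel."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * lse


def fit_standardizer(id_scores):
    """Fit (mean, std) on scores from an in-distribution validation split."""
    mu = mean(id_scores)
    sigma = pstdev(id_scores)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero on constant scores
    return mu, sigma


def fuse(neco, energy, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of the two z-scored scores (alpha is assumed)."""
    z_neco = (neco - neco_stats[0]) / neco_stats[1]
    z_energy = (energy - energy_stats[0]) / energy_stats[1]
    return alpha * z_neco + (1.0 - alpha) * z_energy
```

Standardizing first puts the geometric ratio and the energy on a comparable scale, so the mixing weight reflects relative trust in each score and not their raw units. Everything here runs in a single forward pass, since both inputs come from the same decoder features and logits.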
83. 【2605.29762】GeoMag: Geometric-Aware Video Motion Magnification via State Space Model
Link: https://arxiv.org/abs/2605.29762
Authors: Kecheng Han, Yuchen Zhang, Bingqing Liu, Boqiang Guo, Wenbin Zheng, Shiyuan Pei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reveals imperceptible dynamics, Video Motion Magnification, Motion Magnification, reveals imperceptible, imperceptible dynamics
Comments: ICME 2026 Spotlight
Abstract:Video Motion Magnification (VMM) reveals imperceptible dynamics but often suffers from structural inconsistencies under complex geometric transformations. Existing learning-based methods generally face a trade-off between the limited global context of CNNs and the high computational cost of Transformers. In addition, current training protocols, largely dominated by simple linear motion, fail to capture the geometric and imaging complexities encountered in real-world videos. To address these issues, we propose GeoMag, a geometric-aware VMM framework built upon State Space Models to achieve globally consistent motion amplification with linear complexity. We further construct Geo-200K, a large-scale synthetic dataset that introduces rich geometric transformations together with sensor-realistic degradations, improving the diversity and realism of training signals. Extensive experiments on synthetic and real-world benchmarks show that GeoMag consistently outperforms prior methods in visual fidelity and computational efficiency, while producing fewer artifacts and better structural consistency.
84. 【2605.29761】S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields
Link: https://arxiv.org/abs/2605.29761
Authors: Deniz Sayin Mercadier, Federico Stella, Aurel Bizeau, Nicolas Talabot, Pascal Fua
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
Keywords: Signed Distance Field, Distance Field, Signed Distance, Compositional implicit surface, implicit surface representations
Comments:
Abstract:Compositional implicit surface representations model scenes as collections of objects, each encoded by a Signed Distance Field (SDF). A fundamental limitation of this approach is that multiple SDFs can produce geometries that interpenetrate, violating physical plausibility. Existing mitigation strategies rely on soft penalty terms that reduce but do not eliminate intersections, and require careful loss weighting. To truly prevent interpenetration, we propose a hard constraint on vector-valued SDFs and introduce S2MDF, a lightweight plug-and-play module that enforces the constraint on any object-compositional SDF representation without architectural modifications. It introduces negligible computational overhead and is compatible with linearly-interpolated standard meshing algorithms such as Marching Cubes. It can be applied during training or as a post-processing step. Experiments on multiple state-of-the-art compositional methods show that S2MDF reduces intersections to numerical precision while preserving reconstruction quality, outperforming existing mitigation strategies.
85. 【2605.29726】SLAD: Shared LoRA Adapters for Task Specific Distillation
Link: https://arxiv.org/abs/2605.29726
Authors: Reda Bensaid, Yassir Bendou, Vincent Gripon, François Leduc-Primeau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: adapting reduced-size foundation, reduced-size foundation models, foundation model, embedded systems, adapting reduced-size
Comments: CVPR Findings 2026
Abstract:In the context of resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has recently motivated the emerging setting of task-specific distillation, where a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using a larger version of the same foundation model to assist the adaptation of a smaller one. Typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its performance, recent work showed that probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a mis-alignment in feature representation between the teacher and the student which occurs during the teacher's fine-tuning. Inspired by existing efforts to preserve previously learned knowledge, we first propose leveraging low-rank adaptation, resulting in better feature alignment and therefore better knowledge transfer. Drawing from this insight, we further enhance the feature alignment through a parameter-sharing strategy of the adapters between the two encoders during joint training. Our proposed method, SLAD, shows better feature alignment between the teacher and student, which results in increased performance for not only the student but also the teacher model, while being 2x faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, achieving state-of-the-art performance in the task-specific distillation framework.
86. 【2605.29720】Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets
Link: https://arxiv.org/abs/2605.29720
Authors: Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang, Yin Xie, Ziyong Feng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: propose Intrinsic Quality, Intrinsic Quality, validation-free metric designed, Global Representation Subspace, Representation Subspace Complexity
Comments: ICML 2026
Click to view abstract
Abstract:We propose Intrinsic Quality (IQ), a validation-free metric designed to estimate the inherent potential of face recognition (FR) datasets to produce high-performance models without the need for full-scale training. IQ integrates two components: (i) a Neighbor-Consistency Score that quantifies local identity label agreement via nearest neighbors, and (ii) Global Representation Subspace Complexity (Effective Rank, ER), which captures the underlying embedding geometry and dataset diversity. IQ allows for rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation prior to resource-intensive full-scale training. We describe an experimental protocol tailored to clean, noisy, and mixed-quality FR datasets, and outline evaluation methodologies to validate IQ's predictive power for downstream performance.
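The two ingredients named in this abstract can be illustrated with a small sketch. It assumes a cosine-similarity k-nearest-neighbour vote for the consistency score and the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values). The abstract does not give the exact formulas or how the two terms are combined into IQ, so the function names and definitions below are assumptions for illustration only.

```python
import math


def neighbor_consistency(embeddings, labels, k=1):
    """Mean fraction of each sample's k nearest neighbours (by cosine
    similarity) that carry the same identity label as the sample."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    scores = []
    for i, (emb, label) in enumerate(zip(embeddings, labels)):
        # Similarity to every other sample, most similar first.
        others = [(cosine(emb, other), labels[j])
                  for j, other in enumerate(embeddings) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)
        top = others[:k]
        scores.append(sum(1 for _, lab in top if lab == label) / len(top))
    return sum(scores) / len(scores)


def effective_rank(singular_values):
    """exp(Shannon entropy) of the singular values normalized to sum to 1.
    The singular values are assumed to be precomputed elsewhere."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A spectrum with four equal singular values has effective rank 4, while a spectrum dominated by one value approaches 1, which is why the quantity is often read as a diversity measure.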
87. 【2605.29691】Unsupervised Semantic Segmentation Facilitates Model Understanding
Link: https://arxiv.org/abs/2605.29691
Authors: Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Self-supervised learning, pretrained representations support, downstream tasks, support a wide, Self-supervised
Comments:
Click to view abstract
Abstract:Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.
88. 【2605.29673】A Geometric View of SRC: Learning Representations for Stable Residual Inference
Link: https://arxiv.org/abs/2605.29673
Authors: Vangelis P. Oikonomou
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Sparse Representation Classification, Reconstruction-based inference assigns, comparing class-wise reconstruction, Sparse Representation, Representation Classification
Comments: 37 pages
Click to view abstract
Abstract:Reconstruction-based inference assigns a class by comparing class-wise reconstruction residuals; Sparse Representation Classification (SRC) is a canonical instance whose reliability depends on the geometry of the learned representation. We adopt a strict training-inference separation: SRC is used only as a fixed test-time rule and is never differentiated, unrolled, or optimized during training. In a span-level idealization based on class-conditional spans and their associated projection residuals, we formalize residual-ordering stability through a residual margin and characterize geometric obstructions -- span overlap, dominance, and near-overlap via small principal angles -- that can collapse this margin in worst-case directions. This span-level theory is primary: it specifies when the idealized residual family is well-separated, and it provides a conditional solver-level interpretation for practical residual approximations (e.g., OMP) insofar as they remain close to the span-level residual ordering. Under explicit coverage and separation assumptions, we derive a quantitative lower bound on the (idealized) residual margin. Guided by these targets, we propose geometry-shaping objectives that promote masked within-class self-expressiveness, discourage cross-class reconstruction pathways and inter-class span alignment, and prevent collapse -- without invoking SRC residuals or predictions during training. Experiments on images (COIL-100), text (TREC), and EEG connectivity evaluate all representations under identical fixed SRC/OMP inference and report residual margins and geometric diagnostics; cross-entropy is included only as a reference geometry under the same evaluation protocol.
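The span-level idealization described here can be sketched as follows: project the test vector onto each class-conditional span, compare the projection residuals, and report the gap between the two smallest. This is a minimal illustration and not the paper's procedure. The helper names are hypothetical, and the margin used (second-smallest residual minus smallest) is one plausible reading of "residual margin"; the paper's definition may differ.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthonormal basis for the span of `vectors`."""
    basis = []
    for vec in vectors:
        work = list(vec)
        for b in basis:
            coeff = sum(x * y for x, y in zip(work, b))
            work = [x - coeff * y for x, y in zip(work, b)]
        norm = math.sqrt(sum(x * x for x in work))
        if norm > tol:  # skip vectors already inside the span
            basis.append([x / norm for x in work])
    return basis


def projection_residual(x, vectors):
    """Norm of x minus its orthogonal projection onto span(vectors)."""
    residual = list(x)
    for b in orthonormal_basis(vectors):
        coeff = sum(p * q for p, q in zip(residual, b))
        residual = [p - coeff * q for p, q in zip(residual, b)]
    return math.sqrt(sum(p * p for p in residual))


def classify_with_margin(x, class_atoms):
    """Return (predicted class, margin) where margin is the gap between the
    second-smallest and smallest class-wise residuals (assumed definition)."""
    residuals = {c: projection_residual(x, atoms)
                 for c, atoms in class_atoms.items()}
    ranked = sorted(residuals, key=residuals.get)
    margin = residuals[ranked[1]] - residuals[ranked[0]]
    return ranked[0], margin
```

When two class spans overlap or sit at a small principal angle, both residuals shrink together and the margin collapses, which is the obstruction the abstract refers to.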
89. 【2605.29662】SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation
Link: https://arxiv.org/abs/2605.29662
Authors: Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Real-time inference, robotic control, essential for robotic, Real-time, pruning
Comments:
Click to view abstract
Abstract:Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.
90. 【2605.29661】Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning
Link: https://arxiv.org/abs/2605.29661
Authors: Yiyao Ma, Kai Chen, Zhongxiang Zhou, Zhuheng Song, Dongsheng Xie, Zelong Tan, Rong Xiong, Qi Dou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: geometric understanding, significant challenge, recovery is fundamental, fundamental to geometric, remains a significant
Comments: 20 pages, 12 figures, accepted by ICML 2026
Click to view abstract
Abstract:Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a category-level shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks. Project homepage: this https URL
91. 【2605.29657】OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning
Link: https://arxiv.org/abs/2605.29657
Authors: Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: prefill stage expensive, Vision-language models, rely on long, computation and memory, prefill stage
Comments: 26 pages, 8 figures
Click to view abstract
Abstract:Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
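The difference between a fixed top-K budget and a reference-based test can be shown with a toy sketch. It assumes that each visual token has a scalar attention score and that the reference is the largest attention received by any register token. The abstract does not specify how the thresholds are derived, so `prune_by_register_reference` and its `scale` parameter are hypothetical stand-ins and not OccamToken's actual rule.

```python
def prune_by_register_reference(visual_attn, register_attn, scale=1.0):
    """Keep indices of visual tokens whose attention exceeds a threshold
    derived from register-token attention (assumed rule: scale * max)."""
    threshold = scale * max(register_attn)
    return [i for i, score in enumerate(visual_attn) if score > threshold]


def retention_ratio(kept_tokens, total_tokens):
    """Fraction of visual tokens that survive pruning."""
    return kept_tokens / total_tokens
```

Because the threshold moves with the register attention of each input, the number of retained tokens varies per image and query instead of being fixed in advance. The reported LLaVA-NeXT figure is consistent with the stated regime: `retention_ratio(40, 2880)` is about 0.0139, which rounds to 1.4%.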
92. 【2605.29655】SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation
Link: https://arxiv.org/abs/2605.29655
Authors: Yuan Li, Congyi Zhang, Xifeng Gao, Xiaohu Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: large language models, multimodal large language, Autoregressive multimodal large, high-resolution shapes due
Comments:
Click to view abstract
Abstract:Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10$\times$ speedup over prior methods.
Submission history: v1 submitted by Yuan Li on Thu, 28 May 2026 09:17:11 UTC (7,342 KB)
93. 【2605.29647】MARTIAN: A Rendering Framework for Aerial Mars Imagery from HiRISE Orbital Data
Link: https://arxiv.org/abs/2605.29647
Authors: Dario Pisanti, Georgios Georgakis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse illumination conditions, requires vision-based pipelines, Mars requires vision-based, diverse illumination, Martian surface
Comments:
Click to view abstract
Abstract:Aerial navigation on Mars requires vision-based pipelines that are robust to the diverse illumination conditions and terrain morphology of the Martian surface. A key bottleneck for training and evaluating such methods is the scarcity of large-scale, annotated aerial datasets. We present MARTIAN, an open-source Blender-based rendering framework that leverages real HiRISE orbital map products to synthesize realistic aerial views of the Martian terrain under controllable lighting conditions and at varying altitudes. MARTIAN generates observations with accurate pose annotations, directly addressing the scarcity of training data for vision-based navigation on Mars. The framework has been validated through its deployment in concurrent work on map-based localization systems for Ingenuity and future Mars rotorcraft, where synthetically trained deep image matchers were successfully evaluated on real Mars imagery. MARTIAN is publicly available at: this https URL.
94. 【2605.29643】AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning
Link: https://arxiv.org/abs/2605.29643
Authors: Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Keywords: Large Language Models, aggregate evidence distributed, Multimodal Large Language, Current Multimodal Large, requiring models
Comments:
Click to view abstract
Abstract:Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at this https URL.
95. 【2605.29615】DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?
Link: https://arxiv.org/abs/2605.29615
Authors: Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: high-level image-text alignment, made strong progress, perceive subtle visual, differences remains limited, image-text alignment
Comments:
Click to view abstract
Abstract:Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only 40.7% of true changes, with Hard-tier Recall below 23% for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
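The grounding gate mentioned in this abstract can be pictured with a minimal sketch. It assumes images are equal-sized 2D grids of pixel values and that the target element is described by an axis-aligned bounding box. The function names, the box convention, and the requirement that at least one pixel changes are assumptions for illustration; the benchmark's own implementation is not described in the abstract.

```python
def changed_pixels(img_a, img_b):
    """Coordinates (row, col) where two equal-sized images differ."""
    return [(r, c)
            for r, row in enumerate(img_a)
            for c, pixel in enumerate(row)
            if pixel != img_b[r][c]]


def passes_grounding_gate(img_a, img_b, bbox):
    """True if the images differ and every changed pixel lies inside bbox.

    bbox is (top, left, bottom, right) with top/left inclusive and
    bottom/right exclusive (an assumed convention).
    """
    top, left, bottom, right = bbox
    diff = changed_pixels(img_a, img_b)
    return bool(diff) and all(top <= r < bottom and left <= c < right
                              for r, c in diff)
```

A mutation that reflows neighbouring elements would change pixels outside the box and be rejected, so the retained pairs have a single, localized ground-truth difference.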
96. 【2605.29610】Learning Context-Conditioned Predicate Semantics via Prototype Feedback
Link: https://arxiv.org/abs/2605.29610
Authors: NamGyu Jung, Chang Choi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: scene graph generation, modeling polysemous predicates, graph generation, central challenge, challenge is modeling
Comments: Accepted at ICML 2026. Code: [this https URL](https://github.com/Namgyu97/AlignG-SGG.pytorch)
Click to view abstract
Abstract:In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image-specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers context-conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over state-of-the-art baselines, with F@100 improvements of +1.4 on VG-150 and +2.7 on GQA-200 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at this https URL.
97. 【2605.29602】CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning
Link: https://arxiv.org/abs/2605.29602
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, enhancing Multimodal Large, integrating external visual, knowledge-intensive question answering
Comments: Accepted in CVPR 2026
Click to view abstract
Abstract:Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.
98. 【2605.29599】How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments
Link: https://arxiv.org/abs/2605.29599
Authors: Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: identify traversable regions, enabling precise classification, enabling precise, traversable regions, crucial for autonomous
Comments: 8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Click to view abstract
Abstract:Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.
99. 【2605.29592】Non-Forgetting Knowledge Allocation with Bi-level Competition for Class-Incremental Learning
Link: https://arxiv.org/abs/2605.29592
Authors: Xiang Tan, Run He, Yawen Cui, Mengchen Zhao, Yan Wu, Tianyi Chen, Huiping Zhuang, Xiaonan Luo, Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Class-Incremental Learning, sequentially adapt PTMs, aims to sequentially, sequentially adapt, adapt PTMs
Comments:
Click to view abstract
Abstract:Class-Incremental Learning (CIL) with pre-trained models (PTMs) aims to sequentially adapt PTMs to new categories without forgetting old knowledge. Built upon PTMs, existing adapter-based methods mainly train models via distinct task-specific adapters, and present a uniform knowledge allocation for each adapter during inference. However, this allocation mechanism ignores the nature of task discrepancy and leads to suboptimal utilization of adapters. Also, under CIL constraint, an allocator is prone to forgetting when tasks evolve. To address these issues, we propose a Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC). NoFA-BC constructs a non-forgetting allocator (NFA) by transforming the allocator training into a recursive least-squares problem and achieves an allocator equivalent to that trained with all data. Based on the NFA, a Bi-Level Competition (BLC) including an intra-task level Winner-Takes-All (WTA) mechanism and inter-task Last-Ones-Fall (LOF) elimination is proposed to provide better allocation of adapter knowledge. WTA extracts the most significant logit within a task to represent the adapter's contribution and LOF suppresses the irrelevant adapters. With BLC, participation ratio of each adapter can be tailored for each input. Moreover, a Stability Enhancement (SE) process is incorporated to further improve the performance of old tasks.
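The claim that a recursively trained allocator can match one trained on all data rests on a standard property of recursive least squares (RLS). The sketch below shows that property for a plain ridge regressor with a scalar target, using the Sherman-Morrison rank-one update. It is a textbook RLS illustration under those assumptions, not the NoFA-BC allocator itself, and the function names are hypothetical.

```python
def rls_init(dim, ridge=1.0):
    """Start from P = I / ridge and zero weights (ridge-regularized RLS)."""
    P = [[(1.0 / ridge) if i == j else 0.0 for j in range(dim)]
         for i in range(dim)]
    w = [0.0] * dim
    return P, w


def rls_update(P, w, x, y):
    """Fold one sample (x, y) into the solution without revisiting old data."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(xi * pxi for xi, pxi in zip(x, Px))
    # Sherman-Morrison: P_new = P - (P x)(P x)^T / (1 + x^T P x); P is symmetric.
    P_new = [[P[i][j] - Px[i] * Px[j] / denom for j in range(n)]
             for i in range(n)]
    error = y - sum(wi * xi for wi, xi in zip(w, x))
    gain = [pxi / denom for pxi in Px]  # equals P_new x
    w_new = [wi + gi * error for wi, gi in zip(w, gain)]
    return P_new, w_new
```

Feeding samples one at a time, in any order, yields the same weights as solving the ridge problem on the full dataset, so earlier tasks are not forgotten when later ones arrive. For the samples `([1, 0], 1)`, `([0, 1], 2)`, `([1, 1], 3.5)` with `ridge=1.0`, both routes give weights `[1.0, 1.5]`.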
100. 【2605.29588】Brain-IT-VQA: From Brain Signals to Answers
Link: https://arxiv.org/abs/2605.29588
Authors: Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Keywords: Decoding visual content, Decoding visual, fMRI signals recorded, visual question answering, person views images
Comments:
Click to view abstract
Abstract:Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
101. 【2605.29583】BitC-3DGS: High-Capacity 3D Gaussian Splatting Watermarking via Bit Compression
Link: https://arxiv.org/abs/2605.29583
Authors: Yuquan Bi, Baosheng Yu, Yingke Lei, Jianwei Yang, Hongsong Wang, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, enabling reliable identification, embed rich information, asset pipelines, authentication codes
Comments:
Click to view abstract
Abstract:High-capacity watermarking is necessary for 3D Gaussian Splatting (3DGS) assets to embed rich information (e.g., ownership, provenance, and authentication codes), enabling reliable identification and integrity verification in large-scale 3D asset pipelines. Existing bit-to-token watermarking methods based on a pre-trained text encoder are limited to 77-bit messages due to CLIP's fixed 77-token context length, as tokens beyond this limit are unsupported by learned positional embeddings. To address this limitation, we introduce BitC-3DGS, a bit-compression framework that encodes multiple message bits per token. It employs a bit-compressed tokenization scheme that encodes multiple bits within the same chunk into a single semantic token. To enable recovery of the compressed information, it further introduces a dual-branch architecture for joint chunk decompression and bit decoding, along with a hard-message sampling strategy to improve combinatorial coverage during decoder training. Extensive experiments on the Blender and LLFF datasets demonstrate the effectiveness of BitC-3DGS for high-capacity watermarking, achieving high message recovery accuracy and rendering fidelity. For example, it supports 128-bit message capacity with recovery accuracy comparable to that of 64-bit messages in recent state-of-the-art methods.
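The capacity argument can be made concrete with a small sketch: if each token carries a chunk of several bits instead of one, a 128-bit message fits within a 77-token context. The code below packs fixed-size bit chunks into integer chunk indices and unpacks them again. It only illustrates the counting. The mapping from chunk indices to semantic tokens and the learned dual-branch decoder are not described in the abstract and are not modelled here, and the function names are hypothetical.

```python
def bits_to_tokens(bits, chunk_size):
    """Pack a list of 0/1 bits into one integer chunk index per token."""
    if len(bits) % chunk_size:
        raise ValueError("message length must be a multiple of chunk_size")
    tokens = []
    for start in range(0, len(bits), chunk_size):
        value = 0
        for bit in bits[start:start + chunk_size]:
            value = (value << 1) | bit  # most significant bit first
        tokens.append(value)
    return tokens


def tokens_to_bits(tokens, chunk_size):
    """Invert bits_to_tokens: expand each chunk index back into its bits."""
    bits = []
    for value in tokens:
        for shift in range(chunk_size - 1, -1, -1):
            bits.append((value >> shift) & 1)
    return bits
```

With `chunk_size=2`, a 128-bit message becomes 64 tokens, each taking one of four values, which is below the 77-token limit that caps one-bit-per-token schemes.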
102. 【2605.29579】ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation
Link: https://arxiv.org/abs/2605.29579
Authors: Shizhe Zhou, Bohan Jia, Kai Wu, Yan Shen, Tongyun Li, Yuyang Wu, Shaohui Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved rapid progress, vision-language understanding, producing responses, visual input, achieved rapid
Comments:
Abstract:While multimodal large language models (MLLMs) have achieved rapid progress in vision-language understanding, they remain prone to multimodal hallucinations, producing responses that are inconsistent with the visual input. Existing benchmarks predominantly focus on detecting hallucination outcomes rather than evaluating the underlying causes of these failures. Moreover, many benchmarks rely on simplistic scenarios and limited evaluation formats that no longer challenge state-of-the-art models. To address these limitations, we introduce ReactBench, a cause-driven hallucination benchmark featuring multiple tasks and an exam-style evaluation format. By generating adversarial images and hallucination-inducing queries, ReactBench introduces four targeted tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These tasks systematically expose co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond standard accuracy-based evaluation, we leverage Chain-of-Thought reasoning to identify fine-grained sub-causes of hallucination within each task. Extensive evaluations reveal that current MLLMs remain notably vulnerable to cause-specific hallucination triggers, demonstrating the value of ReactBench as a systematic and interpretable testbed for diagnosing and improving multimodal model robustness. The project page is available at this https URL.
103. 【2605.29577】Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning
Link: https://arxiv.org/abs/2605.29577
Authors: Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: pretrained vision-language models, adapting pretrained vision-language, vision-language models, unifies perception, promising framework
Comments:
Abstract:Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
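The data side of the inverse dynamics objective can be sketched as follows. This is a minimal illustration, assuming that actions are relative displacements so that the action for a time-reversed pair is simply the negated action; the abstract does not state how pseudo-reversed labels are built, and `make_idm_pairs` is a hypothetical helper name.

```python
def make_idm_pairs(observations, actions):
    """Build (current, future, action) training triples for inverse dynamics.

    observations: list of T+1 observations (any objects, e.g. image features)
    actions:      list of T actions, each a list of floats (assumed to be deltas)
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    pairs = []
    for t, action in enumerate(actions):
        cur, nxt = observations[t], observations[t + 1]
        # Forward pair: predict the action that led from cur to nxt.
        pairs.append((cur, nxt, list(action)))
        # Pseudo-reversed pair (assumption): swap the frames, negate the delta.
        pairs.append((nxt, cur, [-a for a in action]))
    return pairs


demo = make_idm_pairs(["o0", "o1", "o2"], [[1.0, 0.0], [0.0, -2.0]])
for cur, nxt, act in demo:
    print(cur, "->", nxt, act)
```

Because only observation-action pairs are consumed, a sketch like this needs no extra annotations, which matches the constraint stated in the abstract.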
104. 【2605.29575】Optimizing Latent Representations for Robust Building Damage Assessment Onboard Earth Observation Satellites
Link: https://arxiv.org/abs/2605.29575
Authors: Thomas Goudemant, Benjamin Francesconi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: support emergency response, Rapid identification, prioritize interventions, identification of damaged, natural disasters
Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2026), Jun 2026, Denver, United States
Abstract:Rapid identification of damaged buildings after natural disasters or in war zones is crucial to support emergency response and prioritize interventions. Earth Observation constellations provide timely, large-scale coverage, but actionable information is often delayed by data downlink constraints, on-ground processing, and human interpretation. Reducing this latency is essential to improve decision-making responsiveness. In this work, we propose an original AI-based system that enables object-level building damage assessment (localization and damage classification) directly onboard satellites from pre-disaster and post-disaster high-resolution optical imagery. Available pre-disaster images are encoded on the ground into compact latent representations, transmitted to the satellite, and compared onboard with newly acquired post-event observations. Leveraging AI interpretation capabilities and increasing processing capabilities onboard satellites, the proposed design enables processing directly at the data source, reducing the amount of information to be downlinked while preserving task-relevant content and improving overall system responsiveness. We explore the design space through a systematic benchmark of onboard-compatible variants, analyzing the impact of siamese processing, cross-attention, latent-space compression, and robustness-oriented data augmentation. Experiments on the xBD dataset demonstrate reliable and robust damage assessment under misalignment, with minimal performance degradation under strong compression.
105. 【2605.29570】DefSynUS: Real-time Patient-specific Intrahepatic Vessel Identification via Deformation-Aware CT-US Domain Adaptation
Link: https://arxiv.org/abs/2605.29570
Authors: Karl-Philippe Beaudet (MIMESIS, UNISTRA), Yordanka Velikova (TUM), Sidaty El Hadramy (MIMESIS, Unibas), Nassir Navab (TUM), Philippe Cattin (Unibas), Juan Verde (MIMESIS, UNISTRA, IHU Strasbourg), Stéphane Cotin (MIMESIS, UNISTRA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Laparoscopic ultrasound, enhances the safety, Purpose, LUS, safety of liver
Comments:
Abstract:Purpose: Laparoscopic ultrasound (LUS) enhances the safety of liver surgery by visualizing intrahepatic vessels in real-time. Still, vessel identification remains difficult due to probe constraints, complex vascular structure, and tissue deformation. This work aims to enable real-time, patient-specific vessel identification that remains robust under deformation through deformable ultrasound augmentation. Methods: Preoperative CT vessel annotations are used to generate synthetic ultrasound data via optimized physics-based rendering, coupled with domain adaptation to intraoperative ultrasound. The rendering is trained end-to-end for vessel identification and patient-specificity, eliminating the need for preoperative ultrasound. A deformation-aware augmentation simulates realistic intraoperative motion and tissue deformation within the rendering pipeline. Results: In abdominal phantom and limited clinical feasibility experiments (single-case clinical evaluation), the framework achieved real-time intrahepatic vessel-branch identification, maintaining performance under new patient poses. Conclusion: The framework enables real-time vessel identification without preoperative ultrasound and supports technical feasibility, but multi-patient validation is still needed for generalizability and clinical feasibility.
106. 【2605.29565】From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments
Link: https://arxiv.org/abs/2605.29565
Authors: Ji-Hoon Hwang, Jisung Bae, Dong-Wook Kim, Yeonkyu Lee, Seung-Woo Seo
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: unstructured outdoor environments, typically adapting vision, vision foundation models, adapting vision foundation, Vision-based approaches
Comments: 8 pages, 5 figures
Abstract:Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.
107. 【2605.29563】Planning with the Views via Scene Self-Exploration
Link: https://arxiv.org/abs/2605.29563
Authors: Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: moves ahead, camera move, view, multi-turn plans, VLMs predict
Comments:
Abstract:Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.
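The view graph idea can be sketched with a small example. It assumes that views are hashable ids and that each exploration step is recorded as a `(view, action, next_view)` triple; `build_view_graph` and `shortest_plan` are hypothetical names, and the breadth-first search stands in for whatever distillation tasks the paper actually derives from the graph.

```python
from collections import deque


def build_view_graph(trajectories):
    """Merge all trajectories, successful or not, into one view -> {action: next_view} map."""
    graph = {}
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph.setdefault(view, {})[action] = next_view
            graph.setdefault(next_view, {})
    return graph


def shortest_plan(graph, start, target):
    """Breadth-first search for the shortest action sequence from start to target."""
    if start not in graph:
        return None
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, plan = queue.popleft()
        if view == target:
            return plan
        for action, nxt in graph[view].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))
    return None  # target not reachable from start in the explored graph


# Two toy trajectories; neither one alone connects view "A" to view "E".
trajectories = [
    [("A", "left", "B"), ("B", "forward", "C")],
    [("C", "up", "E")],
]
graph = build_view_graph(trajectories)
print(shortest_plan(graph, "A", "E"))
```

The toy case shows why failed trajectories remain useful: once merged, they supply edges for plans that no single rollout completed.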
108. 【2605.29562】VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models
Link: https://arxiv.org/abs/2605.29562
Authors: Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong potential, general-purpose robotic manipulation, models have shown, shown strong, strong potential
Comments:
Abstract:Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
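The retrieve-then-fuse step can be sketched as below. This is only an illustration under stated assumptions: each memory is a `(key_vector, adapter_delta)` pair with the adapter flattened to a list of floats, retrieval uses cosine similarity, and fusion is a softmax-weighted average of the top-k adapters. The abstract does not specify any of these choices, and `retrieve_and_fuse` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_and_fuse(query, memory, top_k=2, temperature=0.1):
    """Pick the top_k most similar memories and blend their adapter deltas."""
    scored = sorted(
        ((cosine(query, key), delta) for key, delta in memory),
        key=lambda item: item[0],
        reverse=True,
    )[:top_k]
    best = scored[0][0]
    # Softmax over similarities (shifted by the maximum for numerical stability).
    weights = [math.exp((score - best) / temperature) for score, _ in scored]
    total = sum(weights)
    weights = [w / total for w in weights]
    size = len(scored[0][1])
    fused = [sum(w * delta[i] for w, (_, delta) in zip(weights, scored)) for i in range(size)]
    return weights, fused


# Toy memory bank: three task keys with two-parameter "adapters".
memory = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0]), ([1.0, 1.0], [0.5, 0.5])]
weights, fused = retrieve_and_fuse([1.0, 0.0], memory)
print(weights, fused)
```

Keeping adapters as separate entries and fusing them only at inference time is what would preserve the modularity the abstract emphasizes.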
109. 【2605.29558】TAE: Target-aware enhancer for nighttime UAV tracking
Link: https://arxiv.org/abs/2605.29558
Authors: Yanyan Chen, Ruigang Fu, Yu Song, Ping Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Severe image degradation, core bottleneck preventing, bottleneck preventing all-day, preventing all-day applications, UAV-based single object
Comments: Accepted at ICIP 2026. Dataset is available at: [this https URL](https://github.com/Fu0511/DarkSOT-Dataset)
Abstract:Severe image degradation under low-light nighttime conditions constitutes a core bottleneck preventing all-day applications for UAV-based single object tracking. Existing image enhancement methods often struggle to distinguish between target and background regions, which can easily lead to amplified background noise or compromised target features. To overcome this limitation, we propose TAE, a target-aware low-light enhancement framework tailored for nighttime object tracking. Guided explicitly by weak supervisory signals from tracking bounding boxes, the framework performs region-aware enhancement to ensure operations focus on the target area. It further adopts an adaptive RGB multi-curve fusion mechanism to achieve refined modeling and adaptive adjustment across different regions. To facilitate research in this domain, we also contribute DarkSOT, a new benchmark for nighttime UAV tracking, comprising 268 sequences across 9 target categories. Experimental results on DarkSOT and UAVDark135 demonstrate that TAE significantly improves tracking performance in low-light nighttime scenarios, exhibiting strong robustness and generalization. The DarkSOT dataset is available at this https URL.
110. 【2605.29549】Learning Representations from 3D Gaussian Splats
Link: https://arxiv.org/abs/2605.29549
Authors: Julia Farganus, Krzysztof Żurawicki, Arkadiusz Gaweł, Weronika Jakubowska, Halina Kwaśnicka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, recent approach, Gaussian Splatting datasets, dedicated Gaussian Splatting, Gaussian
Comments: 5 figures, 15 pages
Abstract:3D Gaussian Splatting (3DGS) is a recent approach for scene rendering. Although primarily designed for view synthesis, its potential for scene understanding tasks remains underexplored. In this work, we conduct a comparative evaluation of various geometric deep learning architectures for the classification of 3D scenes represented using Gaussian Splatting. We benchmark point-based and graph-based models across both traditional point cloud datasets and dedicated Gaussian Splatting datasets. Scenes are embedded into latent representations, which are evaluated through end-to-end classification, linear probing, and clustering analysis. Our study provides insight into the suitability of different geometry-aware architectures and input feature configurations for learning effective 3D Gaussian Splat representations. The results highlight consistent differences between architectural families and reveal the impact of Gaussian-specific attributes on the quality of representation.
111. 【2605.29539】GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection
Link: https://arxiv.org/abs/2605.29539
Authors: Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Few-Shot Object Detection, Cross-Domain Few-Shot Object, Object Detection, Few-Shot Object, promising zero-shot generalization
Comments: CVPR 2026 Workshop
Abstract:Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce a generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at CDiscover (this https URL).
112. 【2605.29538】RadioFormer3D: Weakly Supervised 3D Radio Map Estimation in Low-Altitude Airspace via Generative Modeling
Link: https://arxiv.org/abs/2605.29538
Authors: Zheng Fang, Junjie Liu, Kangjun Liu, Jianguo Zhang, Yaowei Wang, Ke Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: characterize signal propagation, radio map estimation, three-dimensional environments, map estimation
Comments:
Abstract:With the emergence of wireless applications in three-dimensional environments, such as the low-altitude airspace and 3D heterogeneous networks, radio map estimation is increasingly required to characterize signal propagation across both horizontal and vertical dimensions. However, extending radio map estimation from 2D to 3D remains challenging due to increased spatial sparsity and limited supervision across continuous altitudes. In this paper, we propose RadioFormer3D, a specialized model for volumetric spectrum reconstruction under weak supervision. Building on the dual-stream, multi-granularity fusion architecture of RadioFormer, RadioFormer3D introduces a Fourier-based sampling encoder and a volumetric decoder to efficiently process sparse measurements in 3D space. To alleviate the lack of vertical supervision, we propose the Joint Spectrum Integrity Loss, which integrates volume-level pseudo-label supervision, map-level geometry-aware radio rendering, and pixel-level localized constraints within a unified optimization scheme. This design enables the model to capture complex vertical structural relationships more effectively under sparse supervision. Extensive experiments across several radio map datasets show that RadioFormer3D achieves superior overall performance compared to representative existing methods. In particular, it demonstrates improved reconstruction quality at unlabeled altitudes while maintaining a favorable trade-off between accuracy and inference efficiency, positioning it as a highly promising solution for future 3D environment-aware wireless networks.
113. 【2605.29531】Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion
Link: https://arxiv.org/abs/2605.29531
Authors: S. Sutharya, Remya K. Sasi
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: partially manipulated speech, short synthesised segment, Audio deepfake detection, genuine utterance, poses a harder
Comments: 13 pages, 5 figures, 11 tables
Abstract:Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
114. 【2605.29509】KGEdit: Ambiguity-Aware Knowledge Graphs for Training-Free Precise Video Generation and Editing
Link: https://arxiv.org/abs/2605.29509
Authors: Mingshu Cai, Miao Zhang, Chenghe Yang, Yixuan Li, Osamu Yoshie, Yuya Ieiri
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: training-free video generation, recent years, training-free video, progressed remarkably, video generation
Comments:
Abstract:In recent years, training-free video generation has progressed remarkably. However, when handling complex textual instructions, existing methods still suffer from semantic ambiguity, incorrect concept binding, and cross-frame inconsistency. To address these issues, we propose KGEdit, a structured semantic control framework for text-to-video (T2V) diffusion models. Specifically, we first construct an ambiguity-aware knowledge graph (AAKG) to disentangle and disambiguate the input prompt, converting it into four types of structured semantics: identity, relation, attribute, and negative constraints. We then design a structured semantic injection module (SSIM) to inject these semantic signals into key layers of the diffusion Transformer, enabling fine-grained semantic control. In addition, we introduce a temporal-aware semantic control (TASC) module that dynamically schedules semantic objectives according to the stage-wise characteristics of the denoising process, further improving semantic alignment and temporal consistency. Experiments show that KGEdit outperforms existing methods in editing precision and temporal stability, while offering higher efficiency and controllability in text-driven interaction scenarios.
115. 【2605.29505】ESAM++: Efficient Online 3D Perception on the Edge
Link: https://arxiv.org/abs/2605.29505
Authors: Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: edge computing scenarios, essential for robotics, autonomous systems, privacy is crucial, computing scenarios
Comments:
Abstract:Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference with a 2 times smaller model size compared to ESAM, enabling practical deployment on edge devices.
116. 【2605.29498】Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting
Link: https://arxiv.org/abs/2605.29498
Authors: Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: adapting large language, large language models, adaptation distribution differs, models original training, widely used fine-tuning
Comments: In Submission
Abstract:Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
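The regularizer described in this abstract can be sketched for a single token position. The sketch follows the three stated steps (drop the ground-truth token, renormalize, take a KL divergence over the remaining vocabulary); the KL direction, base relative to adapted, is an assumption, and `masked_kl` is a hypothetical name rather than the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_kl(base_logits, adapted_logits, target):
    """KL(base || adapted) over the vocabulary with the target token removed."""
    p = softmax(base_logits)
    q = softmax(adapted_logits)
    # Step 1: remove the ground-truth token from both distributions.
    p_rest = [x for i, x in enumerate(p) if i != target]
    q_rest = [x for i, x in enumerate(q) if i != target]
    # Step 2: renormalize what remains so each side sums to one again.
    p_sum, q_sum = sum(p_rest), sum(q_rest)
    p_rest = [x / p_sum for x in p_rest]
    q_rest = [x / q_sum for x in q_rest]
    # Step 3: KL divergence over the non-target vocabulary only.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_rest, q_rest) if pi > 0)


base = [2.0, 1.0, 0.5, -1.0]
boosted_target = [7.0, 1.0, 0.5, -1.0]   # only the target logit (index 0) moved
shifted_others = [2.0, 3.0, 0.5, -1.0]   # a non-target logit moved
print(masked_kl(base, boosted_target, target=0))
print(masked_kl(base, shifted_others, target=0))
```

The first case illustrates the claim that the penalty does not oppose the cross-entropy signal: raising only the target logit leaves the renormalized non-target distribution unchanged, so the penalty stays near zero, while reshuffling the alternatives is penalized.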
117. 【2605.29496】On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
Link: https://arxiv.org/abs/2605.29496
Authors: Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: frontier vision-language models, remain comparatively limited, Post-training has greatly, greatly improved reasoning, perception remain comparatively
Comments: Project: [this https URL](https://asymmetric-vlm-post-training.github.io/)
Abstract:Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
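The token-imbalance effect in SFT can be made concrete with a small sketch. It uses segment-balanced averaging, a static simplification chosen here for illustration and not the paper's dynamic reweighting scheme, whose details the abstract does not give; `balanced_loss` and the segment labels are hypothetical.

```python
def mean_loss(token_losses):
    """Plain token-level average, as in standard SFT."""
    return sum(token_losses) / len(token_losses)


def balanced_loss(token_losses, segment_labels):
    """Average the loss within each segment first, then average across segments."""
    sums, counts = {}, {}
    for loss, segment in zip(token_losses, segment_labels):
        sums[segment] = sums.get(segment, 0.0) + loss
        counts[segment] = counts.get(segment, 0) + 1
    per_segment = [sums[s] / counts[s] for s in sums]
    return sum(per_segment) / len(per_segment)


# Toy chain of thought: 2 perception tokens (high loss), 8 reasoning tokens (low loss).
losses = [2.0, 2.0] + [0.5] * 8
labels = ["perception"] * 2 + ["reasoning"] * 8
print(mean_loss(losses))
print(balanced_loss(losses, labels))
```

Under the plain average, the two perception tokens carry only a fifth of the weight; balancing by segment gives perception and reasoning equal say regardless of how many tokens each occupies.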
118. 【2605.29488】AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Link: https://arxiv.org/abs/2605.29488
Authors: Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Conditional human motion, Conditional human, human motion generation, motion generation remains, vision and robotics
Comments:
Abstract:Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
119. 【2605.29471】V2XCrafter: Learning to Generate Driving Scene Across Agents
Link: https://arxiv.org/abs/2605.29471
Authors: Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scarce annotated real-world, enhance driving safety, diverse driving conditions, multi-agent collaborative perception, Collaborative driving systems
Comments:
Abstract:Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scene across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments have shown that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively enhancing the downstream collaborative 3D object detection tasks.
120. 【2605.29462】Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
Link: https://arxiv.org/abs/2605.29462
Authors: Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, enabling unified inference, emergence of Large, Large Vision-Language, substantially expanded model
Comments:
Abstract:The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.
121. 【2605.29461】FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation
Link: https://arxiv.org/abs/2605.29461
Authors: Zekang Zhang, Guangyu Gao, Youyun Tang, ChengJing Wu, Xiaochao Qu, Chi Harold Liu, Jianbo Jiao, Yunchao Wei, Luoqi Liu, Ting Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recently advanced rapidly, large language models, coupling large language, recently advanced, advanced rapidly
Comments: 18 pages, accepted by ICML 2026
Abstract:LLM-conditioned segmentation has recently advanced rapidly by coupling large language models with iterative mask generation frameworks. However, we identify a persistent failure mode in current propose-then-select pipelines. Although high-quality mask candidates are often generated, the final prediction may fail to match the given linguistic condition. This failure arises because language semantics are typically used as static prompts or post-hoc matching signals, rather than participating in the iterative mask generation process. Through systematic analysis, we show that many errors stem from semantic misalignment rather than poor mask quality. To address this issue, we propose FlowSeg, which introduces dynamic semantic guidance via a bidirectional semantic flow between intermediate decoding states and LLM-derived condition embeddings throughout the generation process. Language conditions actively guide mask refinement at each stage, while condition embeddings are progressively updated by emerging visual evidence. This design yields semantically grounded mask representations and visually aligned language conditions, enabling more reliable matching. We further incorporate a lightweight boundary-aware refinement to selectively enhance uncertain regions without perturbing confident interiors. Extensive experiments on referring expression segmentation and reasoning segmentation tasks demonstrate that FlowSeg consistently improves language-mask alignment and achieves state-of-the-art performance. Project page: this https URL
122. 【2605.29460】FedSmoothLoRA: Toward Smoother and Faster Convergence in Federated Low-Rank Adaptation
Link: https://arxiv.org/abs/2605.29460
Authors: Zehao Wang, Guanglei Yang, Yihan Zeng, Hang Xu, Hongzhi Zhang, Wangmeng Zuo, Chun-Mei Feng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Low-Rank Adaptation, preserving data locality, inter-round state mismatch, fine-tuning of foundation, efficient solution
Comments: 26 pages, 4 figures
Abstract:Federated fine-tuning of foundation models with Low-Rank Adaptation (LoRA) provides an efficient solution for reducing communication and computation costs while preserving data locality. However, the direct combination of FedAvg and LoRA suffers from three key issues: limited update space, which restricts the model's effective learning capacity; inter-round state mismatch, which disrupts cross-round local optimization continuity; and a client-agnostic starting state, which slows local convergence on clients. Although recent methods mitigate the limited update space issue by merging LoRA updates into the backbone across communication rounds, inter-round state mismatch and the client-agnostic starting state remain insufficiently addressed. To address these issues, we propose FedSmoothLoRA, a federated LoRA tuning framework that preserves the enlarged update space, improves cross-round local optimization continuity, and provides a client-aware starting state for local training. At each communication round, FedSmoothLoRA constructs the local LoRA initialization using two matrices: a Round-Matching matrix that preserves cross-round local state continuity, and a Gradient-Aligned matrix that provides client-specific optimization guidance from gradient signals estimated on local data. Together, these designs enable smoother and faster convergence. Extensive experiments on image classification and natural language generation tasks demonstrate that FedSmoothLoRA consistently outperforms existing federated LoRA tuning methods. Code: this https URL
123. 【2605.29455】Uni-RCM: Unified Reference-guided Cross-modal Mapping for Multi-Class Anomaly Detection
Link: https://arxiv.org/abs/2605.29455
Authors: Yangchen Wu, Huiqiang Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: Multi-modal industrial anomaly, limiting practical scalability, fundamentally limiting practical, Multi-modal industrial, industrial anomaly detection
Comments: This work has been submitted to the IEEE for potential publication
Abstract:Multi-modal industrial anomaly detection typically relies on separate models for each product category, fundamentally limiting practical scalability. When shifting to a unified paradigm that handles diverse classes simultaneously, detection accuracy often degrades due to inter-class interference and feature manifold confusion. To overcome these challenges, we propose a Unified Reference-guided Cross-modal Mapping framework, named Uni-RCM. At its core, we propose a reference guide block that dynamically filters out category-specific noise by introducing a learnable reference feature, which captures the commonalities across different modalities. In addition, an offline residual quantizer is proposed to characterize the normal distribution with multiple cascaded codebooks. Extensive evaluations on the MVTec-3D AD dataset demonstrate state-of-the-art performance in the challenging multi-class setting, in both image-level detection and pixel-level localization.
124. 【2605.29452】Comparative evaluation of photogrammetric reconstruction methods and 3D Gaussian Splatting for road surface roughness analysis
Link: https://arxiv.org/abs/2605.29452
Authors: Marouane Elmegdar, Teng Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: traditional sensor-based techniques, road surface assessment, Gaussian Splatting, alternative to traditional, traditional sensor-based
Comments: accepted by RSMIP 2026
Abstract:Image-based 3D reconstruction offers a low-cost alternative to traditional sensor-based techniques for road surface assessment. This study compares four reconstruction pipelines--COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting (3DGS)--to evaluate their ability to estimate road surface roughness from smartphone imagery. All point clouds were processed in CloudCompare using a consistent workflow involving orientation alignment, segmentation, normal estimation, and roughness computation at neighborhood radii of 0.2, 0.4, and 0.6 model units. The results show that COLMAP provides the highest sensitivity to micro-texture, while Meshroom yields balanced reconstructions with moderate roughness variation. Metashape produces the smoothest geometry due to its internal filtering, and 3DGS captures visible irregularities but exhibits higher noise and lower density. The comparison demonstrates that open-source pipelines are viable for relative roughness evaluation, offering a practical approach for low-cost pavement monitoring.
125. 【2605.29448】How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Link: https://arxiv.org/abs/2605.29448
Authors: Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Keywords: Neural scaling laws, Vendi Score, scaling laws appraise, Neural scaling, Vendi
Comments: 75 pages
Abstract:Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
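For reference, the Vendi Score mentioned in this abstract is usually defined as the exponential of the von Neumann ("quantum") entropy of a normalized similarity matrix. The notation below ($K$, $n$, $\lambda_i$) is chosen for illustration and may differ from the paper's own:

```latex
\mathrm{VS}(K) \;=\; \exp\!\left(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\right),
\qquad
\lambda_1, \dots, \lambda_n \ \text{the eigenvalues of } K/n ,
```

where $K$ is an $n \times n$ positive semidefinite similarity matrix with unit diagonal and $0 \log 0$ is taken as $0$. Under this definition the score equals 1 when all items are identical ($K$ is the all-ones matrix) and $n$ when all items are mutually dissimilar ($K = I$).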
126. 【2605.29447】Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents
Link: https://arxiv.org/abs/2605.29447
Authors: Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: hindering real-world deployment, Robustness-driven Trajectory Synthesis, propose Robustness-driven Trajectory, advanced rapidly, hindering real-world
Comments: ICML 2026 Spotlight. 36 pages, 19 figures, includes appendix
Abstract:While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis. GUI-RobustEval contains $1,216$ executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates $800k$ high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a $47.4\%$ success rate and a $33.8\%$ All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at this https URL.
127. 【2605.29429】One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation
Link: https://arxiv.org/abs/2605.29429
Authors: Sanghyun Jo, Seo Jin Lee, Seohyung Hong, Yoorim Gang, Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cell-specific datasets suffer, datasets suffer severe, foundation models overcome, densely packed instances, interactive foundation models
Comments: Accepted to MICCAI 2026 (Early Accept)
Abstract:Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: this https URL
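The coverage-maximizing selection in step (2) of this abstract can be illustrated as a farthest-point choice. The sketch below is plain Python and only an illustration: it assumes "most spatially distant" means the reliable point with the largest distance to its nearest already-used prompt, and the function `next_prompt` and the toy coordinates are hypothetical, not taken from the paper.

```python
import math

def next_prompt(reliable_points, used_prompts):
    """Return the reliable point farthest from its nearest already-used prompt.

    Assumption: "most spatially distant" is read as max-min Euclidean distance.
    `used_prompts` must be non-empty (it always contains the initial user click).
    """
    if not reliable_points:
        return None
    return max(
        reliable_points,
        key=lambda p: min(math.dist(p, q) for q in used_prompts),
    )

# Toy example: one user click at the origin, then two automatic expansions.
prompts = [(0.0, 0.0)]
candidates = [(1.0, 1.0), (5.0, 5.0), (0.0, 5.0)]
for _ in range(2):
    chosen = next_prompt(candidates, prompts)
    prompts.append(chosen)
    candidates.remove(chosen)
```

In this toy run the far corner is picked first, and the second pick is the candidate that is far from both existing prompts, which is the behavior one would want for spreading prompts over an image.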
128. 【2605.29417】ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects
Link: https://arxiv.org/abs/2605.29417
Authors: Deokmin Hwang, Minseok Song, Daehyung Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: deformable objects, study addresses, point-cloud observations, observations toward precise, reconstruction
Comments: 6 pages
Abstract:This study addresses the partial-to-complete geometry reconstruction of deformable objects (DOs) from point-cloud observations toward precise DO manipulation. Recent DO reconstruction approaches often adopt implicit neural representations (INRs) to model continuous surfaces as well as capture structural variability. However, these methods typically rely on object-specific shape priors, which improve training stability but limit generalization. To address this, we introduce ParCo-SDF, a two-stage partial-to-complete signed distance field (SDF) reconstruction framework consisting of temporal geometry encoding followed by FiLM-conditioned SDF prediction. The temporal encoder captures structural similarity across DO sequences, enabling prior-free stable training. FiLM-based conditioning preserves reconstruction expressivity while reducing network complexity. We evaluate the proposed method against a state-of-the-art DO surface reconstruction baseline on a rubber band manipulation dataset, demonstrating robust and high-fidelity reconstruction under severe occlusions.
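FiLM (feature-wise linear modulation) conditioning scales and shifts each feature channel with parameters computed from a conditioning vector. The following is a minimal plain-Python sketch of that general idea, assuming the scale and shift each come from a single linear layer; `linear`, `film`, and the toy weights are hypothetical and do not reflect the actual ParCo-SDF architecture.

```python
def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise modulation: out[i] = gamma[i] * features[i] + beta[i]."""
    gamma = linear(cond, w_gamma, b_gamma)  # per-channel scale from the condition
    beta = linear(cond, w_beta, b_beta)     # per-channel shift from the condition
    return [g * h + b for h, g, b in zip(features, gamma, beta)]

# Toy example with two feature channels and a two-dimensional condition.
features = [1.0, 2.0]
cond = [1.0, 0.0]
out = film(
    features,
    cond,
    w_gamma=[[2.0, 0.0], [0.5, 0.0]],
    b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]],
    b_beta=[0.0, 0.0],
)
```

Here the condition yields a scale of (2.0, 0.5) and a shift of (0.0, 1.0), so the same backbone features are modulated differently for different conditions without changing the backbone weights.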
129. 【2605.29416】3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding
Link: https://arxiv.org/abs/2605.29416
Authors: Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable progress, models have achieved, critical limitation, scene understanding, achieved remarkable
Comments:
Abstract:Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.
130. 【2605.29402】Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge
Link: https://arxiv.org/abs/2605.29402
Authors: Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: multimodal large language, limited context length, egocentric videos remains, videos remains challenging, long-form egocentric videos
Comments:
Abstract:Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.
131. 【2605.29390】Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation
Link: https://arxiv.org/abs/2605.29390
Authors: Jungmin Ko, Jungwon Park, Jimyeong Kim, Changin Choi, Wonseok Lee, Wonjong Rhee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generating high-quality images, increasingly capable, capable of generating, generating high-quality, negative guidance
Comments: Preprint
Abstract:Text-to-image (T2I) models have become increasingly capable of generating high-quality images. Yet, enforcing the explicit absence of a specified object or attribute remains a fundamentally challenging problem. Existing approaches, including prompt negation, post-hoc editing, and negative guidance, remain insufficient for explicit concept suppression, often failing to remove the target concept or degrading overall image quality. To this end, we propose Orthogonal Negative Guidance in attention feature space, a training-free method that operates in the attention output space of MM-DiT-based T2I transformers. Our method orthogonalizes negative-prompt attention features with respect to positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts while preserving desired semantics. Experiments on FLUX-dev and FLUX-schnell show that our method achieves favorable trade-offs between concept suppression, prompt alignment, and image quality. In human evaluation, our method outperforms the second-best baseline by 18.78%. We further show that our method supports multi-concept suppression and adjustable concept suppression.
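The orthogonalization step described in this abstract can be sketched on plain vectors. This is an illustration only, assuming a standard Gram-Schmidt projection; the function names (`orthogonal_component`, `apply_negative_guidance`) and the `scale` parameter are hypothetical, and the paper's actual operation on MM-DiT attention outputs may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_component(neg, pos):
    """Return the part of `neg` that is orthogonal to `pos` (one Gram-Schmidt step)."""
    denom = dot(pos, pos)
    if denom == 0.0:
        return list(neg)  # nothing to project onto
    coef = dot(neg, pos) / denom
    return [n - coef * p for n, p in zip(neg, pos)]

def apply_negative_guidance(pos, neg, scale=1.0):
    """Subtract only the orthogonal part of the negative feature from the positive one."""
    orth = orthogonal_component(neg, pos)
    return [p - scale * o for p, o in zip(pos, orth)]

# Toy example: the negative feature shares a component with the positive one.
pos = [1.0, 0.0]
neg = [1.0, 1.0]
orth = orthogonal_component(neg, pos)
guided = apply_negative_guidance(pos, neg)
```

Because the subtracted vector has no component along `pos`, the projection of the result onto `pos` is unchanged, which matches the stated goal of suppressing the unwanted concept while preserving the desired semantics.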
132. 【2605.29380】TRACER: Persistent Regularization for Robust Multimodal Finetuning
Link: https://arxiv.org/abs/2605.29380
Authors: Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Mainstream strategies, catastrophic forgetting, Exponential Moving Average, Weighted Moving Average, Moving Average
Comments: ICML 2026
Abstract:Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at this https URL.
133. 【2605.29353】DeepFake Forensics AI: A Multi-Modal Detection and Blockchain-Anchored Evidence Management Platform
Link: https://arxiv.org/abs/2605.29353
Authors: Naisha Minnah
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: AI-generated synthetic media, synthetic media poses, proliferation of AI-generated, poses a critical, critical threat
Comments: 5 pages, 5 figures, 3 tables
Abstract:The proliferation of AI-generated synthetic media poses a critical threat to the integrity of digital evidence in legal and forensic contexts. Existing deepfake detection systems typically address a single modality and provide no mechanism for tamper-proof evidence preservation. We present DeepFake Forensics AI, a unified platform that detects synthetic media across image, video, and audio modalities, identifies generative architecture fingerprints, and anchors forensic evidence immutably on the Ethereum blockchain. Our system trains four independent neural networks from scratch: an EfficientNet-B4 image detector (AUC = 0.9868), a Bidirectional LSTM video detector (AUC = 0.9628), an ECAPA-TDNN audio detector (EER = 18.63%), and a novel GAN fingerprinting module (accuracy = 99.88%) that identifies the generative architecture behind a fake image. Evidence files are hashed with SHA-256, stored on IPFS via Pinata, and registered on-chain via a Solidity smart contract with role-based access control. The platform provides a React frontend and FastAPI backend suitable for deployment in forensic and legal workflows. To our knowledge, this is the first system to unify multi-modal deepfake detection with blockchain-based chain-of-custody management.
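The hashing step named in this abstract can be illustrated with Python's standard `hashlib` module. This is a sketch of SHA-256 hashing only; the helper names are hypothetical, and the IPFS upload and smart-contract registration described in the abstract are not shown.

```python
import hashlib

def sha256_of_bytes(data):
    """Hex-encoded SHA-256 digest of an in-memory byte string."""
    return hashlib.sha256(data).hexdigest()

def sha256_of_file(path, chunk_size=65536):
    """Hex-encoded SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Any change to the evidence bytes produces a different digest.
original = sha256_of_bytes(b"evidence-frame-0001")
tampered = sha256_of_bytes(b"evidence-frame-0002")
```

Storing such a digest in an append-only record is what lets a later reader recompute the hash of the file and detect any modification.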
134. 【2605.29341】WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Link: https://arxiv.org/abs/2605.29341
Authors: Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: large language models, Multimodal large language, decision time, large language, language models
Comments: 25 pages, 8 figures
Abstract:Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
135. 【2605.29339】DMC-CF: Dynamic Multimodal CounterFactual QA benchmark for Causal Reasoning
Link: https://arxiv.org/abs/2605.29339
Authors: Junzhe Zhang, Huixuan Zhang, Guirong Wang, Xingyao Zhang, Pei Liu, Lin Qu, Hu Wei, Xiaojun Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated increasingly powerful, increasingly powerful multimodal, rapid advancement, demonstrated increasingly, increasingly powerful
Comments:
Abstract:With the rapid advancement of multimodal large language models (MLLMs), models have demonstrated increasingly powerful multimodal capabilities. However, whether MLLMs trained through statistical learning can truly understand the causal relationships underlying the real world remains a key research question. In recent years, numerous multimodal causal reasoning datasets have been proposed. Nevertheless, these datasets are either limited in scale or constructed from synthetic images and videos, cartoon-based content, or other non-realistic multimodal sources. To address these limitations, we collect real-world videos and construct DMC-CF-Static, a large-scale benchmark for multimodal causal counterfactual reasoning. Furthermore, to mitigate issues such as data contamination in traditional static evaluation, we represent causal events using causal graphs and propose the Dynamic Graph Intervention (DGI) framework to build the dynamic evaluation benchmark DMC-CF-Dynamic from DMC-CF-Static. Experimental results on the overall DMC-CF, which includes both static and dynamic evaluation benchmarks, demonstrate that the multimodal causal reasoning capabilities of current multimodal large language models in real-world scenarios still require substantial improvement.
136. 【2605.29335】Rethinking FID Through the Geometry of the Reference Dataset
Link: https://arxiv.org/abs/2605.29335
Authors: Yunghee Lee, Byeonghyun Pak
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Fréchet Inception Distance, Fréchet Inception, evaluate image generators, Inception Distance, image generators
Comments: 9 pages, 2 figures. Accepted to ICML 2026 Workshop: Combining Theory and Benchmarks
Abstract:Fréchet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample quality. We show that this mismatch depends in part on the geometry of the reference dataset. In a controlled study across six datasets, distributional density and effective rank significantly explain how FID changes as sample quality improves. Concentrated datasets tend to yield more favorable FID trends, whereas more dispersed datasets can make FID worsen despite better samples. Attribution to precision and recall and ablations with alternative feature spaces and distances support the same conclusion. These results suggest that distributional metrics should be interpreted together with the geometry of the reference dataset for more reliable benchmarking.
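For context, FID is commonly computed as the Fréchet distance between two Gaussians fitted to feature embeddings of the reference and generated images. The subscripts $r$ (reference) and $g$ (generated) below are notation chosen here for illustration, not taken from the paper:

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

The reference dataset enters only through its mean $\mu_r$ and covariance $\Sigma_r$, which is one way to see why properties of the reference distribution, such as how concentrated or dispersed it is, could influence how the score responds to changes in the generated samples.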
137. 【2605.29330】EarthShift: a benchmark for measuring robustness to real-world distribution shifts in Earth observation
Link: https://arxiv.org/abs/2605.29330
Authors: Kelsey Doerksen, Hannah Kerner
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current Earth observation, Earth observation benchmarks, Current Earth, typically measuring generalization, measuring generalization in-distribution
Comments:
Abstract:Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios, such as new time periods, geographies, scales, and sensors. We introduce EarthShift: the first public testbed for benchmarking robustness across multiple realistic distribution shifts encountered in remote sensing. EarthShift enables users to measure distributional robustness by comparing performance in- and out-of-distribution using paired datasets from different sources, temporal windows, geographic locations, and sensors. Our experiments on 8 geospatial foundation models (GFMs) and 11 tasks covering 5 shift types show that GFMs consistently perform 15-20% worse out-of-distribution on average regardless of model architecture, size, pre-training or fine-tuning strategy. We show that GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models. This highlights a need for future research to strive for improvements in distributional robustness, not just performance, which can be benchmarked using EarthShift. We release our code and datasets to provide a testbed to guide future work to create foundation models that are robust and reliable in real-world applications. Code and data for EarthShift are available at: this https URL
138. 【2605.29325】Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding
Link: https://arxiv.org/abs/2605.29325
Authors: Fumiya Tatematsu, Fumihiko Takahashi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: AUTOPILOT Workshop, CCTV footage, accident timing, type from CCTV, impact centroid
Comments: Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 13. Code: [this https URL](https://github.com/fuumin621/cvpr2026-accident-1st-place-solution)
Abstract:We present the 1st-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop, which asks for zero-shot prediction of accident timing, impact centroid, and collision type from CCTV footage. On a frozen Qwen3-VL-32B-Instruct checkpoint we build a three-stage pipeline (full-video joint prediction, time refinement, and single-frame grounding of the impact centroid), run the same pipeline a second time on a 235B Mixture-of-Experts sibling, blend the two outputs 9:1, and finally snap each predicted point onto the nearest vehicle detection. The final system reaches Public LB 0.55469 / Private LB 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358) and wins the challenge. We ablate each component, report the negative results that shaped the final design, and release the code at this https URL.
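The last two steps of this pipeline (blending two model outputs 9:1, then snapping the point onto the nearest vehicle detection) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of which model receives the larger weight, and the use of bounding-box centers as snap targets are all assumptions made for illustration.

```python
import math

def blend_points(p_a, p_b, w_a=0.9, w_b=0.1):
    """Weighted average of two (x, y) predictions.

    Assumption: model A gets weight 9 and model B weight 1; the abstract
    only states that the two outputs are blended 9:1.
    """
    total = w_a + w_b
    return ((w_a * p_a[0] + w_b * p_b[0]) / total,
            (w_a * p_a[1] + w_b * p_b[1]) / total)

def box_center(box):
    """Center of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def snap_to_nearest_detection(point, boxes):
    """Move the point to the center of the closest detection box.

    Assumption: "nearest" is measured by Euclidean distance to box centers.
    If there are no detections, the point is returned unchanged.
    """
    if not boxes:
        return point
    return min((box_center(b) for b in boxes),
               key=lambda c: math.dist(point, c))

# Hypothetical predictions from the two models and two vehicle detections.
blended = blend_points((100.0, 100.0), (200.0, 200.0))
snapped = snap_to_nearest_detection(
    blended, [(90, 90, 130, 150), (300, 300, 340, 340)]
)
print(blended, snapped)
```

Snapping trades a small amount of positional freedom for the guarantee that the predicted impact point lies on a detected vehicle, which is presumably why it is applied last.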
139. 【2605.29324】STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments
Link: https://arxiv.org/abs/2605.29324
Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: GUI agents excel, Mobile GUI agents, Mobile GUI, GUI agents, GUI
Comments: 24 pages, 4 figures, 21 tables
Click to view abstract
Abstract:Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.
140. 【2605.29318】FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
Link: https://arxiv.org/abs/2605.29318
Authors: Donglai Xiang, Vismay Modi, Rishit Dagli, Ty Trusty, Gilles Daviet, Anka He Chen, Nicholas Sharp, David I.W. Levin
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: deformable hyperelastic objects, Reproducing Kernel Particle, deformable hyperelastic, Kernel Particle Method, hyperelastic objects
Comments: CVPR 2026, project website: [this https URL](https://research.nvidia.com/labs/sil/projects/freeform/)
Click to view abstract
Abstract:We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40x training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of finite element method. We show our simulation results on a wide variety of objects in different representations including meshes and Gaussian splats, as well as the application of our method in the downstream task of robot simulation.
141. 【2605.29316】CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation
Link: https://arxiv.org/abs/2605.29316
Authors: Xuangeng Chu, Yuan Gan, Ziteng Cui, Shuhong Liu, Jian Wang, Bing Zhou, Tatsuya Harada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generate synchronized lip, arbitrary audio clips, synchronized lip movements, synchronized lip, facial animation aims
Comments:
Click to view abstract
Abstract:Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.
142. 【2605.29302】ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement
Link: https://arxiv.org/abs/2605.29302
Authors: Jianping Ye, Michel Wedel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: short-form video advertising, digital media landscape, social media, media landscape, e-commerce platforms
Comments:
Click to view abstract
Abstract:The digital media landscape has seen a pervasive shift toward short-form video advertising on TV, social media and e-commerce platforms. The present study focuses on deep saliency prediction for short-form video advertising. Deep saliency models have been used to generate predictions of human eye fixation patterns with the purpose of enhancing user interaction with digital technology and optimizing its design. For video ads, dynamic saliency maps capture where and when viewers are looking, revealing why video ads are effective, and how their content should be optimized. We develop and test a new deep dynamic saliency prediction model called ViASNet (Video Ad Saliency Network), which has an architecture founded on the 3D U-Net, and accommodates the influence of audio and the semantic meaning of scenes. We assess the model's performance on 151 video ads, each seen by about 20 viewers while their eye movements were tracked, and explore the critical factors influencing model performance through ablation experiments. We calculate the entropy of the predicted saliency maps frame-by-frame as a diagnostic tool to identify ads and scenes that fail to engage viewers, and illustrate its use on test data of 15 unseen ads. Our study reveals that ad design and testing can be sped up considerably through automated systems built on deep saliency models such as ViASNet.
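The frame-by-frame entropy diagnostic mentioned in this abstract can be sketched as the Shannon entropy of each saliency map after normalizing it to a probability distribution. This is a minimal illustration, not the authors' code: the function name and the toy 2x2 maps are hypothetical, and the abstract does not say which entropy definition or logarithm base the paper uses.

```python
import math

def saliency_entropy(saliency_map):
    """Shannon entropy (in bits) of a 2D saliency map.

    The map is flattened and normalized so its values sum to 1.
    Assumption: values are non-negative saliency scores.
    """
    values = [v for row in saliency_map for v in row]
    total = sum(values)
    if total <= 0:
        raise ValueError("saliency map must contain positive mass")
    entropy = 0.0
    for v in values:
        if v > 0:
            p = v / total
            entropy -= p * math.log2(p)
    return entropy

# Two toy frames: attention concentrated on one pixel vs. spread evenly.
focused = [[0.0, 0.0], [0.0, 1.0]]
diffuse = [[1.0, 1.0], [1.0, 1.0]]
per_frame = [saliency_entropy(frame) for frame in (focused, diffuse)]
print(per_frame)  # [0.0, 2.0]
```

Low entropy means the predicted attention is concentrated in a few locations, while high entropy means it is dispersed across the frame; how the paper maps these values to viewer engagement is not specified in the abstract.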
143. 【2605.29299】Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models
Link: https://arxiv.org/abs/2605.29299
Authors: Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: vision-language models remain, models remain fragmented, remain fragmented, dental vision-language models, Evaluations
Comments:
Click to view abstract
Abstract:Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. Across 14 typical VLMs, our results reveal an interesting observation: compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.
144. 【2605.29292】Turbulence-Robust Dynamic Object Segmentation with Multi-Signal Priors and SAM2 Refinement
Link: https://arxiv.org/abs/2605.29292
Authors: Bolian Peng, Ying Tang, Xu Liu, Long Sun, Xiaoqiang Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dynamic Object Segmentation, Challenge Track, Dynamic Object, technical report presents, Object Segmentation
Comments:
Click to view abstract
Abstract:This technical report presents our solution for the CVPR 2026 UG2+ Challenge Track 3: Dynamic Object Segmentation in Turbulence (DOST). We design a training-free multi-signal segmentation pipeline that combines pretrained motion estimation, self-supervised semantic priors, background anomaly modeling, manually calibrated proposal fusion, and SAM2-based mask refinement. The method uses RAFT for dense motion responses, DINOv2 for semantic objectness priors, ViBe for training-free background modeling, and pretrained SAM2 for box-prompt mask refinement. Instead of optimizing an end-to-end segmentation network, our system operates entirely in inference mode. This design is suitable for the DOST setting, where severe atmospheric turbulence produces pseudo-motion, blur, and intermittent target visibility, making a single motion cue unreliable. The final submitted masks are evaluated by the official leaderboard, which reports 0.425041 mIoU and 0.457206 mDice. Since no task-specific model training or fine-tuning is performed, stronger learned temporal association, adaptive proposal selection, or task-specific adaptation may further improve the system.
145. 【2605.29287】UniNote: A Unified Embedding Model for Multimodal Representation and Ranking
Link: https://arxiv.org/abs/2605.29287
Authors: Jinghan Zhao, Wenwei Jin, Anqi Li, Jintao Tong, Luya Mo, Jiawei Li, Bin Li, Yao Hu
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: supporting critical industrial, critical industrial workflows, modern content platforms, supporting critical, fundamental part
Comments: Accepted by KDD Ads Track 2026
Click to view abstract
Abstract:Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose \textbf{UniNote}, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
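The cost-efficiency angle of Matryoshka Representation Learning comes from the fact that a prefix of the full embedding is itself a usable embedding. The sketch below shows only that generic serving-time idea (truncate, re-normalize, compare); it is not UniNote's implementation, and the vectors, dimension, and function names are hypothetical.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dimensional item embeddings, served at 2 dimensions.
item_a = [3.0, 4.0, 0.5, -0.2]
item_b = [4.0, 3.0, -0.1, 0.7]
short_a = truncate_and_normalize(item_a, 2)
short_b = truncate_and_normalize(item_b, 2)
similarity = dot(short_a, short_b)
print(short_a, short_b, similarity)
```

Shorter prefixes reduce index size and similarity-computation cost; how much retrieval quality is retained at each length depends on how the model was trained.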
146. 【2605.29260】Deep Psychovisual Image Representations
Link: https://arxiv.org/abs/2605.29260
Authors: Wendi Ma, Aryaman Sharma, Wei Dai, Shekhar S. Chandra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: decouples low-level feature, low-level feature extraction, suggest human vision, human vision decouples, vision decouples low-level
Comments:
Click to view abstract
Abstract:Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using homogeneous stacks of spatial layers, rendering their decision-making processes opaque. In this paper, we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies, which together with complex-valued image representations produces psychovisual-style abstractions. This approach enables the first psychovisual-based deep learning framework, utilizing data-driven spectral filters that learn to encode task-relevant semantic structures within distinct frequency sub-bands. Salience analyses reveal that our psychovisual models extract highly interpretable object parts compared to the amorphous regions produced by regular Convolutional Neural Networks (CNNs). Furthermore, we find that our models are less depth dependent than CNNs for model scaling, since our complex-valued representations and learned abstractions subsume the role of the deep spatial layers. Together, these findings demonstrate that psychovisual coding provides a promising path toward more efficient and transparent vision models.
147. 【2605.29230】Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data
Link: https://arxiv.org/abs/2605.29230
Authors: Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: images typically relies, facial images typically, images typically, includes images, privacy concerns
Comments: 12 pages; 3 figures; 5 tables
Click to view abstract
Abstract:Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.
148. 【2605.29221】An Approach for Thyroid Nodule Analysis Using Thermographic Images
Link: https://arxiv.org/abs/2605.29221
Authors: J. R. González, É. O. Rodrigues, C. P. Damião, C. A. P. Fontes, A. C. Silva, A. C. Paiva, H. Li, C. Du, A. Conci
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: common type, female individuals, cancer, Thyroid, Thyroid cancer
Comments:
Click to view abstract
Abstract:Thyroid cancer is projected to become the second most common type of cancer in female individuals and the third in males by 2030. In general, detecting cancer in its early stages improves the individual's chance of survival. Thermography is a diagnostic tool that has been increasingly used to detect cancer and abnormalities, including those of the thyroid. Various methods to segment and detect hot regions in thermograms and, consequently, to detect suspicious tissues present in these images have been proposed. It is well known that medical diagnosis yields a great deal of information. Thus, physicians have to comprehensively analyse and evaluate this information in a short period of time, which is infeasible in most cases. In this work, we perform a general review of thermography, focusing on thyroid analysis. We propose protocols for image acquisition and an autonomous registration for thyroid images. We also perform analyses of the image data, which include feature extraction, image processing, and a possible approach for classification of healthy or unhealthy patients. In summary, this work presents a pilot project for detection of tumors in our university hospital, which is part of an effort to support preventive medical actions in our endocrinology department. Under some future adjustments, this project will be submitted for approval by the ethics and research committee of Hospital Universitário Antonio Pedro at Universidade Federal Fluminense (HUAP-UFF) and to the Brazilian Ministry of Health Ethical committee under the name: Evaluation of the importance of thermography to aid diagnosis of thyroid nodules of patients in HUAP-UFF (in Portuguese: Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF).
149. 【2605.29220】Motion-guided sparse correction enables expert-quality point tracking across diverse microscopy regimes
Link: https://arxiv.org/abs/2605.29220
Authors: Leonidas Zimianitis, Pasindu Thenahandi, Kai Buckhalter, Dineth Jayakody, Julian O. Kimura, Xinyue Liang, Karen Cunningham, Azeem Ahmad, Balpreet S. Ahluwalia, Sampath Jayarathna, Nikos Chrisochoides, Brandon Weissbourd, Dushan N. Wadduwage
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: non-canonical biological systems, Refinement Interpolation Platform, persistent challenge, Point Location Estimation, microscopy videos remains
Comments:
Click to view abstract
Abstract:Tracking the dynamics of non-canonical biological systems in microscopy videos remains a persistent challenge. Both classical and learning-based trackers depend on expert-reviewed data to be evaluated and adapted, yet exhaustive manual annotation rarely scales to the videos where these tools are needed most. We developed RIPPLE (Refinement Interpolation Platform for Point Location Estimation), which recasts annotation as sparse correction: a user clicks a starting point, RIPPLE proposes a full trajectory, and the user intervenes only where the trajectory drifts. We tested RIPPLE on five challenging microscopy datasets from our laboratories, four from the transparent jellyfish Clytia hemisphaerica and one tracking landmarks on rapidly moving sperm. Across these, RIPPLE matched the quality of exhaustive manual annotation while reducing manual clicks by 3 to 25 times across datasets. RIPPLE thereby fills a missing layer between manual annotation and fully automated tracking, enabling immediate quantification of biological dynamics, method benchmarking, and the production of the gold-standard data needed to adapt future automated microscopy trackers.
150. 【2605.29219】SalsaAgent: A multimodal embodied language model for interactive dance generation
Link: https://arxiv.org/abs/2605.29219
Authors: Payam Jome Yazdian, Zoe Stanley, Angelica Lim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: humanoids involves bidirectional, humanoids involves, involves bidirectional, nonverbal reactivity, language model
Comments:
Click to view abstract
Abstract:Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop. We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline. Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.
151. 【2605.29217】Towards the automated segmentation of epicardial and mediastinal fats: A multi-manufacturer approach using intersubject registration and random forest
Link: https://arxiv.org/abs/2605.29217
Authors: É. O. Rodrigues, A. Conci, F. F. C. Morais, M. G. Pérez
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: health risk factors, coronary artery calcification, atrial fibrillation, carotid stiffness, artery calcification
Comments:
Click to view abstract
Abstract:The amount of fat surrounding the heart is correlated with several health risk factors such as carotid stiffness, coronary artery calcification, atrial fibrillation, atherosclerosis, cancer incidence and others. Furthermore, cardiac fat varies independently of the subject's overall fat, which makes quantitative analysis of these adipose tissues essential. Clinical decision support systems are computer programs capable of evaluating information and providing a corresponding diagnosis or data to complement the physicians' analyses. The aim of this work is to propose a method capable of fully automatically segmenting two types of cardiac adipose tissue that are separated from each other by the pericardium on CT images obtained by the standard acquisition protocol used for coronary calcium scoring. Much effort was devoted to promoting minimal user intervention and ease of reproducibility. The proposed methodology consists of a registration step, which roughly aligns input images to a standard; an extraction of features related to pixels and their surrounding area; and a segmentation step based on data mining classification algorithms that determine whether an incoming pixel is of a certain type. Experiments showed that the achieved mean accuracy for the epicardial and mediastinal fats was 98.4%, with a mean true positive rate of 96.2%. On average, the Dice similarity index was 96.8%.
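The three figures reported in this abstract (accuracy, true positive rate, Dice similarity index) are all derived from the same per-pixel confusion counts. The sketch below uses the standard definitions on flattened binary masks; the function names and the toy masks are hypothetical and unrelated to the paper's data.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def segmentation_scores(pred, truth):
    """Accuracy, true positive rate, and Dice index for a binary segmentation."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Convention chosen here: TPR is 0.0 when there are no positive pixels.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # Convention chosen here: Dice is 1.0 when both masks are empty.
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0
    return {"accuracy": accuracy, "tpr": tpr, "dice": dice}

# Toy flattened masks: 1 marks a fat pixel, 0 marks background.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
scores = segmentation_scores(pred, truth)
print(scores)
```

Accuracy counts true negatives, so it can stay high even when the target region is small and poorly segmented; Dice ignores true negatives, which is why the two are usually reported together.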
152. 【2605.29212】MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality
Link: https://arxiv.org/abs/2605.29212
Authors: Yujin Park, Haejun Chung, Ikbeom Jang
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: modern imaging systems, imaging systems emerges, modern imaging, imaging systems, systems emerges
Comments: 12 pages, 6 figures
Click to view abstract
Abstract:Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.
153. 【2605.29198】Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization
Link: https://arxiv.org/abs/2605.29198
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated strong performance, reinforcement learning methods, including mathematical reasoning, diverse domains, including mathematical
Comments: 21 pages, 11 figures
Click to view abstract
Abstract:Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
154. 【2605.29136】Eulerian Gaussian Splatting using Hashed Probability Pyramids
Link: https://arxiv.org/abs/2605.29136
Authors: Mia Gaia Polansky, George Kopanas, Stephan Garbin, Todd Zickler, Dor Verbin
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: probabilistic splat-based radiance, splat-based radiance field, Gaussian Splatting, replacing heuristic primitive, heuristic primitive manipulation
Comments: CVPR 2026. Project Page: [this https URL](https://euleriansplatting.github.io)
Click to view abstract
Abstract:We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density using a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based optimization. To stabilize the optimization, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our framework eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF 360 while preserving 3DGS-level rendering speed.
155. 【2605.29122】Robust Cross-Domain Generalization Using Unlabeled Target Data with Source-Domain Supervision
Link: https://arxiv.org/abs/2605.29122
Authors: Yuyue Zhou, Shrimanti Ghosh, Michael (Kai Yue) Xie, Justin JY Kim, Jessica Knight, Steel McDonald, Vincent Man, Jacob L. Jaremko, Abhilash Hareendranathan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generalize medical imaging, clinical sites, difficult and costly, desirable to generalize, generalize medical
Comments:
Click to view abstract
Abstract:It is often desirable to generalize medical imaging AI models trained with dense annotations to data acquired from different ultrasound scanners or clinical sites; however, retraining these models with new annotations is often difficult and costly. We examine this challenge in pediatric wrist fracture assessment using point-of-care ultrasound (POCUS), where fractures are common and can be effectively triaged via ultrasound. AI has shown radiologist-level performance for fracture detection, often aided by high-quality bony structure segmentation. However, due to significant domain shifts, models perform poorly on data from other centers or probes, and obtaining segmentation labels across devices is impractical due to manual annotation effort and data privacy concerns. To address this, we propose a target-informed self-supervised pretraining and model-ensemble strategy. Specifically, our approach combines masked image modeling (MIM) and contrastive learning to learn target-domain structural representations without labels, and introduces a confidence-aware infusion head to adaptively integrate predictions. The source dataset, collected with a Philips Lumify probe, contained dense labels, while the target dataset, acquired with a TeleMED portable probe, was unlabeled. The datasets were kept strictly separate throughout the entire process. Our method used labeled source data for supervised training and leveraged target-domain pretraining to improve generalization. On 318 images from 62 pediatric POCUS videos, this approach significantly improved cross-device performance, achieving over 6% Dice improvement on the target domain versus the baseline. These results demonstrate a label-efficient and privacy-preserving approach for cross-device-robust ultrasound AI, offering a framework that can be extended to multi-center studies or federated learning setups.
156. 【2605.29098】Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
Link: https://arxiv.org/abs/2605.29098
Authors: Jiachen Lu, Hailan Shanbhag, Haitham Al Hassanieh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reconstructing object geometry, fundamentally challenging due, lensless imaging nature, low spatial resolution, Reconstructing object
Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Click to view abstract
Abstract:Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework GeRaF 2.0 that leverages the outside LoS geometry to model and guide RF propagation from the LoS region into the NLoS region. By integrating visual LoS priors into the neural field formulation, GeRaF 2.0 achieves stable training and physically consistent reconstruction of both visible and hidden geometry, setting a new state-of-the-art in RF-based geometry reconstruction.
157. 【2605.29097】GeRaF: Neural Geometry Reconstruction from Radio Frequency Signals
Link: https://arxiv.org/abs/2605.29097
Authors: Jiachen Lu,Hailan Shanbhag,Haitham Al Hassanieh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: neural implicit learning, radio frequency, neural implicit, reconstruction from radio, signals
Comments: Accepted at NeurIPS 2025 (Spotlight)
Click to view abstract
Abstract:GeRaF is the first method to use neural implicit learning for near-range 3D geometry reconstruction from radio frequency (RF) signals. Unlike RGB or LiDAR-based methods, RF sensing can see through occlusion but suffers from low resolution and noise due to its lensless imaging nature. While lenses in RGB imaging constrain sampling to 1D rays, RF signals propagate through the entire space, introducing significant noise and leading to cubic complexity in volumetric rendering. Moreover, RF signals interact with surfaces via specular reflections, requiring fundamentally different modeling. To address these challenges, GeRaF (1) introduces filter-based rendering to suppress irrelevant signals, (2) implements a physics-based RF volumetric rendering pipeline, and (3) proposes a novel lensless sampling and lensless alpha blending strategy that makes full-space sampling feasible during training. By learning signed distance functions, reflectiveness, and signal power through MLPs and trainable parameters, GeRaF takes the first step towards reconstructing millimeter-level geometry from RF signals in real-world settings.
158. 【2605.29092】Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection
Link: https://arxiv.org/abs/2605.29092
Authors: Sunghwan Baek,Tariq Anwaar,Karanveer Singh,Rita Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: Current face video, Local Binary Patterns, dual-stream backbones, Current face, wide or dual-stream
Comments: 13 pages, 6 figures, 3 tables
Click to view abstract
Abstract:Current face video forgery detectors use wide or dual-stream backbones. We show that a single, lightweight fusion of two handcrafted cues can achieve higher accuracy with a much smaller model. Based on the Xception baseline model (21.9 million parameters), we build two detectors: LFWS, which adds a 1x1 convolution to combine a low-frequency Wavelet-Denoised Feature (WDF) with a phase-spectrum channel derived from Spatial-Phase Shallow Learning (SPSL), and LFWL, which merges WDF with Local Binary Patterns (LBP) in the same way. This extra module adds only 292 parameters, keeping the total at 21.9 million, smaller than F3Net (22.5 million) and less than half the size of SRM (55.3 million). Even with this minimal overhead, the fused models increase the average area under the curve (AUC) from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, gains of 3.8 and 4.4 percentage points over the Xception baseline. They also consistently outperform F3Net, SRM, and SPSL on eight public benchmarks, without extra data or test-time augmentation. These results show that carefully paired, handcrafted features, combined through the lightweight fusion block, can provide competitive robustness at a significantly lower cost than comparable frequency-based detectors. Our findings suggest a need to reevaluate scale-driven design choices in face video forgery detection.
159. 【2605.29089】OISD: On-Policy Internal Self-Distillation of Language Models
Link: https://arxiv.org/abs/2605.29089
Authors: Xinyu Liu,Darryl Cherian Jacob,Yang Zhou,Jindong Wang,Pan He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent reinforcement learning, sparse outcome-level rewards, post-training approaches primarily, approaches primarily optimize, largely overlooking predictive
Comments: Under Review for Publication
Click to view abstract
Abstract:Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen--Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at this https URL
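To make the "advantage-weighted Jensen-Shannon alignment" idea more concrete, the sketch below computes a Jensen-Shannon divergence between a final-layer (teacher) distribution and an intermediate-layer (student) distribution, then scales it by a scalar advantage. This is only an illustration: the function names, the use of a single token's logits, and the simple multiplication by the advantage are assumptions, and the abstract does not spell out the paper's actual loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def alignment_loss(final_logits, inter_logits, advantage):
    """Hypothetical per-token term: advantage times JS(teacher, student)."""
    teacher = softmax(final_logits)  # treated as a detached target
    student = softmax(inter_logits)
    return advantage * js_divergence(teacher, student)


# Hypothetical logits over a 3-token vocabulary.
print(alignment_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], advantage=0.8))
```

In a real training loop the teacher distribution would be excluded from gradient computation, which plain Python lists cannot express; an autodiff framework would handle that with a stop-gradient or detach operation.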
160. 【2605.29088】A Deep Learning Iterative Framework for Sentinel-1 Stripmap Enhancement Based on Azimuth Doppler Decomposition
Link: https://arxiv.org/abs/2605.29088
Authors: Juan Francisco Amieva,Christian Ayala,Roberto Del Prete,Mikel Galar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, intrinsic imaging artifacts, Earth observation
Comments: Accepted at the AI4Space Workshop, CVPR 2026
Click to view abstract
Abstract:Synthetic Aperture Radar (SAR) imagery enables all-weather, day-and-night Earth observation; however, it remains difficult to interpret due to speckle noise and other intrinsic imaging artifacts. Sentinel-1 (S1) constitutes one of the most widely used spaceborne SAR missions, offering systematic global coverage, high temporal resolution, dual-polarization imaging, and free data availability. Among S1 modes, Stripmap (SM) provides the highest resolution, yet speckle noise and spatial constraints often hinder applications requiring finer spatial detail. This motivates the need for effective image enhancement strategies. In this work, we propose a self-supervised enhancement framework for S1 SM imagery based on azimuth subaperture decomposition. The method exploits the physical consistency between subaperture reconstructions and the corresponding full-aperture image to generate paired training data without external sensors, simulated ground truth, or multi-temporal stacks. The proposed framework integrates single- and multi-frame learning and incorporates an iterative inference scheme that progressively refines image quality. Experiments on real S1 SM data show that the proposed approach consistently outperforms the widely adopted self-supervised deep learning baseline MERLIN, in terms of PSNR and SSIM, while MERLIN attains higher ENL, highlighting a trade-off between structural fidelity and speckle smoothing. Overall, the results demonstrate that subaperture-based supervision provides a physically grounded, reproducible, and operationally viable approach for SAR image enhancement using S1 data. It is worth noting that the proposed approach can be extended to other SAR platforms, polarizations, and acquisition modes.
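The trade-off mentioned above is between fidelity metrics (PSNR, SSIM) and a smoothing metric (ENL). As a rough sketch, the snippet below uses the common textbook definitions of ENL (squared mean over variance of a homogeneous intensity region) and PSNR. The function names, the flat-list pixel layout, and the `max_value` parameter are illustrative assumptions; the paper may use different conventions.

```python
import math


def enl(region):
    """Equivalent number of looks of a homogeneous intensity region: mean^2 / variance."""
    n = len(region)
    mean = sum(region) / n
    var = sum((x - mean) ** 2 for x in region) / n
    return float("inf") if var == 0 else mean * mean / var


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values from a flat image patch.
print(enl([0.9, 1.1, 1.0, 1.2, 0.8]))
print(psnr([0.5, 0.5, 0.5], [0.45, 0.55, 0.5]))
```

A heavily smoothed output drives the variance of a flat region toward zero and so raises ENL, even if fine structure is lost, which is why a method can score higher on ENL while scoring lower on PSNR and SSIM.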
161. 【2605.29074】Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models
Link: https://arxiv.org/abs/2605.29074
Authors: Jiyao Zhang,Mingxu Zhang,Yitong Peng,Haoxuan Liu,Chenshuo Wang,Yuxing Long,Haoyang Huang,Dongjiang Li,Nan Duan,Hui Shen,Hao Dong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Vision Language Models, current Vision Language, Vision Language, Grasp Point Prediction, Language Models
Comments:
Click to view abstract
Abstract:Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.
162. 【2605.29064】Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
Link: https://arxiv.org/abs/2605.29064
Authors: Neemias da Silva,Myriam Delgado,Rodrigo Minetto,Daniel Silver,Thiago H Silva
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Keywords: prompting shapes language, shapes language generated, multimodal large language, large language models, urban perception setting
Comments: 10 pages, 6 figures
Click to view abstract
Abstract:We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas. Results indicate strong convergence in captions for different personas, whereas justifications display systematic variation associated with socioeconomic and political attributes, while perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.
163. 【2605.29012】Trajectory Constraints for Imaging Inverse Problems
Link: https://arxiv.org/abs/2605.29012
Authors: Chaoyan Huang,Haijie Yuan,Saiprasad Ravishankar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: imaging inverse problems, solving imaging inverse, Diffusion-based and iterative, inverse problems, effective tools
Comments: 20 pages, 10 figures
Click to view abstract
Abstract:Diffusion-based and iterative methods have become effective tools for solving imaging inverse problems. Their reconstruction process naturally forms a trajectory of intermediate estimates. Although these intermediate estimates define a reconstruction trajectory, most methods do not explicitly regularize the transitions between consecutive states. To address this limitation, we introduce TRACE, a training-free TRAjectory-Constrained rEconstruction framework that stabilizes the reconstruction path by coupling adjacent states along the trajectory. This gives a trajectory-level model that can be interpreted as a sequence of proximal updates. Since the exact proximal update is generally intractable, we approximate it with a neural mapping. This yields a diffusion-like reconstruction process with an explicit coupling between neighboring states. We provide a stability analysis showing that temporal coupling bounds trajectory variation and that this control is preserved under untrained network updates. Experiments on linear and nonlinear image reconstruction tasks show that TRACE improves reconstruction quality. Trajectory-level analyses and ablations confirm that temporal coupling directly affects state transitions along the reconstruction path.
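One way to read "coupling adjacent states" as "a sequence of proximal updates" is the standard proximal-point form below. The notation is assumed for illustration and does not come from the abstract: $f$ stands for whatever reconstruction objective is being minimized, $x_k$ for the $k$-th intermediate estimate, and $\lambda > 0$ for a hypothetical coupling weight.

```latex
x_{k+1} \;=\; \operatorname{prox}_{f/\lambda}(x_k)
        \;=\; \arg\min_{x} \; f(x) + \frac{\lambda}{2}\,\lVert x - x_k \rVert_2^2
```

Under this reading, the quadratic term penalizes large jumps between consecutive states, and a larger $\lambda$ keeps each new estimate closer to the previous one. The paper's actual formulation, including how the neural mapping approximates the update, may differ.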
164. 【2605.29004】Auditing Training-Free 3D Shape Retrieval with Diffused Geodesic Moments
Link: https://arxiv.org/abs/2605.29004
Authors: Zhicheng Du,Changyue Liu,Wenji Xi,Zhaotian Xie,Zhuo Deng,Ziheng Zhang,Yang Liu,Lan Ma
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: making isolated component, Reported retrieval scores, component evaluation difficult, isolated component evaluation, descriptors conflate local
Comments:
Click to view abstract
Abstract:Reported retrieval scores for training-free shape descriptors conflate local signal design, normalization, aggregation, codebook fitting, and metric choices, making isolated component evaluation difficult. This paper reframes descriptor evaluation as a {\em protocol audit}. We introduce Diffused Geodesic Moments (DGM), a seed-conditioned descriptor that computes sparse implicit heat responses, converts them to distance-like fields, and summarizes each vertex by low-order moments across seeds and scales. DGM is used both as a practical non-spectral baseline and as an instrument for isolating protocol effects. On the registered FAUST benchmark split (FAUST-Reg) and the TOSCA shape collection, aggregation-matched experiments show that an independent Geometric Moment Shape Descriptor baseline built on Heat Kernel Signature features (GMSD-HKS) obtains the highest scores in this implementation ($0.621/0.820$ and $0.865/0.963$ mean average precision (mAP)/top-1), Wave Kernel Signature (WKS) remains a strong classical signal, and DGM is useful mainly when sparse solves, non-spectral deployment, or symmetry-informative seed frames are priorities. The broader finding is methodological: the input field and aggregation protocol can dominate the moment formula. The paper contributes a reproducible protocol-cascade analysis, a cross-shape alignment diagnostic for functional-map compatibility, and concrete recommendations for designing and reporting training-free shape descriptors.
165. 【2605.28995】GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation
Link: https://arxiv.org/abs/2605.28995
Authors: Polytimi Anna Gkotsi,Andrii Zadaianchuk,Mohammad Mahdi Derakhshani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent approaches integrating, approaches integrating vision-language, spatial structure required, Recent approaches, conditioning typically rely
Comments:
Click to view abstract
Abstract:Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation. To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.
166. 【2605.28962】Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
Link: https://arxiv.org/abs/2605.28962
Authors: Yurong Gao,Zicheng Zhang,Congying Han,Tiande Guo,Xinmin Qiu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion bridge models, bridge models offer, Diffusion bridge, Noise-Aligned Diffusion Bridge, standard diffusion models
Comments: Accepted by CVPR2026
Click to view abstract
Abstract:Diffusion bridge models offer a powerful framework for connecting two data distributions, such as in image restoration and translation. Many existing methods learn this bridge by mimicking the score-matching formulation of standard diffusion models. In this work, we find that this approach leads to an anomalous underfitting phenomenon near the target endpoint, as the process approaches the target distribution ($t \to 0$). This underfitting, characterized by significant drift in the predicted variance and direction, results from an excessively large discrepancy in noise levels between the network's input and its regression target. To resolve this issue, we propose the Noise-Aligned Diffusion Bridge (NADB). Our approach reformulates the diffusion bridge by first employing a mean network to provide a cleaner conditional target, and then introducing a novel, noise-aligned mapping relationship. This new formulation resolves the noise mismatch and corrects the underfitting near the target endpoint. Experimental validation across multiple image restoration and image translation tasks demonstrates the effectiveness of our approach. Code is available at this https URL.
167. 【2605.30167】Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks
Link: https://arxiv.org/abs/2605.30167
Authors: Daniel Tinoco,Raquel Menezes,Carlos Baquero,Alexandra Silva
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
Keywords: complete spatially correlated, Predicting a complete, spatially correlated field, complete spatially, spatially correlated
Comments: 53 pages, 10 figures
Click to view abstract
Abstract:Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on the user defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods, and extending the use of CNNs to a new problem domain.
168. 【2605.29703】Subcortical Shape Variations and Their Associations with Cognition Across the 8th Decade of Life. A Study in the Lothian Birth Cohort 1936
Link: https://arxiv.org/abs/2605.29703
Authors: Maria del C. Valdes-Hernandez,Wonjung Park,Joanna Moodie,Susana Muñoz Maniega,Janie Corley,Fraser N. Sneden,Mark E. Bastin,Joanna M. Wardlaw,Simon R. Cox,Jinah Park
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
Keywords: functionally-relevant brain aging, Lothian Birth Cohort, subcortical brain structures, gross volumetry, capture aspects
Comments: 34 pages
Click to view abstract
Abstract:The study of brain morphology changes in normal individuals may capture aspects of functionally-relevant brain aging not fully indicated by gross volumetry. Despite the important role of subcortical brain structures in cognition, the associations between their morphological trajectories and cognitive changes in aging have not been documented. We use neuroimaging, demographic, and cognitive data from a large longitudinal study of cognitive aging, the Lothian Birth Cohort 1936, to explore shape changes in subcortical brain structures of community-dwelling individuals across their 8th decade of life. We investigate the association of these changes with cognitive aging using ANCOVA and mixed linear model analyses. Subcortical shape changes were heterogeneous, with varied atrophy patterns across the whole period. The hippocampus and the ventral DC experienced varied morphological deformations (from their baseline points) that differed between the left and right hemispheres, while the thalami and globus pallidi shapes, for example, experienced a more uniform volume contraction, nearly symmetrical throughout different timelines. Changes in general cognition were mainly associated with inwards and outwards vertex displacements between the time-points.
169. 【2605.29415】Constructing efficient channels for ideal observers using the conjugate gradient method
Link: https://arxiv.org/abs/2605.29415
Authors: Weimin Zhou
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Keywords: Task-based assessment, medical imaging systems, Bayesian Ideal Observer, critically important, design and optimization
Comments: Submitted to the Journal of Medical Imaging (JMI) Special Issue Honoring Dr. Harrison H. Barrett
Click to view abstract
Abstract:Task-based assessment of image quality (IQ) is critically important for the design and optimization of medical imaging systems. Ideal observers, including the Bayesian Ideal Observer (IO) and the ideal linear observer, i.e., the Hotelling observer (HO), provide objective figures of merit (FOMs) that quantify system performance on signal detection tasks. However, the application of ideal observers to high-dimensional image data is often computationally intractable. Channel mechanisms provide an effective framework for dimensionality reduction that can facilitate the computation of ideal observers. This work presents a conjugate gradient (CG)-based method to construct efficient channels for approximating the IO and HO performance.
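As background for the conjugate gradient (CG) approach, the sketch below shows the textbook CG iteration applied to the Hotelling observer template equation $K w = \Delta s$, where $K$ is a symmetric positive-definite covariance matrix and $\Delta s$ is the mean signal difference. This is plain CG on a toy 2x2 system, not the paper's channel-construction procedure, which the abstract does not detail; the function names and the toy values are assumptions.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A, starting from x = 0."""
    n = len(b)
    max_iter = 10 * n if max_iter is None else max_iter
    x = [0.0] * n
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction, conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy covariance K and mean signal difference ds (hypothetical values).
K = [[4.0, 1.0], [1.0, 3.0]]
ds = [1.0, 2.0]
w = conjugate_gradient(K, ds)  # Hotelling template, w = K^{-1} ds
snr_squared = dot(ds, w)       # Hotelling SNR^2 = ds^T K^{-1} ds
print(w, snr_squared)
```

CG only needs matrix-vector products with $K$, so the covariance matrix never has to be inverted or even stored explicitly, which is the usual reason it is attractive for high-dimensional image data.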
170. 【2605.29063】Accelerating HEVC Intra Partitioning via a CNN-Hierarchical Attention Transformer Hybrid
Link: https://arxiv.org/abs/2605.29063
Authors: Krishna Kumar Sharma,Somdyuti Paul
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Efficiency Video Coding, High Efficiency Video, Video Coding, considerable computational overhead, High Efficiency
Comments:
Click to view abstract
Abstract:The recursive quad-tree partitioning in High Efficiency Video Coding (HEVC) incurs considerable computational overhead, with exhaustive rate-distortion optimization for CTU partition prediction consuming the dominant share of encoding time. Although partition prediction through deep learning has emerged as a viable encoding accelerator, an architectural dichotomy remains largely unaddressed: CNNs are computationally efficient but spatially myopic due to their localized effective receptive fields, failing to capture long range semantic relationships and repetitive textures; conversely, transformer based architectures are better at capturing global context but incur prohibitive CPU latency, a critical liability that impedes deployment which is predominantly CPU-bound. This paper introduces Hybrid Fast Vision Transformer (HFViT), a hybrid architecture designed to accelerate HEVC intra-mode partition prediction. HFViT fuses a reparameterized depthwise-separable convolutional backbone with a Hierarchical Attention Transformer (HAT) mechanism, leveraging a carrier token scheme to enable efficient global information propagation at sub-quadratic complexity. Post-training structural fusion collapses batch normalization into preceding layers to further reduce latency. Comprehensive evaluation reveals the efficacy of HFViT in accelerating HEVC intra-encoding across resolutions. On standard JCT-VC test sequences, HFViT reduces the average VMAF BD-rate penalty by 2.4, 2.6, and 7.9 percentage points on Classes A, B and E, respectively, as compared to the competing ETH-CNN baseline while maintaining CPU inference latency within 8% of the CNN baseline and surpassing it on GPU by 40%, establishing practical viability for real-time encoder integration.
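The "post-training structural fusion" step, which collapses batch normalization into the preceding layer, relies on a standard identity: an affine layer followed by inference-time batch norm is itself a single affine layer. Below is a minimal sketch using a fully connected layer as a stand-in for a convolution. The function names and toy values are illustrative assumptions and are not taken from the HFViT implementation.

```python
import math


def linear(W, b, x):
    """y = W x + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def batchnorm(z, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm using fixed running statistics."""
    return [g * (zi - m) / math.sqrt(v + eps) + bt
            for zi, g, bt, m, v in zip(z, gamma, beta, mean, var)]


def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return (W', b') such that W' x + b' equals batchnorm(W x + b)."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    W_folded = [[s * w for w in row] for row, s in zip(W, scale)]
    b_folded = [s * (bi - m) + bt for s, bi, m, bt in zip(scale, b, mean, beta)]
    return W_folded, b_folded


# Hypothetical 2-output, 3-input layer and batch norm statistics.
W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [2.0, 0.5]
x = [1.0, 2.0, -1.0]

W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
print(batchnorm(linear(W, b, x), gamma, beta, mean, var))
print(linear(W_f, b_f, x))
```

After folding, the normalization costs nothing at inference time because its scale and shift are baked into the layer's weights and bias. The same per-output-channel scaling applies to convolution kernels.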

