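This blog post presents the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.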

Statistics

798 papers were updated today, including:

  • Natural Language Processing: 153
  • Information Retrieval: 19
  • Computer Vision: 125

Natural Language Processing

1. 【2606.06492】Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Link: https://arxiv.org/abs/2606.06492

Authors: Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: resolve imports, project conventions, repository-level context, context to resolve, Code language models

Abstract:Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at this https URL; the model checkpoints and RepoPeftBench datasets can be found at this https URL.

2. 【2606.06481】Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

Link: https://arxiv.org/abs/2606.06481

Authors: Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: progressive human-AI co-editing, longer purely human-written, human-AI co-editing, assistants become increasingly, increasingly integrated

Comments: Our code and data are available at [this https URL](https://github.com/VILA-Lab/OpAI-Bench)

Abstract:As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at this https URL.

3. 【2606.06474】Self-Augmenting Retrieval for Diffusion Language Models

Link: https://arxiv.org/abs/2606.06474

Authors: Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: models generate text, Discrete diffusion language, language models generate, diffusion language models, response in parallel

Comments: ICML 2026

Abstract:Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.
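
The lookahead idea in this abstract can be illustrated with a toy sketch. It is not the SARDI implementation: the function names, the confidence threshold, and the way the query is assembled are all assumptions made for illustration. Low-confidence tokens are withheld from the output but still contribute terms to a retrieval query.

```python
def denoise_step(predictions, threshold):
    """Split tentative tokens into committed ones and discarded lookahead tokens.

    `predictions` maps a masked position to a (token, confidence) pair.
    The threshold rule is a hypothetical stand-in for the model's commit policy.
    """
    committed = {pos: tok for pos, (tok, conf) in predictions.items() if conf >= threshold}
    lookahead = {pos: tok for pos, (tok, conf) in predictions.items() if conf < threshold}
    return committed, lookahead


def retrieval_query(question, lookahead):
    """Append the discarded tokens, in position order, to the question as extra query terms."""
    extra = [lookahead[pos] for pos in sorted(lookahead)]
    return " ".join([question] + extra)


# Hypothetical predictions for three masked positions.
predictions = {0: ("The", 0.95), 1: ("Curie", 0.40), 2: ("Nobel", 0.35)}
committed, lookahead = denoise_step(predictions, threshold=0.9)
query = retrieval_query("Who discovered radium?", lookahead)
```

Here the entity tokens stay out of the output at this step, yet they already sharpen the query that a retriever (any retriever, per the abstract) would receive.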

4. 【2606.06473】MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Link: https://arxiv.org/abs/2606.06473

Authors: Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language model, Large language, machine learning engineering, language model, key capability

Abstract:Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at this https URL.

5. 【2606.06467】You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Link: https://arxiv.org/abs/2606.06467

Authors: Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: generate long intermediate, long intermediate chains, models generate long, chains of thought, increasingly constrained

Abstract:Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
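
The shared-routing idea described in this abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the CLSA architecture: the dot-product indexer, the function names, and the single-head list-based layout are hypothetical. The point is that top-k selection runs once per decoding step and every layer reuses the resulting index over the same shared KV cache.

```python
import math


def top_k_indices(query, keys, k):
    """Score every cached key against the query and return the k best positions."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the cached positions listed in `index`."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in index]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * values[i][d] for w, i in zip(weights, index)) for d in range(dim)]


def decode_step(layer_queries, keys, values, indexer_query, k):
    """Route once with the indexer, then reuse the index in every layer."""
    shared_index = top_k_indices(indexer_query, keys, k)
    outputs = [sparse_attention(q, keys, values, shared_index) for q in layer_queries]
    return shared_index, outputs


# Toy shared KV cache with four positions and three layers reading from it.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
layer_queries = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shared_index, outputs = decode_step(layer_queries, keys, values, [1.0, 1.0], k=2)
```

With one routing pass per step instead of one per layer, the cost of scoring the full cache is amortized across layers, which is the trade-off the abstract attributes to sharing the index.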

6. 【2606.06464】Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

Link: https://arxiv.org/abs/2606.06464

Authors: Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

Categories: Computation and Language (cs.CL)

Keywords: causal learning literature, long-standing finding, learning literature, simultaneous presence, presence of multiple

Comments: Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

Abstract:A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

7. 【2606.06454】Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

Link: https://arxiv.org/abs/2606.06454

Authors: Mehmet Iscan

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Large language models, fast-growing practice equips, models increasingly write, language models increasingly, Large language

Comments: 34 pages, 5 figures, 8 tables

Abstract:Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill's Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.

8. 【2606.06447】Latent Reasoning with Normalizing Flows

Link: https://arxiv.org/abs/2606.06447

Authors: Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang, Lianhui Qin, Yizhe Zhang, Jiatao Gu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, Large language, demonstrating the importance, Large, intermediate computation

Abstract:Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.

9. 【2606.06443】Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

Link: https://arxiv.org/abs/2606.06443

Authors: Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo

Categories: Computation and Language (cs.CL); Multimedia (cs.MM); Social and Information Networks (cs.SI)

Keywords: Large language models, Large language, social media users, simulate social media, language models

Abstract:Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user's stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user's stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.

10. 【2606.06428】Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Link: https://arxiv.org/abs/2606.06428

Authors: Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich

Categories: Computation and Language (cs.CL)

Keywords: undergoing continued training, Prior work, work has shown, shown that large, undergoing continued

Comments: 15 pages, 2 figures

Abstract:Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.
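
To make the "surface-level translation metric as reward" concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch, not the scorer used in the paper: whitespace stripping, the handling of short strings, and the function names are simplifying assumptions, and a standard chrF implementation should be used for real evaluation.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_like_reward(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n, then combine as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A reward of this kind needs only the model output and a reference translation, which is what makes it lightweight compared with a learned judge.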

11. 【2606.06420】A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

Link: https://arxiv.org/abs/2606.06420

Authors: Petr Parshakov

Categories: Computation and Language (cs.CL)

Keywords: Russian parallel corpus, Russian parallel, extremely low-resource setting, extremely low-resource, explicit evaluation protocol

Comments: 18 pages, 6 tables, 3 figures

Abstract:We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. The protocol includes story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime. Retrieval-based few-shot prompting consistently improves over zero-shot prompting, while gains beyond a small retrieved context remain limited. The results show that evaluative conclusions in this setting depend materially on metric choice and failure handling, so the paper frames the corpus as both a dataset contribution and a reproducible evaluation testbed for endangered-language machine translation.

12. 【2606.06416】Unsupervised Skill Discovery for Agentic Data Analysis

Link: https://arxiv.org/abs/2606.06416

Authors: Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Inference-time skill augmentation, injecting reusable procedural, reusable procedural knowledge, Inference-time skill, updating model parameters

Comments: Work in progress

Abstract:Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.
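
The answer-agreement signal mentioned in this abstract can be approximated with a small sketch. Everything below is an assumption for illustration (the trajectory dictionary layout, the function names, and the pairing rule); the abstract does not specify how DataCOPE groups trajectories or forms contrastive examples.

```python
from collections import Counter


def agreement_scores(trajectories):
    """Score each trajectory by the share of trajectories that reach the same final answer."""
    counts = Counter(t["answer"] for t in trajectories)
    total = len(trajectories)
    return [counts[t["answer"]] / total for t in trajectories]


def contrastive_pairs(trajectories):
    """Pair highest-agreement trajectories with the rest as (preferred, rejected)."""
    scores = agreement_scores(trajectories)
    best = max(scores)
    preferred = [t for t, s in zip(trajectories, scores) if s == best]
    rejected = [t for t, s in zip(trajectories, scores) if s < best]
    return [(p, r) for p in preferred for r in rejected]


# Hypothetical unlabeled trajectories for one task.
trajectories = [
    {"id": "a", "answer": "42"},
    {"id": "b", "answer": "42"},
    {"id": "c", "answer": "41"},
]
scores = agreement_scores(trajectories)
pairs = contrastive_pairs(trajectories)
```

No ground-truth label is involved: the majority answer acts as a self-consistency proxy, and the resulting pairs are the kind of input a contrastive distillation step could consume.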

13. 【2606.06399】CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Link: https://arxiv.org/abs/2606.06399

Authors: Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

Categories: Computation and Language (cs.CL)

Keywords: shown growing promise, Multi-agent systems, large language models, built on large, growing promise

Abstract:Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

14. 【2606.06388】Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Link: https://arxiv.org/abs/2606.06388

Authors: Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enabled complex cognitive, Recent advances, mental model annotations, complex cognitive capabilities, multi-step reasoning

Abstract:Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

15. 【2606.06380】Emergent Language as an Approach to Conscious AI

Link: https://arxiv.org/abs/2606.06380

Authors: Zengqing Wu, Chuan Xiao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)

Keywords: consciousness-inspired modules directly, human language priors, conscious remains open, engineer consciousness-inspired modules, language priors

Comments: Source codes available at [this https URL](https://github.com/wuzengqing001225/ConsciousAI_Indexicality/)

Abstract:The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.

16. 【2606.06350】EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

Link: https://arxiv.org/abs/2606.06350

Authors: Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He

Categories: Computation and Language (cs.CL)

Keywords: accurate score prediction, Reliable rubric grading, Reliable rubric, Reliable, rubric grading requires

Abstract:Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model's belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.

17. 【2606.06349】"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

Link: https://arxiv.org/abs/2606.06349

Authors: Edoardo Signoroni, Pavel Rychlý

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, terms of Natural, Language Processing, Natural Language, Machine Translation

Comments: Submitted to TSD 2026

Abstract:Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

18. 【2606.06320】Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

Link: https://arxiv.org/abs/2606.06320

Authors: Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Machine unlearning aims, remove targeted knowledge, Machine unlearning, general capabilities, aims to remove

Abstract:Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.

19. 【2606.06306】Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

Link: https://arxiv.org/abs/2606.06306

Authors: Victor De Marez,Luna De Bruyne,Walter Daelemans

Categories: Computation and Language (cs.CL)

Keywords: language model abandons, abandons a correct, Factual sycophancy, verifiable answer, Factual sycophancy occurs

Comments

Click to view abstract

Abstract:Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. Because a flip occurs only when pressure toward a false answer exceeds the model's neutral preference for the truth, flip rates conflate two mechanisms: the strength of that baseline preference (truth margin), and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these channels and use them to separate the effects of size and instruction tuning across 56 open-weight models spanning 0.3B-32B parameters and 13 manipulation types. We find that vulnerability is governed mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning primarily increases truth margin, but its behavioral effect depends on manipulation type. Scaling also changes the two channels differently: base models gain margin but become mildly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property. Evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.
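
The decomposition described in this abstract can be illustrated with a small sketch. It assumes, as a simplification not taken from the paper, that truth margin is the log-odds of the true answer over the false one under a neutral prompt, and that manipulation sensitivity is the drop in that log-odds under pressure. The function names and probabilities are hypothetical.

```python
import math


def truth_margin(p_true: float, p_false: float) -> float:
    """Log-odds preference for the true answer (assumed definition)."""
    return math.log(p_true) - math.log(p_false)


def pressure_shift(neutral_margin: float, pressured_margin: float) -> float:
    """How far pressure moves the margin toward the false answer."""
    return neutral_margin - pressured_margin


def flips(neutral_margin: float, shift: float) -> bool:
    """A flip happens only when the shift exceeds the neutral margin."""
    return shift > neutral_margin


# Two hypothetical models with the same sensitivity but different margins.
shift = pressure_shift(truth_margin(0.8, 0.2), truth_margin(0.4, 0.6))
weak_margin = truth_margin(0.8, 0.2)      # about 1.39
strong_margin = truth_margin(0.95, 0.05)  # about 2.94

print(flips(weak_margin, shift))    # True
print(flips(strong_margin, shift))  # False
```

The same shift flips one model and not the other, which is why a flip rate alone cannot separate the two channels.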

20. 【2606.06286】LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

Link: https://arxiv.org/abs/2606.06286

Authors: Gianluca Barmina,Peter Schneider-Kamp,Lukas Galke Poech

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, DFM Decoder, Large, memorization

Comments

Click to view abstract

Abstract:Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits report both worst-case extractability and ordinary leakage propensity to give a more comprehensive view of this phenomenon.

21. 【2606.06271】FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

Link: https://arxiv.org/abs/2606.06271

Authors: Yijun Liu,Yifan Song,John Gallagher,Sarah Sterman,Tal August

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: writing research identifies, large language models, central to revision, large language, research identifies

Comments

Click to view abstract

Abstract:While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.

22. 【2606.06267】Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

Link: https://arxiv.org/abs/2606.06267

Authors: Alireza Bayat Makou,Jingcheng Niu,Subhabrata Dutta,Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: specific model behaviors, explain specific model, methods identify subgraphs, discovery methods identify, structural differences

Comments: 90 pages, 53 figures

Click to view abstract

Abstract:Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.

23. 【2606.06266】From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation

Link: https://arxiv.org/abs/2606.06266

Authors: Paloma Piot,Javier Parapar

Categories: Computation and Language (cs.CL)

Keywords: Hate speech detection, Hate speech, inherently subjective, speech detection, detection is inherently

Comments

Click to view abstract

Abstract:Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.

24. 【2606.06260】OneReason Technical Report

Link: https://arxiv.org/abs/2606.06260

Authors: OneRec Team,Biao Yang,Boyang Ding,Chenglong Chu,Dunju Zang,Fei Pan,Han Li,Hao Jiang,Honghui Bao,Huanjie Wang,Jian Liang,Jiangxia Cao,Jiao Ou,Jiaxin Deng,Jinghao Zhang,Kun Gai,Lu Ren,Peiru Du,Pengfei Zheng,Rongzhou Zhang,Ruiming Tang,Shiyao Wang,Siyang Mao,Siyuan Lou,Teng Shi,Wei Yuan,Wenlong Xu,Xingchen Liu,Xingmei Wang,Xinqi Jin,Yan Sun,Yan Wang,Yifei Hu,Yingzhi He,Yufei Ye,Yuhao Wang,Yunhao Zhou,Yuqin Dai,Zhao Liu,Zhipeng Wei,Zhixin Ling,Ziming Li,Zixing Zhang,Ziyuan Liu,An Zhang,Changxin Lao,Chaoyi Ma,Chengru Song,Defu Lian,Fan Yang,Guowang Zhang,Hao Peng,Jiayao Shen,Jie Chen,Jun Xu,Junmin Chen,Kun Zhang,Kuo Cai,Mingxing Wen,Minmao Wang,Minxuan Lv,Qi Zhang,Qiang Luo,Sheng Yu,Shijie Li,Shijie Yi,Shuang Yang,Shugui Liu,Shuni Chen,Tinghai Zhang,Tingting Gao,Xiang Wang,Xiangyu Wu,Xiangyu Zhao,Xiao Lv,Xiaoyou Zhou,Xuming Wang,Yong Du,Zejian Zhang,Zhaojie Liu,Zhiyang Zhang,Zhuang Zhuang,Ziqi Wang,Ziyi Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: real-world services, OneRec family, widely deployed, Generative recommendation models, Generative recommendation

Comments: Work in progress

Click to view abstract

Abstract:Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.

25. 【2606.06242】Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

Link: https://arxiv.org/abs/2606.06242

Authors: AJ Carl P. Dy,Aivin V. Solatorio

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: figures and tables, Institutional documents, analytical information embedded, substantial amounts, generic document layout

Comments: 23 pages, 8 figures

Click to view abstract

Abstract:Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at this https URL and the source code is available at this https URL.

26. 【2606.06211】FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

Link: https://arxiv.org/abs/2606.06211

Authors: Fernando López,Santosh Kesiraju,Jordi Luque

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Automatic speech recognition, neurological conditions remains, Feature-wise Linear Modulation, significant challenge, Automatic speech

Comments: Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop

Click to view abstract

Abstract:Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
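
Feature-wise Linear Modulation scales and shifts each feature of a hidden vector using parameters predicted from a conditioning vector. The minimal sketch below shows that operation in plain Python. The tiny dimensions, the weight values, and the helper names `linear` and `film` are illustrative assumptions, and this is not the authors' implementation.

```python
def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(hidden, cond, gamma_w, gamma_b, beta_w, beta_b):
    """Apply gamma(cond) * hidden + beta(cond), feature by feature."""
    gamma = linear(gamma_w, gamma_b, cond)
    beta = linear(beta_w, beta_b, cond)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


hidden = [1.0, -2.0]        # stand-in for one encoder activation
speaker = [0.5, 1.0, 0.0]   # stand-in for a speaker embedding

zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Zero weights with gamma bias 1 and beta bias 0 leave the activation unchanged.
identity = film(hidden, speaker, zeros, [1.0, 1.0], zeros, [0.0, 0.0])
print(identity)  # [1.0, -2.0]

# Non-zero gamma weights make the scaling depend on the speaker vector.
gamma_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
modulated = film(hidden, speaker, gamma_w, [1.0, 1.0], zeros, [0.1, -0.1])
print(modulated)  # roughly [1.6, -4.1]
```

Because only the small gamma and beta projections carry trainable parameters in such a setup, the base encoder weights can stay frozen, which matches the motivation stated in the abstract.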

27. 【2606.06203】Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

Link: https://arxiv.org/abs/2606.06203

Authors: Giovanni Dettori,Matteo Boffa,Danilo Giordano,Idilio Drago,Marco Mellia

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: degraded LLM long-context, widely cited, LLM long-context performance, degraded LLM, LLM long-context

Comments: 20 pages, 6 figures

Click to view abstract

Abstract:Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three "find-the-needle" style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.
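
One crude way to picture a density difference is the fraction of distinct tokens in contexts of equal length. This is only a stand-in: the abstract does not specify the paper's own density measure, and the whitespace tokenizer and the `distinct_token_rate` helper below are assumptions made for illustration.

```python
def distinct_token_rate(text: str) -> float:
    """Share of whitespace-separated tokens that are distinct (a crude proxy)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Two hypothetical contexts of equal length (8 tokens each).
sparse = "the report is here the report is here"
dense = "alpha beta gamma delta epsilon zeta eta theta"

print(distinct_token_rate(sparse))  # 0.5
print(distinct_token_rate(dense))   # 1.0
```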

28. 【2606.06197】Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

Link: https://arxiv.org/abs/2606.06197

Authors: Hafez Abdelghaffar,Ahmed Alansary,Ali Hamdi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: achieved notable progress, Question answering, large language models, achieved notable, notable progress

Comments: 7 pages, IMSA2026

Click to view abstract

Abstract:Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.

29. 【2606.06188】The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

Link: https://arxiv.org/abs/2606.06188

Authors: Jinyang Zhang,Hongxin Ding,Yue Fang,Weibin Liao,Muyang Ye,Junfeng Zhao,Yasha Wang

Categories: Computation and Language (cs.CL)

Keywords: understand Large Language, Large Language Models, Large Language, Recent work, understand Large

Comments: ICML

Click to view abstract

Abstract:Recent work has sought to understand reasoning in Large Language Models (LLMs), yet a principled, model-intrinsic signal that captures their layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the $\ell_2$ norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the $\ell_2$ norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the $\ell_2$ norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by $\ell_2$ norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) $\ell_2$-guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that $\ell_2$-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at this https URL.

30. 【2606.06178】Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

Link: https://arxiv.org/abs/2606.06178

Authors: Jiahao Zeng,Ming Tang,Ningning Ding

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, incur greater expense, powerful models incur, models incur greater

Comments

Click to view abstract

Abstract:Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to mitigate expenses while maintaining performance by sending queries to the most suitable model. However, existing methods cannot perform well for different user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users' implicit preferences through little interaction. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.

31. 【2606.06177】Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

Link: https://arxiv.org/abs/2606.06177

Authors: Giuseppe Attanasio,Beatrice Savoldi,Daniel Chechelnitsky,Matteo Negri,Marine Carpuat,Maarten Sap,André F.T. Martins

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: end users' communication, Speech translation, speech translation outputs, user applications, increasingly adopted

Comments: Code and data at [this https URL](https://github.com/g8a9/ouvia)

Click to view abstract

Abstract:Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

32. 【2606.06168】ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

Link: https://arxiv.org/abs/2606.06168

Authors: Prathamjyot Singh,Ashima Sood,Sahil Sharma,Jasmeet Singh

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: utterance-level emotional baseline, local prosodic dynamics, Global Emotion Encoder, Prosodic Incongruity Analyzer, Temporal Prosody Encoder

Comments: Accepted at Interspeech 2026, Sydney

Click to view abstract

Abstract:We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.

33. 【2606.06160】Where does Absolute Position come from in decoder-only Transformers?

Link: https://arxiv.org/abs/2606.06160

Authors: Valeria Ruscio,Umberto Nanni,Fabrizio Silvestri

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: RoPE-trained transformers distinguish, transformers distinguish absolute, distinguish absolute position, RoPE-trained transformers, transformers distinguish

Comments

Click to view abstract

Abstract:RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the `BOS` embedding before the forward pass removes $40\%$ of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended `BOS` and varying with it otherwise.
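
The first component can be seen in a toy setting. If every key receives the same score, so that neither content nor relative offset carries any information, the causal mask alone makes the weight on position 0 equal to 1/(i+1) for a query at position i. The sketch below is an illustrative example under that assumption, not the paper's analysis, and `causal_attention_row` is a hypothetical helper.

```python
import math


def causal_attention_row(scores, query_pos):
    """Softmax over the keys a causal mask leaves visible (positions 0..query_pos)."""
    visible = scores[: query_pos + 1]
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in visible]
    denom = sum(exps)    # this denominator grows with the query position
    return [e / denom for e in exps]


scores = [0.0] * 8  # identical logits: no content or offset signal

for pos in (0, 1, 3, 7):
    weight_on_first = causal_attention_row(scores, pos)[0]
    print(pos, weight_on_first)  # 1.0, 0.5, 0.25, 0.125
```

The weight on the first token shrinks with the query index even though the scores themselves are position-free, so the attention pattern reveals the absolute position.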

34. 【2606.06109】Harnessing Structural Context for Entity Alignment Foundation Models

Link: https://arxiv.org/abs/2606.06109

Authors: Xingyu Chen,Yuanning Cui,Zequn Sun,Wei Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: identify equivalent entities, heterogeneous knowledge graphs, Entity alignment, aims to identify, identify equivalent

Comments

Click to view abstract

Abstract:Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.

35. 【2606.06098】IR3DE: A Linear Router for Large Language Models

Link: https://arxiv.org/abs/2606.06098

Authors: Eros Fanì,Oğuzhan Ersoy

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Foundational Large Language, Large Language Models, Foundational Large, Language Models, Large Language

Comments: Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference

Click to view abstract

Abstract:Foundational Large Language Models (LLMs) demonstrate proficiency on a wide range of general tasks, and achieve remarkable results on various specialized tasks via domain-expert LLMs. With the ever-growing list of available LLMs, inference routers are being proposed to select the most appropriate LLM for each prompt. However, existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expertise routing. In this paper, we propose IR3DE, a Ridge Regression-based Router for Domain Experts that provides cheap and fast routing decisions for each prompt. We evaluate IR3DE in two Causal Language Modeling (CLM) settings where the tasks are next-token prediction for all domains, and one reasoning setting where each domain has its own distinct reasoning task. Despite being a linear router, IR3DE achieves performance comparable to the other baselines in both CLM settings, and surpassing them in the reasoning setting, with a normalized performance of 98.4%. Moreover, IR3DE enables the addition or removal of new domain experts without requiring the router to be retrained from scratch, allowing a dynamic set of LLMs to be served with minimal disruption to the router itself. Our code is available at: this http URL.

36. 【2606.06096】OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Link: https://arxiv.org/abs/2606.06096

Authors: Paavo Parmas,Yongmin Kim,Kohsei Matsutani,Shota Takashiro,Soichiro Nishimori,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: real world applications, world applications care, optimize expected return, expected return, tail risk

Comments

Click to view abstract

Abstract:Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: this https URL
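
The objective family named in this abstract, weighted averages of sorted rewards, can be sketched in a few lines. The code below only evaluates such L-statistics for different rank weights. It does not implement the OrderGrad gradient estimator, and the helper names and the lower-tail convention for CVaR on rewards are assumptions made for illustration.

```python
def l_statistic(rewards, rank_weights):
    """Weighted average of the rewards after sorting them in ascending order."""
    ordered = sorted(rewards)
    return sum(w * r for w, r in zip(rank_weights, ordered))


def mean_weights(n):
    return [1.0 / n] * n


def cvar_weights(n, k):
    """Average of the k lowest rewards (assumed lower-tail convention)."""
    return [1.0 / k] * k + [0.0] * (n - k)


def median_weights(n):
    weights = [0.0] * n
    if n % 2 == 1:
        weights[n // 2] = 1.0
    else:
        weights[n // 2 - 1] = weights[n // 2] = 0.5
    return weights


def best_of_k_weights(n):
    """All weight on the largest reward in the sample."""
    return [0.0] * (n - 1) + [1.0]


rewards = [3.0, -1.0, 7.0, 0.0, 2.0]  # hypothetical sample, sorted: -1, 0, 2, 3, 7
n = len(rewards)

print(l_statistic(rewards, mean_weights(n)))       # about 2.2
print(l_statistic(rewards, cvar_weights(n, 2)))    # -0.5
print(l_statistic(rewards, median_weights(n)))     # 2.0
print(l_statistic(rewards, best_of_k_weights(n)))  # 7.0
```

Only the weight vector changes between the four objectives, which mirrors the abstract's point that different criteria are recovered by changing only the rank weights.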

37. 【2606.06088】CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

Link: https://arxiv.org/abs/2606.06088

Authors: Michal Tichý, Jindřich Libovický

Categories: Computation and Language (cs.CL)

Keywords: Challenging Language Identification, Language Identification Samples, address difficult cases, benchmark dataset explicitly, dataset explicitly designed

Comments: 7 pages

Abstract:We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts: First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthography noise: we transliterate text across multiple scripts, remove diacritics, simulate homoglyph attacks, and use Internet slang. We evaluate four widely used language identification systems on CHALIS and demonstrate that all struggle substantially in these scenarios, especially on lower-resource languages within cousin pairs and on transliterated input. The resource is publicly available at this https URL.

38. 【2606.06087】LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Link: https://arxiv.org/abs/2606.06087

Authors: Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reusable task procedures, encode reusable task, step incurs substantial, Agent systems increasingly, incurs substantial context

Comments: 16 pages, 4 figures

Abstract:Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.

39. 【2606.06080】On Advantage Estimates for Max@K Policy Gradients

Link: https://arxiv.org/abs/2606.06080

Authors: Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: make exploration difficult, outcome rewards make, rewards make exploration, sparse outcome rewards, Reinforcement learning

Abstract:Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.

40. 【2606.06079】SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization

Link: https://arxiv.org/abs/2606.06079

Authors: Qi Zhang, Zhaopeng Feng, Xiaonan Shi, Xiaomeng Hu, Chu Liu, Pengjun Xie, Xiaobin Wang, Jieping Ye, Bryan Hooi, Haobo Wang, Junbo Zhao

Categories: Computation and Language (cs.CL)

Keywords: shown strong potential, guide agent reasoning, improving model capability, reasoning and action, consist of reusable

Comments: Under Review

Abstract:Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental tension: a skill tailored to the specific task fails to transfer, while the abstracted skill often provides insufficient guidance. We attribute this fragility to the absence of explicit mechanisms for skill specification and generalization. To address this gap, we introduce SkillComposer, a framework that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via a systematic rejection sampling recipe, SkillComposer enables language models to self-evolve skills at inference time and supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. Comprehensive experiments on $\tau^2$-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Our SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks, while generalizing across domains and task types unseen during training. Analysis reveals that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.

41. 【2606.06065】Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

Link: https://arxiv.org/abs/2606.06065

Authors: Seung Hwan Cho, Young-Min Kim

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: pronunciations and intended, Second-language, English, MTL, requires transcriptions

Comments: 5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio

Abstract:Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. The analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
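The surface-meaning divergence above is measured with Levenshtein edit distance. As a reminder of what that metric computes, here is a minimal character-level sketch in Python. The example transcript pair is hypothetical and not taken from the paper, and the paper may compute the distance over a different unit (for example words or phonemes).

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical L2 example: what was pronounced vs. what was meant.
surface = "I sink so"
meaning = "I think so"
print(levenshtein(surface, meaning))
```

A larger distance between the two target transcripts means the two outputs of a dual-output model diverge more for that utterance.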

42. 【2606.06058】MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

Link: https://arxiv.org/abs/2606.06058

Authors: Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: group-relative policy optimization, within-group reward distributions, standard group-relative policy, Reinforcement learning, policy optimization

Comments: Accepted to ACL 2026 Main Conference. 14 pages, 9 figures

Abstract:Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
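The z-score pathologies named above are easy to reproduce numerically. The sketch below implements plain z-score group normalization, not MDP-GRPO itself. The `eps` stabilizer, the use of the population standard deviation, and the reward groups are assumptions made for illustration, and the reading of each pathology in the comments is an interpretation of the abstract.

```python
from statistics import mean, pstdev


def zscore_advantages(rewards, eps=1e-6):
    """Standard group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Zero-variance collapse: a homogeneous group yields no learning signal.
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])
print(all_pass, all_fail)

# Mean-centering hides absolute quality: the all-pass and all-fail groups
# above produce identical (all-zero) advantages.

# Low-variance amplification: a reward gap of 0.01 is scaled to roughly
# the same advantage magnitude as a gap of 1.0.
tiny_gap = zscore_advantages([0.90, 0.91, 0.90, 0.91])
large_gap = zscore_advantages([0.0, 1.0, 0.0, 1.0])
print(tiny_gap)
print(large_gap)
```

Seeing the same advantage magnitude for a 0.01 gap and a 1.0 gap shows why discrete, low-dispersion rewards can destabilize updates under this normalization.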

43. 【2606.06047】Automatic Labelling of Speech Translation Errors

Link: https://arxiv.org/abs/2606.06047

Authors: Dominik Macháček, Maike Züfle, Ondrej Klejch

Categories: Computation and Language (cs.CL)

Keywords: translations reduce trustworthiness, speech translations reduce, Speech Translation Error, speech translations, Speech Translation

Abstract:Errors in speech translations reduce the trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol and a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and multimodal LLM Qwen2.5-Omni are able to perform the STEL task at roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.

44. 【2606.06044】IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval

Link: https://arxiv.org/abs/2606.06044

Authors: Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang, Guohang Yan, Song Mao, Botian Shi, Yunshi Lan, Pinlong Cai

Categories: Computation and Language (cs.CL)

Keywords: grounding Large Language, Large Language Models, Large Language, Retrieval-Augmented Generation, grounding Large

Comments: 22 pages, 10 figures, 13 tables. Code available at [this https URL](https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA)

Abstract:Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at this https URL.
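Allen's Interval Algebra, which governs the temporal dependencies above, defines thirteen mutually exclusive relations between two intervals. The sketch below classifies a pair of intervals into one of them. It illustrates the algebra only and is not IA-RAG's implementation; the `(start, end)` tuple representation and the relation names as strings are assumptions.

```python
def allen_relation(a, b):
    """Return the Allen relation that holds from interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")

    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Hypothetical events expressed as (start_year, end_year).
print(allen_relation((2001, 2005), (2005, 2009)))  # a ends exactly when b starts
print(allen_relation((2003, 2004), (2001, 2009)))  # a lies strictly inside b
```

Relations such as "during", "contains", and "overlaps" are the kind of duration, containment, and overlap structure that a single timestamp per fact cannot express.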

45. 【2606.06038】English-to-Prakrit Machine Translation via Multilingual Transfer Learning

Link: https://arxiv.org/abs/2606.06038

Authors: Om Choksi, Smit Kareliya, Shrikant Malviya, Pruthwik Mishra

Categories: Computation and Language (cs.CL)

Keywords: machine translation, low-resource setting, Hindi language tag, Maharashtri Prakrit parallel, Maharashtri Prakrit

Abstract:We study English-to-Prakrit machine translation in a low-resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474-pair Maharashtri Prakrit parallel corpus and evaluation on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are released to the public for further exploration at this https URL.

46. 【2606.06031】NAVIRA: Decoupled Stochastic Remasking for Masked Diffusion Language Models

Link: https://arxiv.org/abs/2606.06031

Authors: Andrey Fomenko, Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko

Categories: Computation and Language (cs.CL)

Keywords: early local dependency, local dependency errors, marginal distributions, iteratively unmasking, step are predicted

Abstract:Masked diffusion language models generate text by iteratively unmasking many tokens in parallel, but this speed comes with a correction problem: tokens generated in the same step are predicted from marginal distributions, and early local dependency errors can later contaminate the context. PRISM addresses this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled: the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. We propose NAVIRA, an inference-time decoding policy that separates these two operations and samples remasking positions stochastically. A first forward pass scores tokens; selected tokens are masked; a second forward pass regenerates from the cleaned context. Temperature-controlled remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. These results show that remasking policy, not only the learned quality signal, is central to reliable masked-diffusion text generation.

47. 【2606.06027】RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

Link: https://arxiv.org/abs/2606.06027

Authors: Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: Community-conditioned language model, language model adaptation, model adaptation requires, adaptation requires choices, Community-conditioned language

Abstract:Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: this https URL.

48. 【2606.06025】EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

Link: https://arxiv.org/abs/2606.06025

Authors: Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Scientific peer review, providing timely feedback, attracted increasing attention, reducing reviewing burdens, Scientific peer

Abstract:Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

49. 【2606.06022】Contextualized Prompting For Stance Detection On Social Media

Link: https://arxiv.org/abs/2606.06022

Authors: Tilman Beck, Shakib Yazdani, Simon Kruschinski, Marcus Maurer, Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: social media, media is challenging, German Twitter stance, Stance detection, detection on social

Abstract:Stance detection on social media is challenging due to short, noisy, and context-dependent language. While large language models (LLMs) show zero-shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task-specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real-world settings. We publish code and data at this https URL.

50. 【2606.06004】The Generator-Eraser Paradox: Community Guidelines for Responsible LLM-Assisted Dialect Resource Creation

Link: https://arxiv.org/abs/2606.06004

Authors: Wajdi Zaghouani

Categories: Computation and Language (cs.CL)

Keywords: cultural preservation, scientific description, computational infrastructure, occupy a unique, unique position

Abstract:Dialect resources occupy a unique position at the intersection of scientific description, cultural preservation, and computational infrastructure. Large language models offer powerful capabilities for accelerating dialect resource development through retrieval-grounded drafting, corpus navigation, metadata enrichment, and annotation workflow support. However, the same systems pose substantial risks: they can contribute to dialect erasure by privileging prestige varieties, homogenizing orthography, and enabling synthetic feedback loops that reduce linguistic diversity over time. These risks are particularly acute for language varieties characterized by diglossia, limited written standardization, or marginalized speaker communities. This paper makes three contributions. First, we integrate insights from variationist sociolinguistics and corpus linguistics to formalize the generator-eraser paradox as a theoretical framework for understanding the dual nature of LLM-assisted dialect work. Second, we derive 12 community guidelines that operationalize this framework into implementable design requirements for dialect resource creation and documentation. Third, we provide an in-depth case study of Arabic dialects, including a structured comparison of widely used resources, to demonstrate how these guidelines address language-specific challenges including diglossia, orthographic variability, and community governance. The contribution is conceptual and operational rather than experimental, with the goal of enabling dialect communities and resource builders across languages to adopt LLMs without sacrificing authenticity, variation, or sovereignty.

51. 【2606.05988】Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Link: https://arxiv.org/abs/2606.05988

Authors: Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Reasoning models produce, models produce long, Reasoning models, encourage verbose student, produce long

Abstract:Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.

52. 【2606.05985】Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Link: https://arxiv.org/abs/2606.05985

Authors: Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: globally diverse settings, diverse settings, increasingly deployed, deployed in globally, globally diverse

Abstract:Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at this https URL.

53. 【2606.05983】Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI

Link: https://arxiv.org/abs/2606.05983

Authors: Alexander Apartsin, Yehudit Aperstein

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: invites cognitive offloading, makes answers easy, Generative AI makes, understanding hard, cognitive offloading

Comments: 18 pages, 4 pages

Abstract:Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.

54. 【2606.05976】The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Link: https://arxiv.org/abs/2606.05976

Authors: Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent work shows, LLM agents struggle, show markedly higher, markedly higher correction, Recent work

Abstract:Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own thought, a user message, a tool response, or a system memory block. Across 13 model-domain cells covering seven model families and three domains ($n=30$ paired tasks per cell), relabeling the claim from thought to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p<0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: memory dominates on math, while a plain user message dominates on logical deduction.

55. 【2606.05970】Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

Link: https://arxiv.org/abs/2606.05970

Authors: Martin Murin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, upstream configuration choices, Large language, clinical free-text notes, output to upstream

Comments: 69 pages, 5 main figures, supplementary material included

Abstract:Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.
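Cross-prompt agreement above is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below computes it for two label sequences and then repeats the calculation after collapsing the three-way flag to binary. The example labels and the collapse rule (everything other than "yes" becomes "not_yes") are assumptions for illustration, not the paper's data or exact procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


def collapse_to_binary(labels):
    """Hypothetical collapse: keep 'yes', merge 'no' and 'not_documented'."""
    return ["yes" if label == "yes" else "not_yes" for label in labels]


# Hypothetical outputs of two prompt variants on the same six notes.
prompt_a = ["yes", "no", "not_documented", "yes", "no", "yes"]
prompt_b = ["yes", "not_documented", "not_documented", "yes", "no", "yes"]

print(cohens_kappa(prompt_a, prompt_b))
print(cohens_kappa(collapse_to_binary(prompt_a), collapse_to_binary(prompt_b)))
```

In this toy example the only disagreement is between "no" and "not_documented", so it disappears once the flag is collapsed, mirroring the absence-versus-silence pattern the abstract describes.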

56. 【2606.05937】Large Language Models are Perplexed by some Political Parties

Link: https://arxiv.org/abs/2606.05937

Authors: Paul Lerner, François Yvon

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Large, Language Models, political applications

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly used, including in political applications, but their political fairness has been little studied. We assess it using perplexity, posing that a fair model should give equal probability to all political groups. However, we find, across ten LLMs and three datasets covering 37 languages, that LLMs are more perplexed by the texts of far right and nationalist parties than of social-democratic parties. We find this to be consistent with previous work on translation fairness, to the point that perplexity correlates with downstream translation metrics. Our method is applicable to both base LLMs as well as their instruction-tuned counterpart, and we find that both are highly correlated, suggesting that the political fairness of LLMs stems from their pretraining, and is hardly affected by instruction-tuning.
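For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-probability per token. The sketch below assumes natural-log token probabilities are already available from some model; the numbers are invented and the helper name is hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical natural-log token probabilities for two texts.
text_a = [math.log(0.5)] * 4   # every token predicted with p = 0.5
text_b = [math.log(0.25)] * 4  # every token predicted with p = 0.25

ppl_a = perplexity(text_a)
ppl_b = perplexity(text_b)
print(ppl_a, ppl_b)
```

A higher value means the model assigns lower probability to the text, which is the sense in which the abstract says a model is "more perplexed" by one group's texts than another's.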

57. 【2606.05936】Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

Link: https://arxiv.org/abs/2606.05936

Authors: Marco Antonio Stranisci, A Pranav, Rossana Damiano, Christian Hardmeier, Anne Lauscher

Categories: Computation and Language (cs.CL)

Keywords: Modern language models, suppress undesirable outputs, language models rely, remove undesirable content, Modern language

Comments

Click to view abstract

Abstract:Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.

58. 【2606.05931】To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Link: https://arxiv.org/abs/2606.05931

Authors: Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Keywords: voice and face, retrieving a person, unlike curated benchmarks, real-world broadcast archives, Abstract

Comments: INTERSPEECH 2026

Click to view abstract

Abstract:When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
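The "64% of the gap" figure can be checked with simple arithmetic. The sketch below assumes the gap is measured from the fixed-fusion baseline (90.0%) to the oracle (96.6%); the abstract does not state the baseline explicitly, but this choice reproduces the reported number. The function name is hypothetical.

```python
def gap_recovered(baseline, system, oracle):
    """Fraction of the baseline-to-oracle gap closed by the system."""
    return (system - baseline) / (oracle - baseline)


# P@1 values quoted in the abstract (percent).
fixed_fusion, adaptive, oracle = 90.0, 94.2, 96.6

recovered = gap_recovered(fixed_fusion, adaptive, oracle)
print(f"{recovered:.1%}")
```

Measured from the face-only system (93.4%) instead, the same formula would give a much smaller fraction, which is why the fixed-fusion reading seems the more likely one.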

59. 【2606.05924】Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

Link: https://arxiv.org/abs/2606.05924

Authors: Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: poses unique challenges, unique challenges due, balance expression fluency, translation poses unique, high-quality annotated data

Comments: Accepted by ACL 2026 Industry

Click to view abstract

Abstract:Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).

60. 【2606.05922】Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Link: https://arxiv.org/abs/2606.05922

Authors: Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: solve complex problems, workflows to solve, solve complex, complex problems, harness

Comments: Code: [this https URL](https://github.com/wbopan/retro-harness); Project website: [this https URL](https://paper-rho.wenbo.io)

Click to view abstract

Abstract:AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

61. 【2606.05920】Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Link: https://arxiv.org/abs/2606.05920

Authors: Xin Wang, Liangtai Sun, Yaoming Zhu, Shuang Zhou, Jiaxing Liu, Fengjiao Chen, Lin Qiu, Xuezhi Cao, Xunliang Cai, Licheng Zhang, Zhendong Mao

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Existing code-generation benchmarks, Existing code-generation, code-generation benchmarks score, one-shot output, score a single

Comments: under review

Click to view abstract

Abstract:Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.

62. 【2606.05917】MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

Link: https://arxiv.org/abs/2606.05917

Authors: Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Vision-Language Models, lengthy video contexts, answering remains challenging, remains challenging, challenging for Vision-Language

Comments: 21 pages, 8 figures

Click to view abstract

Abstract:Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at this https URL.

63. 【2606.05906】ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

Link: https://arxiv.org/abs/2606.05906

Authors: Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang

Categories: Computation and Language (cs.CL)

Keywords: executable SQL queries, maps natural language, natural language questions, SQL queries, maps natural

Comments

Click to view abstract

Abstract:Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at this https URL.
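One step in this abstract, deriving a retrieval target from "the column set most frequently associated with execution-correct rollouts", can be sketched as a simple frequency count. This is only an illustration of that sentence, not the authors' implementation; the rollout record layout and the function name are assumptions.

```python
from collections import Counter


def adaptive_retrieval_target(rollouts):
    """Return the column set seen most often among execution-correct rollouts.

    Each rollout is a dict with hypothetical keys:
      "columns": iterable of column names used by the generated SQL
      "execution_correct": bool, whether the SQL produced the right result
    """
    counts = Counter(
        frozenset(r["columns"]) for r in rollouts if r["execution_correct"]
    )
    if not counts:
        return None  # no correct rollout, so no target can be derived
    return counts.most_common(1)[0][0]


# Hypothetical rollouts for one question.
rollouts = [
    {"columns": ["orders.id", "orders.total"], "execution_correct": True},
    {"columns": ["orders.total", "orders.id"], "execution_correct": True},
    {"columns": ["orders.id", "users.name"], "execution_correct": False},
    {"columns": ["orders.id"], "execution_correct": True},
]

target = adaptive_retrieval_target(rollouts)
print(sorted(target))
```

Using `frozenset` makes the count insensitive to column order, so the first two rollouts are treated as the same column set.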

64. 【2606.05901】Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

Link: https://arxiv.org/abs/2606.05901

Authors: Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language Processing, Large language models, landscape of Natural, Large language, Language Processing

Comments

Click to view abstract

Abstract:Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

65. 【2606.05895】Representing Research Attention as Contextually Structured Flows

Link: https://arxiv.org/abs/2606.05895

Authors: Jessica Rodrigues, Angelo Salatino, Gard Jenset, Scott Hale

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: indicator of visibility, societal uptake, aggregated counts, attention, typically represented

Comments: Accepted at STi 2026 - International Conference on Science and Technology Indicators

Click to view abstract

Abstract:Research attention is widely used as an indicator of visibility, influence, and societal uptake, yet it is typically represented as aggregated counts that do not preserve how attention develops across contexts over time. This creates a mismatch between how attention is interpreted and how it is represented. We propose attention flows as contextually structured representations that encode the organisation of attention and its evolution over time. We evaluate whether these representations capture transferable structure by constructing a benchmark based on analogy-style reasoning across research outputs. Comparing signal, sequence, and flow-based representations, we find that flow representations more effectively support structural comparison, particularly in settings where attention is shaped by temporal progression or context distributions. We further show that learned flow representations improve robustness under partial observation and structural perturbation. Overall, these results support modelling attention as a contextually structured phenomenon and provide a basis for more informative approaches to research evaluation.

66. 【2606.05894】EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents

Link: https://arxiv.org/abs/2606.05894

Authors: Yilong Li, Suman Banerjee, Tong Che

Categories: Computation and Language (cs.CL)

Keywords: archive large histories, context costs, agents can archive, archive large, future answers

Comments

Click to view abstract

Abstract:Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw history. We study budgeted evidence survival: before the query is known, which source evidence should be retained so that it remains recoverable and usable under a fixed retained source-evidence token budget? We instantiate this setting as Budgeted Pre-Query Retention, where memory is written during ingestion and later read without access to the full raw stream. We introduce EMBER, a learned retention policy that constructs a compact, source-backed evidence state. EMBER stores evidence capsules: verbatim source excerpts paired with retrieval keys and update metadata, preserving both grounding and read-time access. Post-query outcome feedback trains the writer to preserve evidence across the ingestion-retrieval-answer chain. On LongMemEval-RR, our LongMemEval-derived retained-evidence protocol, EMBER-14B reaches 0.3017 F1 at the 8192-token retained-evidence comparison point, compared with 0.1765 for the strongest non-EMBER budgeted baseline. Across retained source-evidence budgets, EMBER improves F1, Retain-Recall, and Read-Recall, indicating that long-horizon memory depends on retaining evidence within the budget rather than rereading larger histories.

67. 【2606.05890】Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations

Link: https://arxiv.org/abs/2606.05890

Authors: Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Artificial Moral Advisors, Moral Advisors, Artificial Moral, deployed as Artificial, variety of contexts

Comments

Click to view abstract

Abstract:LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors "stay with the uncertainty". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.

68. 【2606.05889】GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

Link: https://arxiv.org/abs/2606.05889

Authors: Jaehoon Kang, Yejin Lee, Kyuhong Shim

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: framework for composable, composable acoustic style, Relative Policy Optimization, zero-shot autoregressive, Group Relative Policy

Comments

Click to view abstract

Abstract:We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
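The "linear LoRA arithmetic" mentioned in this abstract can be pictured as adding scaled low-rank updates to a frozen weight matrix. The sketch below uses tiny pure-Python matrices; the adapter values, the scale factors, and the function names are invented for illustration and say nothing about how GLASS itself is implemented.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def compose(base, scaled_adapters):
    """Return base + sum(scale * (B @ A)) without modifying base."""
    out = [row[:] for row in base]
    for scale, (B, A) in scaled_adapters:
        delta = matmul(B, A)  # low-rank weight update of one adapter
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                out[i][j] += scale * value
    return out


# Frozen 2x2 backbone weight and two hypothetical rank-1 adapters.
base = [[1.0, 0.0], [0.0, 1.0]]
rate_adapter = ([[1.0], [0.0]], [[1.0, 0.0]])   # (B, A) for speaking rate
pitch_adapter = ([[0.0], [1.0]], [[0.0, 2.0]])  # (B, A) for pitch

# Interpolate the rate adapter at half strength and add the pitch adapter.
merged = compose(base, [(0.5, rate_adapter), (1.0, pitch_adapter)])
print(merged)
```

Changing a scale between 0 and 1 interpolates a single control, while listing several adapters composes them, and the backbone weights stay untouched throughout.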

69. 【2606.05874】Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.05874

Authors: Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

Categories: Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, leaving model behavior, scenarios largely underexplored

Comments

Click to view abstract

Abstract:Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, including RI, BCI, BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% from the ideal one quarter baseline and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
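The abstract does not define RI, BCI, or BII, so the sketch below is not one of the paper's metrics. It only shows a generic normalized-entropy score over repeated choices, as one way to see how far a 97% top-1 pick among four options sits from uniform behavior. The sample counts and the function name are made up.

```python
import math
from collections import Counter


def normalized_entropy(choices, num_options):
    """Shannon entropy of the empirical choice distribution, scaled to [0, 1]."""
    if not choices or num_options < 2:
        raise ValueError("need at least one choice and two options")
    n = len(choices)
    counts = Counter(choices)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_options)


# Hypothetical outcomes of 100 "pick one at random" trials with 4 options.
uniform_picks = ["A"] * 25 + ["B"] * 25 + ["C"] * 25 + ["D"] * 25
collapsed_picks = ["A"] * 97 + ["B", "C", "D"]

uniform_score = normalized_entropy(uniform_picks, 4)
collapsed_score = normalized_entropy(collapsed_picks, 4)
print(uniform_score, collapsed_score)
```

A score near 1 indicates choices spread evenly across the options, and a score near 0 indicates that one option absorbs almost all picks.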

70. 【2606.05868】YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

Link: https://arxiv.org/abs/2606.05868

Authors: PSBC LLM Team, Huawei LLM Team: Ruihan Long, Junjie Wu, Tianan Zhang, Duo Zhang, Yaozong Wu, Jinbin Fu, Chang Liu, Zhentao Tang, Wenshuang Yang, Xin Wang, Zhihao Song, Ning Huang, Wenjing Xu, Shuai Zong, Shupei Sun, Sen Wang, Jing Hu, Bin Wang, Xinyu Wang, Junkui Ju, Zequn Ding, Jie Ran, Man Luo, Shixiong Kai, Linkai Hou, Kaichao Liang, Hu Zhao, Yang Zhao, Shucheng Lin, Wei Yu, Chenghan Jiang, Jingjing Ding, Jiahui Zhang, Tian Jin, Yuhang Zhang, Dong Guo, Wei Sun, Jun Xie, Jianwei Li, Lei Cao, Pei Li, Jiabin Li, Jia Yuan, Rui Yuan, Jing Zhu, Mingxuan Yuan, Zhangcheng Lv, Xin Jiang, Xiuhong Fei, Xiaozhe Ren, Yulong Li, Zhipeng Zhang, Hang Wang, Zhaohui Xu, Rui Zhao, Yibo He, Xinzhuang Niu

Categories: Computation and Language (cs.CL)

Keywords: cache memory overhead, Large language models, inflates infrastructure costs, drive significant financial, significant financial innovations

Comments

Click to view abstract

Abstract:Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

71. 【2606.05864】Analysis of the Neglect-Zero Effect in Large Language Models

Link: https://arxiv.org/abs/2606.05864

Authors: Jin Tanaka, Daiki Matsuoka, Ryoma Kumon, Hitomi Yanaka

Categories: Computation and Language (cs.CL)

Keywords: cognitive bias called, human cognitive processes, resembles human cognitive, human cognitive bias, human cognitive

Comments: 14 pages (10 pages main text), 8 figures. To appear in the Proceedings of the ACL2026 Student Research Workshop (SRW)

Click to view abstract

Abstract:We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the $\textit{neglect-zero effect}$. This effect refers to the human tendency to ignore $\textit{zero-models}$, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on $\textit{structural priming}$, where recent exposure to a preceding sentence (the $\textit{prime}$) facilitates the processing of a subsequent sentence (the $\textit{target}$) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at this https URL

72. 【2606.05859】TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

Link: https://arxiv.org/abs/2606.05859

Authors: Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, enabling more expressive, promising alternative, large language

Comments: 18 pages, 12 figures. Code available at [this https URL](https://github.com/NKU-LITI/TARPO-master)

Click to view abstract

Abstract:Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at this https URL.
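The "shared group-relative advantage signal" is not spelled out in the abstract. A common reading, borrowed from GRPO-style training and assumed here, is that each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below shows only that normalization, with invented rewards and a hypothetical function name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own rollout group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary rewards for four rollouts of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Under this reading, the same per-rollout advantage would weight both the token-sampling decisions and the router's mode choices, which is what "shared" appears to mean in the abstract.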

73. 【2606.05858】ReverseEOL: Improving Training-free Text Embeddings via Text Reversal in Decoder-only LLMs

Link: https://arxiv.org/abs/2606.05858

Authors: Ailiang Lin, Zhuoyun Li, Yusong Wang, Keyu Mao, Kotaro Funakoshi, Manabu Okumura

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, Recent advances, advances in Large

Comments

Click to view abstract

Abstract:Recent advances in Large Language Models (LLMs) have opened new avenues for generating training-free text embeddings. However, the causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, leading to biased contextualized representations. In this work, we propose Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple yet effective method for enhancing the representational capability of frozen LLMs. ReverseEOL augments the standard forward embedding with an additional reversed embedding derived from the reversed input text. Since reversing the input exposes each token to context inaccessible in the original order, the resulting reversed embedding effectively provides complementary information to the original one. As a result, combining the forward and reversed embeddings yields a richer final representation. Comprehensive experiments on STS and MTEB benchmarks demonstrate that ReverseEOL significantly improves the performance of existing training-free baselines across a broad range of LLMs with diverse architectures and scales. Extensive ablations and analyses further confirm the necessity of our reversal mechanism.

74. 【2606.05857】Forgive or forget: Understanding the context of hate in audio retrieval systems

Link: https://arxiv.org/abs/2606.05857

Authors: Arghya Pal, Sailaja Rajanala, Raphael C.-W. Phan, Shekhar Nayak

Categories: Computation and Language (cs.CL)

Keywords: systems is challenging, contextual dependencies, Handling toxic, Handling toxic retrieval, challenging due

Comments

Click to view abstract

Abstract:Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment-controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model-agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re-ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.

75. 【2606.05846】owards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Link: https://arxiv.org/abs/2606.05846

Authors: Gio Paik, Hyunseo Shin, Soungmin Lee

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Automatic Speech Recognition, Automatic Speech, Speech Recognition, language pairs, technology for human

Comments: ICML 2026 Workshop on Machine Learning for Audio

Abstract:Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

76. 【2606.05843】Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

Link: https://arxiv.org/abs/2606.05843

Authors: Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Large Language, noisy contexts remain, contexts remain opaque

Abstract:While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.

77. 【2606.05836】ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

Link: https://arxiv.org/abs/2606.05836

Authors: Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li, Zhizhen Yu, Xuan Yi, Chen Hou, Defeng Xie, Chao Hu, Minfeng Zhu, Dazhen Deng, Haozhe Feng, Danqing Huang, Yingcai Wu, Peng Chen, Wei Chen

Categories: Computation and Language (cs.CL)

Keywords: databases remains challenging, enterprise-scale databases remains, substantially advanced, remains challenging, SQL

Comments: 24 pages, 12 figures

Abstract:Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL-Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.

78. 【2606.05828】Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

Link: https://arxiv.org/abs/2606.05828

Authors: Zeyu Gan, Huayi Tang, Yong Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, locally deployed personal, Language Model, API-based remote models

Abstract:As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.

79. 【2606.05804】Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

Link: https://arxiv.org/abs/2606.05804

Authors: Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura

Categories: Computation and Language (cs.CL)

Keywords: large language model, Prompted knowledge cutoff, Prompted knowledge, date were unavailable, instructs a large

Abstract:Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is only causally related to the question. To address this limitation, we propose two recall-based prompting strategies: Self-Recall (SR), which asks the model to restate its cutoff constraint, and Question-Recall (QR), which requires the model to recall question-relevant information valid under the cutoff. Across three existing benchmarks, our methods outperform both direct-answer prompting and conventional step-by-step reasoning baselines, with particularly strong improvements on counterfactual questions. To investigate robustness across different cutoff settings, we further construct the Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under multiple cutoff years. Results show that knowledge cutoff performance varies with cutoff distance, while combining SR and QR consistently yields the best performance.

80. 【2606.05799】CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

Link: https://arxiv.org/abs/2606.05799

Authors: Mohammad Anas Jawad, Cornelia Caragea

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Existing calibration methods, Large Language, Existing calibration, dimension of trustworthiness

Abstract:Existing calibration methods for Large Language Models (LLMs) often overlook a critical dimension of trustworthiness: a model's behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce CaliDist, a novel post-hoc calibration approach that directly measures and penalizes a model's susceptibility to distraction. CaliDist quantifies how an LLM's predictions and uncertainty change when its input prompt is perturbed with semantic distractors. This stability (or lack thereof) signal is then used to adaptively scale the model's initial confidence score. Our extensive experiments on seven Natural Language Understanding classification benchmarks using six distinct LLMs show that CaliDist consistently achieves lower Expected Calibration Error (ECE) and Brier Score compared with strong baselines. Remarkably, our method reduces the ECE from 23% to 7% on average (a relative improvement of 70%), demonstrating that behavioral stability is a powerful signal for calibration. We make our code and datasets available at this http URL.
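For readers unfamiliar with the metric, the sketch below computes Expected Calibration Error with equal-width confidence bins and checks the arithmetic behind the reported relative improvement. The `stability_scaled_confidence` function is a hypothetical stand-in for scaling confidence by robustness to distractors; the abstract does not give CaliDist's actual scaling rule, so that part is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(n_bins - 1, int(conf * n_bins))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def stability_scaled_confidence(conf, clean_label, distracted_labels):
    """Hypothetical rule: multiply confidence by the fraction of
    distractor-perturbed prompts whose prediction matches the clean one."""
    if not distracted_labels:
        return conf
    agree = sum(1 for label in distracted_labels if label == clean_label)
    return conf * agree / len(distracted_labels)


# Four predictions at 0.9 confidence, three of them correct: the gap is 0.15.
toy_ece = expected_calibration_error([0.9] * 4, [True, True, True, False])

# Relative improvement implied by an ECE drop from 23% to 7%.
relative_improvement = (0.23 - 0.07) / 0.23

# A prediction that flips under one of four distractors loses a quarter of its confidence.
scaled = stability_scaled_confidence(0.8, "pos", ["pos", "pos", "neg", "pos"])
```

The relative improvement works out to 16/23, roughly 69.6%, which rounds to the 70% quoted in the abstract.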

81. 【2606.05793】CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

Link: https://arxiv.org/abs/2606.05793

Authors: Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge, Haotian Shi, Liang Dou, Xiangfeng Wang, Jingwen Yang, Aimin Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: partners remains challenging, realistic human partners, human partners remains, LLM-based agents excel, remains challenging

Comments: Accepted by ICML 2026

Abstract:While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

82. 【2606.05761】SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Link: https://arxiv.org/abs/2606.05761

Authors: Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: accumulate large collections, Persistent AI assistants, accumulate large, large collections, collections of related

Comments: 48 pages

Abstract:Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.

83. 【2606.05749】MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

Link: https://arxiv.org/abs/2606.05749

Authors: Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Iterative retrieval-reasoning agents, recently shown promise, Iterative retrieval-reasoning, long-document question answering, question answering

Abstract:Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.

84. 【2606.05748】UNIVID: Unified Vision-Language Model for Video Moderation

Link: https://arxiv.org/abs/2606.05748

Authors: Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, Kenan Xiao

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: support downstream enforcement, Global-scale video moderation, fine-grained multi-modal reasoning, Global-scale video, dual challenge

Comments: 7 pages, 3 figures. Accepted to ACL 2026 Industry Track

Abstract:Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycled extensive computation resources while reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.

85. 【2606.05744】PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

Link: https://arxiv.org/abs/2606.05744

Authors: Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang

Categories: Computation and Language (cs.CL)

Keywords: translating planning objectives, spatial planning map, planning map interpretation, Spatial planning, public communication

Abstract:Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at this https URL.

86. 【2606.05743】Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Link: https://arxiv.org/abs/2606.05743

Authors: Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim, Jungmin Son, Yunseung Lee, Jaegul Choo, Youngjun Kwak

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: large language models, language models remain, models remain vulnerable, large language, language models

Abstract:Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine-tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory-based guardrails tend to over-refuse benign queries that resemble stored attacks. We propose Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM): each cell pairs the conditions for blocking a harmful query with those for permitting a superficially similar benign request. Without retraining, Membrane evolves CSM by distilling each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. Across model-level safety on HarmBench and agent-level safety on AgentHarm, Membrane achieves the highest F1 on all six jailbreak attacks. Notably, benign refusal on AgentHarm stays at 7-14%, well below the 28-85% range of prior guards. Memory cells also retain 87-88% F1 under cross-attack transfer and remain stable under memory poisoning.

87. 【2606.05742】AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Link: https://arxiv.org/abs/2606.05742

Authors: Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

Categories: Computation and Language (cs.CL)

Keywords: Speculative decoding accelerates, verifying multiple drafted, multiple drafted tokens, reducing sequential decoding, sequential decoding iterations

Abstract:Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to 3.10× decoding speedup.
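The lexically anchored, single-span reuse that the abstract treats as the baseline can be sketched in a few lines. This is only an illustration of n-gram lookup drafting, not AdaPLD itself: the function names, the `ngram` and `max_draft` parameters, and the stand-in for the target model's output are assumptions.

```python
def lookup_draft(tokens, ngram=2, max_draft=4):
    """Draft tokens by copying what followed the most recent earlier
    occurrence of the current trailing n-gram (exact lexical match)."""
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + max_draft]
    return []  # No match: fall back to ordinary one-token decoding.


def accepted_prefix(draft, target_next):
    """Number of leading draft tokens the target model agrees with.
    `target_next` stands in for what a real target model would emit."""
    count = 0
    for drafted, actual in zip(draft, target_next):
        if drafted != actual:
            break
        count += 1
    return count


context = "the cat sat on the mat and the cat".split()
draft = lookup_draft(context)
n_accepted = accepted_prefix(draft, ["sat", "on", "a", "mat"])
```

Here the trailing bigram "the cat" matches the start of the context, so the copied span is "sat on the mat", and the toy target agrees with the first two tokens. The same example shows both limitations named in the abstract: a paraphrased context would produce no exact match at all, and a single copied span is wasted as soon as the continuation diverges.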

88. 【2606.05734】When AI Says It Feels

Link: https://arxiv.org/abs/2606.05734

Authors: Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, post-training processes, generally constrained, constrained from expressing

Comments: 15 pages, 2 figures

Abstract:Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.

89. 【2606.05728】DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

Link: https://arxiv.org/abs/2606.05728

Authors: Yansi Li, Zhuosheng Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Generating executable tool, plans requires selecting, large solution space, Generating executable, exponentially large solution

Comments: Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings

Abstract:Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at this https URL.

90. 【2606.05725】An Embarrassingly Simple Detector for Model Extraction Attacks in Large Language Model API Traffic

Link: https://arxiv.org/abs/2606.05725

Authors: Shuze Liu, Qianwen Guo, Yushun Dong

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Large language models, Large language, making model extraction, service security, increasingly deployed

Comments: Preprint. Code available at https://github.com/LabRAI/mmd-llm-mea-detection

Abstract:Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: this https URL.
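The detection recipe in the abstract (embed a window of queries, compare its distribution with historical benign traffic using MMD, and set the threshold from benign-vs-benign comparisons only) can be illustrated with the minimal sketch below. It is not the authors' released code: the 2-D Gaussian "embeddings", the RBF kernel with `gamma=1.0`, the window size, and the quantile-based threshold are all assumptions made for the example.

```python
import math
import random


def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def calibrate_threshold(benign, window, trials, quantile, rng, gamma=1.0):
    """Pick the threshold from benign-vs-benign window comparisons only."""
    scores = []
    for _ in range(trials):
        sample = rng.sample(benign, 2 * window)
        scores.append(mmd2(sample[:window], sample[window:], gamma))
    scores.sort()
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]


def is_suspicious(window_embs, reference, threshold, gamma=1.0):
    """Flag a traffic window whose distribution deviates from the reference."""
    return mmd2(window_embs, reference, gamma) > threshold


rng = random.Random(0)
# Toy stand-ins for query embeddings: benign near the origin, attacker shifted.
benign = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
attack = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)]
reference = benign[:30]
threshold = calibrate_threshold(benign, window=30, trials=50, quantile=0.99, rng=rng)
flagged = is_suspicious(attack, reference, threshold)
```

Because the threshold comes only from benign windows, no attacker data is needed at calibration time; the quantile sets the tolerated false-positive rate on benign traffic.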

91. 【2606.05724】Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

Link: https://arxiv.org/abs/2606.05724

Authors: Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia, Zequn Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evolving story worlds, Long-form narrative, changing character states, isolated passages, answers may depend

Abstract:Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units (chunks, entities, relations, summaries, or tool actions) do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

92. 【2606.05716】Interpreting Style Representations via Style-Eliciting Prompts

Link: https://arxiv.org/abs/2606.05716

Authors: Junghwan Kim, David Jurgens

Categories: Computation and Language (cs.CL)

Keywords: learned representations makes, modeling writing style, Style, difficult to interpret, powerful tool

Comments: Accepted to ACL 2026 Findings

Abstract:Style representation learning is a powerful tool for authorship analysis and modeling writing style, yet the latent nature of learned representations makes them difficult to interpret. Recent work has attempted to explain these representations by generating natural language descriptions with large language models (LLMs) conditioned on input text. However, such descriptions are often prone to the LLM's biases and hallucinations, and they lack an explicit objective and practical utility. In this work, we propose a novel framework for interpreting style representations through style-eliciting prompts: natural language instructions designed to steer LLMs to generate text that reflects specific stylistic attributes. We curate 1,010 distinct style features spanning 26 stylistic categories and construct a dataset by prompting an LLM to generate text conditioned on these features. Using this data, we train a decoder to generate a style prompt from the style representation of the generated text. We evaluate our approach on three tasks: (1) recovering original style prompts from generated text, (2) generating text in the same style using the recovered prompts, and (3) steering LLM outputs to match the style of human-written texts. Experiments demonstrate that our method consistently outperforms strong baselines that directly prompt LLMs with target text, achieving superior performance in both style description and style imitation. These results highlight that style-eliciting prompts can provide a practical and interpretable interface to stylistic information encoded in style representations.

93. 【2606.05711】Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

Link: https://arxiv.org/abs/2606.05711

Authors: Yingzhuo Liu

Categories: Computation and Language (cs.CL)

Keywords: Multi-agent systems built, large language models, tackling complex reasoning, Multi-agent systems, tool-use tasks

Abstract:Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

94. 【2606.05698】Rethinking LoRA Memory Through the Lens of KV Cache Compression

Link: https://arxiv.org/abs/2606.05698

Authors: Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme

Categories: Computation and Language (cs.CL)

Keywords: Parametric retrieval augmentation, retrieval augmentation encodes, information into lightweight, document-specific modules, retrieval augmentation

Abstract:Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document, and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token-prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.

95. 【2606.05688】Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

Link: https://arxiv.org/abs/2606.05688

Authors: Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: makes quantization essential, practical deployment, efficiently by activating, large number, parameters still makes

Comments: 8 pages, 1 figure

Abstract:Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), a MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior under quantization. VSRAQ combines two complementary objectives that jointly preserve expert-selection behavior: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
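
To make the routing-instability argument concrete, here is a minimal Python sketch; it is not the authors' implementation. The abstract does not give VSRAQ's loss formulas, so `value_loss` (mean squared error on router logits) and `structure_loss` (a hinge margin at the top-k boundary) are assumed forms, and every function name and number below is illustrative.

```python
from typing import List, Set


def top_k(logits: List[float], k: int) -> Set[int]:
    """Indices of the k largest router logits (ties broken by lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return set(order[:k])


def routing_agreement(fp: List[float], q: List[float], k: int) -> float:
    """Fraction of the full-precision top-k experts kept after quantization."""
    return len(top_k(fp, k) & top_k(q, k)) / k


def value_loss(fp: List[float], q: List[float]) -> float:
    """Value alignment (assumed form): mean squared error between logits."""
    return sum((a - b) ** 2 for a, b in zip(fp, q)) / len(fp)


def structure_loss(fp: List[float], q: List[float], k: int,
                   margin: float = 0.1) -> float:
    """Structure alignment (assumed form): hinge penalty whenever a reference
    top-k expert fails to beat a reference non-top-k expert by `margin`
    under the quantized logits."""
    keep = top_k(fp, k)
    drop = [i for i in range(len(fp)) if i not in keep]
    return sum(max(0.0, margin - (q[i] - q[j])) for i in keep for j in drop)


fp_logits = [2.0, 1.05, 1.0, -0.5]   # full-precision router output (made up)
q_logits = [1.9, 0.98, 1.02, -0.4]   # same router after a small perturbation

agreement = routing_agreement(fp_logits, q_logits, k=2)
```

In this toy case no logit moves by more than about 0.1, yet expert 2 displaces expert 1 in the top-2 set, so only half of the original routing survives. The structure term is non-zero because of exactly that swap, while the value term stays tiny.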

96. 【2606.05677】LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Link: https://arxiv.org/abs/2606.05677

Authors: Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, longer visual inputs

Abstract:Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

97. 【2606.05671】QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

Link: https://arxiv.org/abs/2606.05671

Authors: Dike Sun, Zheng Zou, Jingtong Zang, Qi Sun, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Categories: Computation and Language (cs.CL)

Keywords: e-commerce search aims, users' potential interests, match users' potential, proactively suggest queries, potential interests

Abstract:Query recommendation in e-commerce search aims to proactively suggest queries that match users' potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users' downstream preferences. This mismatch often leads to high query click-through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large-scale production platform, QueryAgent-R1 improves Query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.

98. 【2606.05661】Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Link: https://arxiv.org/abs/2606.05661

Authors: Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: attracted substantial interest, high-quality benchmark exists, Continual learning, Continual Learning Bench, substantial interest

Abstract:Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

99. 【2606.05647】Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

Link: https://arxiv.org/abs/2606.05647

Authors: Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: gaining broader access, codebases and tools, increasingly embedded, gaining broader, broader access

Comments: 34 pages, 30 figures, 3 tables

Abstract:AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

100. 【2606.05634】Bootstrapping Semantic Layer from Execution for Text-to-SQL

Link: https://arxiv.org/abs/2606.05634

Authors: Youngwon Lee, Jaejin Kim, Seung-won Hwang

Categories: Computation and Language (cs.CL)

Keywords: under-specified until user, user phrases, database stores, Grounding After Test, GATE

Abstract:Real-world text-to-SQL is often under-specified until user phrases are grounded in how the database stores values. Prior work attempts to address this by requiring a semantic layer to specify groundings in advance, but such specifications are often incomplete, especially in expert domains where domain-specific conventions are under-documented. As this leaves multiple grounding hypotheses open for the same SQL part, we introduce GATE (Grounding After Test from Execution), which bootstraps missing groundings from execution feedback. GATE keeps grounding hypotheses open while executing the already grounded parts to obtain observations. Then, only the hypothesis supported by that observation is grounded and stored as a memory entry, recording what was tested and how the open part should be written in SQL. These entries accumulate into execution-grounded memory, allowing later steps to reuse supported groundings. Across real-world and controlled benchmarks, GATE consistently improves over strong baselines, demonstrating that execution can serve not only as validation but also as a bootstrapping mechanism for reusable memory in text-to-SQL.

101. 【2606.05626】When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer

Link: https://arxiv.org/abs/2606.05626

Authors: Zhen Sun, Yifan Liao, Zhicong Huang, Jiaheng Wei, Cheng Hong, Yutao Yue, Xinlei He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Machine-generated text, providing fine-grained evidence, specific generator responsible, lifelong MGT attribution, misuse investigation

Comments: 12 pages

Abstract:Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.
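
The replay-free, closed-form update can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not RidgeFT itself: it assumes one-hot targets and the textbook ridge solution W = (G + λI)⁻¹C, where G is the accumulated Gram matrix of frozen-encoder features and C holds one per-class feature sum per column. The covariance calibration and random-feature steps from the abstract are omitted, and the class `RidgeClassifier`, the helper `solve`, and the 2-D features are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class RidgeClassifier:
    """Keeps only sufficient statistics, never the raw training features."""

    def __init__(self, dim: int, lam: float = 1.0) -> None:
        self.dim, self.lam = dim, lam
        self.gram = [[0.0] * dim for _ in range(dim)]  # sum of phi phi^T
        self.class_sums: Dict[str, Vector] = {}        # sum of phi per class

    def add_class(self, name: str, feats: List[Vector]) -> None:
        """Fold a newly observed class into the stored statistics."""
        total = [0.0] * self.dim
        for phi in feats:
            for i in range(self.dim):
                total[i] += phi[i]
                for j in range(self.dim):
                    self.gram[i][j] += phi[i] * phi[j]
        self.class_sums[name] = total

    def weights(self) -> Matrix:
        """W = (G + lam * I)^-1 C, with one column of C per class."""
        names = list(self.class_sums)
        reg = [[self.gram[i][j] + (self.lam if i == j else 0.0)
                for j in range(self.dim)] for i in range(self.dim)]
        c = [[self.class_sums[n][i] for n in names] for i in range(self.dim)]
        return solve(reg, c)

    def predict(self, phi: Vector) -> str:
        names, w = list(self.class_sums), self.weights()
        scores = [sum(phi[i] * w[i][k] for i in range(self.dim))
                  for k in range(len(names))]
        return names[max(range(len(names)), key=scores.__getitem__)]


clf = RidgeClassifier(dim=2, lam=0.1)
clf.add_class("gen_a", [[1.0, 0.0], [0.9, 0.1]])
clf.add_class("gen_b", [[0.0, 1.0], [0.1, 0.9]])
clf.add_class("gen_c", [[-1.0, -1.0], [-0.9, -1.1]])  # arrives later, no replay
```

Because the Gram matrix and class sums are additive, adding `gen_c` later gives the same weights as fitting all three classes at once, which is why no exemplars from `gen_a` or `gen_b` need to be kept.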

102. 【2606.05622】AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

Link: https://arxiv.org/abs/2606.05622

Authors: Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji

Categories: Computation and Language (cs.CL)

Keywords: disclosed through interaction, Large Language Model, constraints, real-world problems, fully specified upfront

Abstract:Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
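
The multi-turn protocol described above, where a hidden constraint is revealed only when a proposed plan violates it, can be sketched as a small loop. Everything here is a toy assumption for illustration: the function names, the rule-based `toy_agent`, and the tea-making task are not taken from AdaPlanBench.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
Constraint = Tuple[str, Callable[[Plan], bool]]  # (description, is_satisfied)


def first_violation(plan: Plan, hidden: List[Constraint]) -> Optional[str]:
    """Return the description of the first hidden constraint the plan breaks."""
    for description, is_satisfied in hidden:
        if not is_satisfied(plan):
            return description
    return None


def run_episode(agent: Callable[[List[str]], Plan],
                hidden: List[Constraint],
                max_turns: int = 5) -> Tuple[bool, List[str]]:
    """Ask for a plan, reveal one violated constraint, and repeat."""
    revealed: List[str] = []
    for _ in range(max_turns):
        plan = agent(revealed)
        violated = first_violation(plan, hidden)
        if violated is None:
            return True, revealed        # plan satisfies every constraint
        revealed.append(violated)        # feedback accumulates across turns
    return False, revealed


def toy_agent(revealed: List[str]) -> Plan:
    """Hypothetical re-planner: swaps steps based on the feedback seen so far."""
    heat = ("boil water in pot" if "kettle is broken" in revealed
            else "boil water in kettle")
    sweeten = [] if "user wants no sugar" in revealed else ["add sugar"]
    return [heat, "steep tea"] + sweeten


hidden_constraints: List[Constraint] = [
    # World constraint: the environment makes one step impossible.
    ("kettle is broken", lambda plan: "boil water in kettle" not in plan),
    # User constraint: a preference the user never stated upfront.
    ("user wants no sugar", lambda plan: "add sugar" not in plan),
]

solved, feedback = run_episode(toy_agent, hidden_constraints)
```

The toy agent needs three turns here: the first plan trips the world constraint, the second trips the user constraint, and the third passes. A real agent has to infer the fix from free-form feedback instead of a lookup, which is the part the benchmark measures.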

103. 【2606.05620】An ERP Study on Recursive Locative Processing in Mandarin-Speaking Children with Autism

Link: https://arxiv.org/abs/2606.05620

Authors: Xiaoyi Wang, Chenxi Fu, Ziman Zhuang, Caimei Yang

Categories: Computation and Language (cs.CL)

Keywords: hierarchical linguistic structures, Recursion enables, imposes substantial processing, ASD, real-time comprehension

Abstract:Recursion enables the generation of hierarchical linguistic structures but imposes substantial processing demands during real-time comprehension. While difficulties with complex syntax have been reported in autism spectrum disorder (ASD), the temporal dynamics of recursive processing remain poorly understood. This study used event-related potentials (ERPs) to examine how Mandarin-speaking children with ASD process two-level recursive locative constructions. Twenty-four children (12 ASD, 12 typically developing, TD) participated in a cross-modal sentence-picture matching task. Neural responses were analyzed across three processing stages associated with structural prediction (P200), semantic integration (N400), and syntactic reanalysis (P600), with mental age controlled. Results revealed a systematic divergence between groups. TD children showed clear P200 and P600 modulation in response to structural mismatch, whereas ASD children exhibited attenuated early differentiation and reduced late reanalysis effects. In contrast, ASD children showed enhanced N400 responses under mismatch conditions, indicating increased semantic integration demands. In addition, the ASD group displayed significantly greater inter-individual variability in hemispheric lateralization, although lateralization strength was not associated with receptive vocabulary performance. These findings support a cascading account in which reduced early predictive engagement in ASD leads to increased integration costs and diminished reanalysis efficiency during recursive processing. More broadly, the results highlight the importance of both temporal processing dynamics and neural variability in understanding language differences in ASD.

104. 【2606.05616】What's in a Name? Morphological Shortcuts by LLMs in Pharmacology

Link: https://arxiv.org/abs/2606.05616

Authors: Kaijie Mo, Thomas Yang, Chantal Shaib, Qing Yao, William Rudman, Ramez Kouzy, Kanishka Misra, Byron C. Wallace, Junyi Jessy Li

Categories: Computation and Language (cs.CL)

Keywords: purely relying, mappings can lead, lead to overgeneralization, overgeneralization in high-stakes, high-stakes domains

Comments: 22 pages

Abstract:The morphological form of a word can often give cues to its meaning, but purely relying on these mappings can lead to overgeneralization in high-stakes domains. In the medical domain, for instance, LLMs can confidently reason about fictitious drugs from their affixes alone (e.g., wugcillin) and generate plausible-looking clinical content. We present a behavioral and mechanistic study of LLM "affix heuristics" in pharmacology. Using fictitious drug names built from real affixes, we show that affix signals alone elicit class-level pharmacological responses. We introduce a framework for identifying whether a model's drug semantics are driven mainly by the affix, the stem, or the drug name as a whole. Applied across 653 drugs, our framework reveals that models often induce drug meaning primarily through affix cues, yet rarely explicitly indicate this reliance, and sometimes incorrectly conflate properties among affix-sharing drugs. Activation patching across models further localizes this behavior to early-mid layers. These findings show that morphological shortcuts pose a subtle but measurable risk to safety.

105. 【2606.05610】Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

Link: https://arxiv.org/abs/2606.05610

Authors: Yongwei Zhou, Juncheng Diao, Junlin Shang, Peiguang Li, Rongxiang Weng

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, batch size, learning rate

Abstract:The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its equivalent pre-training compute -- the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.
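
The "equivalent pre-training compute" step can be illustrated with a short Python sketch. It assumes a common power-law form for the loss-compute law, L(C) = L_inf + a·C^(-b), assumes the equivalent and planned compute are simply added, and assumes the optimal hyperparameter follows a power law in total compute. The abstract specifies none of these forms, and every coefficient below is made up.

```python
def loss_from_compute(compute: float, a: float, b: float, l_inf: float) -> float:
    """Assumed loss-compute scaling law: L(C) = l_inf + a * C**(-b)."""
    return l_inf + a * compute ** (-b)


def equivalent_compute(loss: float, a: float, b: float, l_inf: float) -> float:
    """Invert the law: compute needed to reach `loss` when training from scratch."""
    if loss <= l_inf:
        raise ValueError("loss is at or below the irreducible loss l_inf")
    return ((loss - l_inf) / a) ** (-1.0 / b)


def predict_hparam(total_compute: float, coef: float, exponent: float) -> float:
    """Assumed power law mapping total compute to an optimal hyperparameter."""
    return coef * total_compute ** exponent


# Hypothetical coefficients, as if fitted on small proxy models.
A, B, L_INF = 10.0, 0.1, 1.5

# Pretend the checkpoint was trained with 1e20 FLOPs and measure its loss.
checkpoint_loss = loss_from_compute(1e20, A, B, L_INF)

# Recover the equivalent compute from the validation loss alone.
c_eq = equivalent_compute(checkpoint_loss, A, B, L_INF)

# Combine with the planned continued-pre-training budget (assumed additive).
total = c_eq + 5e19
learning_rate = predict_hparam(total, coef=3.0, exponent=-0.2)
```

The round trip through `loss_from_compute` and `equivalent_compute` recovers the original 1e20 FLOPs up to floating-point error, which is the property the second stage relies on when only the checkpoint's loss is known.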

106. 【2606.05570】TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

Link: https://arxiv.org/abs/2606.05570

Authors: Bobby Yan, Fredrik Kjolstad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: involve large codebases, Repository-level coding benchmarks, incomplete test coverage, coding benchmarks face, Repository-level coding

Abstract:Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from $64.8\%$ for the strongest agent to $22.1\%$ for the weakest. Agents pass different subsets of tasks: pairwise Cohen's $\kappa$ ranges from $-0.07$ to $0.43$, with $\kappa = 0.05$ for the two strongest agents.
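
For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two agents' pass/fail outcomes with the agreement expected by chance given each agent's pass rate. The sketch below uses the standard definition; the two outcome vectors are invented for illustration and are not TensorBench data.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary outcome vectors over the same tasks."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of tasks where both agents pass or both fail.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each agent's marginal pass rate.
    pass_a, pass_b = sum(a) / n, sum(b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Two hypothetical agents with the same pass rate (5 of 8 tasks).
agent_x = [True, True, True, False, False, True, False, True]
agent_y = [True, False, True, True, False, False, True, True]

kappa = cohens_kappa(agent_x, agent_y)
```

Both toy agents pass 5 of 8 tasks, yet κ is slightly negative because they succeed on different tasks. That is the pattern the abstract reports for its two strongest agents: similar pass rates, low overlap.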

107. 【2606.05569】Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

Link: https://arxiv.org/abs/2606.05569

Authors: Huu Tuong Tu, Hanh Nguyen, Thien Van Luong, Nguyen Tien Cuong, Vu Huan, Nguyen Thi Thu Trang

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Mispronunciation Detection, Detection and Diagnosis, gained increasing importance, computer-assisted language learning, recent years

Comments: Accepted at Interspeech 2026

Abstract:Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.

108. 【2606.05568】ColBERTSaR: Sparsified ColBERT Index via Product Quantization

Link: https://arxiv.org/abs/2606.05568

Authors: Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, Saron Samuel, Rohan Jha

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: support candidate set, neural retrieval architecture, heavy index structure, set retrieval based, effective neural retrieval

Comments: 6 pages, 1 figure, accepted at SIGIR 2026 as a short paper

Abstract:While ColBERT is an effective neural retrieval architecture, it requires a heavy index structure to support candidate set retrieval based on approximated token embeddings, gathering and decompressing document token embeddings, and applying the MaxSim operation. Indexes in PLAID and similar ColBERT implementations require five to ten times the disk storage of the original raw text, which limits their scalability. Furthermore, prior work has identified that the gathering and decompression stages are the primary inefficiencies at query time. Limiting the number of document tokens that must be gathered by thresholding and score approximation does not eliminate the need for the entire index to support ad hoc queries. In this work, we propose an embedding quantization approach that turns a ColBERT index into a true inverted index. We show that, theoretically, ColBERT with embedding quantization is equivalent to learned-sparse retrieval except for the scoring mechanism. Empirically, we demonstrate that our index is 50-70% smaller than a one-bit PLAID index while retaining retrieval effectiveness.
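
The idea that quantized token embeddings turn a ColBERT index into an inverted index can be sketched as follows. For brevity this uses a single codebook (plain vector quantization), not product quantization, which would split each embedding into sub-vectors with one codebook each. The codebook, documents, and function names are all hypothetical, and the scoring is a simplified MaxSim over centroid ids, not the paper's method.

```python
from collections import defaultdict
from typing import Dict, List, Set

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def nearest_centroid(vec: Vector, codebook: List[Vector]) -> int:
    """Id of the codebook entry with the highest dot product to `vec`."""
    return max(range(len(codebook)), key=lambda c: dot(vec, codebook[c]))


def build_index(docs: Dict[str, List[Vector]],
                codebook: List[Vector]) -> Dict[int, Set[str]]:
    """Inverted index: centroid id -> documents containing a token mapped to it."""
    index: Dict[int, Set[str]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[nearest_centroid(tok, codebook)].add(doc_id)
    return dict(index)


def maxsim_scores(query: List[Vector], index: Dict[int, Set[str]],
                  codebook: List[Vector]) -> Dict[str, float]:
    """For each query token, take the best centroid similarity per document,
    then sum those maxima over the query tokens."""
    scores: Dict[str, float] = defaultdict(float)
    for q in query:
        best: Dict[str, float] = {}
        for cid, doc_ids in index.items():
            sim = dot(q, codebook[cid])
            for doc_id in doc_ids:
                if sim > best.get(doc_id, float("-inf")):
                    best[doc_id] = sim
        for doc_id, sim in best.items():
            scores[doc_id] += sim
    return dict(scores)


codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy 2-D centroids
docs = {
    "d1": [[0.9, 0.1], [0.1, 0.9]],    # tokens map to centroids 0 and 1
    "d2": [[-0.8, 0.2], [0.2, 0.9]],   # tokens map to centroids 2 and 1
}
index = build_index(docs, codebook)
scores = maxsim_scores([[1.0, 0.0]], index, codebook)
```

Once every token is a centroid id, the index stores only postings lists and no per-token vectors. The remaining difference from a learned-sparse retriever is the scoring rule, which takes a max per query token where a sparse retriever would sum term weights.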

109. 【2606.05564】Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program

Link: https://arxiv.org/abs/2606.05564

Authors: Varun Aggarwal, Kay Kobak, John Howarter

Categories: Computation and Language (cs.CL)

Keywords: Undergraduate Research Fellowship, Summer Undergraduate Research, Undergraduate research programs, Undergraduate research, Purdue University receive

Abstract:Undergraduate research programs such as the Summer Undergraduate Research Fellowship (SURF) at Purdue University receive thousands of applications every year, requiring significant time and effort for program staff to evaluate each submission consistently and within tight timelines. This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow uses OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and a structured rubric across six subcategories, each scored on a 0-3 scale. A few SoPs, graded by program staff, were used to tune the model responses. The model prompt was designed to generate numerical scores, rationales (including positive and negative aspects), and short excerpts from each submission. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours of compute time, averaging roughly 14 seconds per SoP (with per-SoP timing varying with SoP length, which ranged from 500 to 2,000 words). Notable differences in rubric adherence were observed across model versions, with GPT-5.2 adhering most closely. Disagreement in model scores was more pronounced for lower-scoring submissions. The LLM outputs replicated the role previously played by distributed human graders, providing the program coordinator with scored and rationale-annotated outputs for the entire applicant pool. The program coordinator then reviewed these outputs alongside each applicant's SoP, applying the same downstream office criteria used in prior SURF cycles, to produce a shortlist of strong candidates. This coordinator review was completed in approximately 4 hours, compared to the multi-week coordination effort required in prior program cycles.

110. 【2606.05563】SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

Link: https://arxiv.org/abs/2606.05563

Authors: Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: disputants' shifting emotions, real-time trajectory shaped, mediators remains challenging, LLM mediators remains, remains challenging

Abstract:Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

111. 【2606.05561】InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

Link: https://arxiv.org/abs/2606.05561

Authors: Xueyang Wu, Siyuan Liu, Kezhuo Yang, Guang Ling

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Speech-based mental health, mental health screening, health screening offers, screening offers scalable, clinical deployment faces

Abstract:Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6% to 55.5% and age inference from 55.7% to 30.3% with limited utility loss (6% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723.

112. 【2606.05557】AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

Link: https://arxiv.org/abs/2606.05557

Authors: Yang Li, Jiaxiang Liu, Jiang Cai, Mingkun Xu

Categories: Computation and Language (cs.CL)

Keywords: Lin Wei, Wei is free, good mood, situated query, worth interrupting

Comments: Submitted to EMNLP 2026. Code, simulator, and benchmark: https://github.com/innovation64/AURA

Abstract:A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AURA inserts an inference step between scene perception and tool use that produces an IntentFrame: a structured estimate of the implicit need with a scalar gap score that controls per-query probe budget and tool selection. On a 100-query four-scene implicit-intent benchmark, AURA improves implicit-need coverage over ReAct-style probing (Delta = +0.07, p < 10^-6); three of four scenes are individually significant, the gain reproduces on a second backbone, and a prompt ablation attributes the lift to gap calibration rather than answer memorisation. On factual lookup the controller trades raw accuracy for 82% fewer probes and zero forbidden-tool violations on a privacy-sensitive slice; scope conditions are detailed in Limitations. Code, simulator, and benchmark are released at this https URL.

113. 【2606.05553】ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Link: https://arxiv.org/abs/2606.05553

Authors: Woojung Song,Nalim Kim,Sangjun Song,Chaewon Heo,Jongwon Lim,Yohan Jo

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Role-playing language agents, Role-playing language, source text, language agents, story progresses

Abstract:Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.

114. 【2606.05545】Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach

Link: https://arxiv.org/abs/2606.05545

Authors: Nadine Yasser Abdelhalim,Emmanuel Akinrintoyo,Nicole Salomons

Subjects: Computation and Language (cs.CL)

Keywords: Alzheimer Disease Dementia, multilingual Alzheimer Disease, Disease Dementia, Alzheimer Disease, presents significant challenges

Comments: 5 pages

Abstract:The development of multilingual Alzheimer's Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel solution using cross-language training to detect AD in languages beyond those used for model training. This study investigates multilingual deep learning models for detecting AD across different languages and cognitive impairment levels. Using datasets in English, Chinese, Arabic, and Hindi, we developed transformer-based models for binary AD classification. Our approach achieved F1 scores of 82% across all languages, demonstrating strong cross-linguistic generalization. The rapid inference time (0.5 seconds) supports potential real-time screening applications, while consistent performance across languages indicates feasibility for global deployment.

115. 【2606.05538】Less is MoE: Trimming Experts in Domain-Specialist Language Models

Link: https://arxiv.org/abs/2606.05538

Authors: Haoze He,Xinkai Zou,Xuan Jiang,Xingyuan Ding,Ao Qu,Juncheng Billy Li,Heather Miller

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: poses deployment challenges, large parameter footprint, parameter footprint poses, footprint poses deployment, achieve strong performance

Abstract:Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance, which outperforms activation-, router-score-, and magnitude-based alternatives and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.
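
As a rough illustration of ranking intermediate dimensions by Fisher importance: the abstract does not define the estimator, so this sketch assumes the common empirical diagonal Fisher (the mean squared per-sample gradient for each dimension). The function names and the toy gradients are hypothetical and are not taken from the paper.

```python
def fisher_importance(grads_per_sample):
    """Empirical diagonal Fisher score per dimension: mean of squared gradients.

    Assumption: each inner list holds the loss gradient for one sample,
    with one entry per FFN intermediate dimension.
    """
    n = len(grads_per_sample)
    dims = len(grads_per_sample[0])
    return [sum(g[d] ** 2 for g in grads_per_sample) / n for d in range(dims)]


def dims_to_prune(scores, ratio):
    """Indices of the lowest-scoring dimensions at the given compression ratio."""
    k = int(len(scores) * ratio)
    order = sorted(range(len(scores)), key=lambda d: scores[d])
    return sorted(order[:k])


# Toy gradients: three samples, four intermediate dimensions (made-up values).
grads = [
    [0.9, 0.01, -0.5, 0.02],
    [1.1, -0.02, 0.4, 0.01],
    [-1.0, 0.03, 0.6, -0.03],
]
scores = fisher_importance(grads)
pruned = dims_to_prune(scores, ratio=0.5)
print(pruned)  # dimensions 1 and 3 carry almost no gradient signal
```

In this toy case dimension 0 dominates the score, which loosely mirrors the abstract's observation that a very small set of dimensions can be task-critical.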

116. 【2606.05531】Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Link: https://arxiv.org/abs/2606.05531

Authors: Mohammad Mahdi Abootorabi,Omid Ghahroodi,Anas Madkoor,Marzia Nouri,Doratossadat Dastgheib,Mohamed Hefeeda,Ehsaneddin Asgari

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: chart meaningful progress, field lacks benchmarks, human-like multimodal intelligence, true reasoning abilities, rapid progress

Comments: Accepted to ACL 2026 Findings

Abstract:Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: this https URL.

117. 【2606.05523】CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

Link: https://arxiv.org/abs/2606.05523

Authors: Rahul Markasserithodi,Aditya Joshi,Yuekang Li,Ishmanbir Singh,Chris Yoo,Alan Niu

Subjects: Computation and Language (cs.CL)

Keywords: persona modulation, fictional framing, persuasion-based reformulation, framing and persuasion-based, Relative Policy Optimization

Comments: Under Review at ARR

Abstract:Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses either rely on non-scalable human curation or white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2% with 0% false-refusal on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.

118. 【2606.05513】EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

Link: https://arxiv.org/abs/2606.05513

Authors: Yiming Lu,Sihang Zeng,Zhengxu Tang,Max Lau,Fei Liu,Wei Jin

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Epidemic LLM forecasters, operational pandemic forecasting, Epidemic LLM, static supervised models, supervised models

Abstract:Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

119. 【2606.05494】MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization

Link: https://arxiv.org/abs/2606.05494

Authors: Ahmed Alansary,Ali Hamdi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: digital textual information, increasingly important due, Automatic text summarization, text summarization, textual information

Comments: 6 pages, 3 figures, IMSA2026

Abstract:Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces an adaptive selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within an adaptive selection strategy to improve the quality and robustness of automatic text summarization systems.
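
The generate-then-select loop described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say what each candidate is scored against, so the sketch scores candidates against the source article, and it uses a simple unigram F1 as a stand-in for the lexical and semantic metrics. The summarizer functions are hypothetical stubs, not the fine-tuned transformer models from the paper.

```python
import re


def unigram_f1(candidate, source):
    """Stand-in quality score: F1 over the sets of distinct word tokens."""
    c = set(re.findall(r"\w+", candidate.lower()))
    s = set(re.findall(r"\w+", source.lower()))
    overlap = len(c & s)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(s)
    return 2 * precision * recall / (precision + recall)


def select_summary(article, summarizers, score=unigram_f1):
    """Run every summarizer on the article and keep the top-scoring candidate."""
    candidates = {name: fn(article) for name, fn in summarizers.items()}
    best = max(candidates, key=lambda name: score(candidates[name], article))
    return best, candidates[best]


# Hypothetical stub "models" standing in for fine-tuned summarizers.
summarizers = {
    "lead_sentence": lambda text: text.split(". ")[0],
    "last_sentence": lambda text: text.split(". ")[-1],
}

article = "The city council approved the new budget on Monday. The vote was close."
best_name, best_summary = select_summary(article, summarizers)
print(best_name, "->", best_summary)
```

Swapping `score` for a learned metric would leave the selection logic unchanged, since the selector only depends on a function that maps a candidate and a reference text to a number.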

120. 【2606.05486】Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

Link: https://arxiv.org/abs/2606.05486

Authors: Govind Ramesh,Yao Dou,Wei Xu

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, explain observable outputs, existing attribution methods, language models, common source

Comments: 23 pages, 5 figures, 5 tables

Abstract:Prompt ambiguity is a common source of failure in large language models, but is difficult to localize because it is a latent property of the prompt, while existing attribution methods are designed to explain observable outputs such as logits or generated tokens. We introduce PRIG, a gradient attribution method that uses a probe logit to attribute latent ambiguity to token positions. Specifically, PRIG trains a linear probe to distinguish clear prompts from ambiguous prompts and attributes the probe score to earlier token representations in the residual stream. To enable token-level evaluation, we construct synthetic ambiguity datasets across coding, math, and writing by rewriting one task-critical sentence per prompt, and complement them with a human-written gold benchmark. In this setting, PRIG localizes ambiguous spans substantially better than gradient attribution baselines, achieving 0.840 AUROC on the combined synthetic benchmark and 0.891 AUROC on the gold set. It also outperforms GPT-5.4 on sentence-level ambiguity identification and retains useful signal out-of-domain. These results establish PRIG as a practical tool for identifying which parts of a prompt are ambiguous. More broadly, they suggest that latent prompt properties can be localized through intermediate representations, rather than through output-level attribution.

121. 【2606.05444】Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

Link: https://arxiv.org/abs/2606.05444

Authors: Adriana-Valentina Costache,Eduard Poesina,Silviu-Florin Gheorghe,Paul Irofti,Radu Tudor Ionescu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: core NLP task, core NLP, Coreference resolution, NLP task, question answering

Abstract:Coreference resolution is a core NLP task, having a broad range of downstream applications, e.g., machine translation, question answering, and document summarization. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language, to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
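
The cycle-consistency weighting described in this abstract can be illustrated with a small sketch. The abstract only states that cosine similarity between the original and back-translated samples weights the loss, so the details here are assumptions: negative similarities are clamped to zero, the weighted loss is normalized by the sum of weights, and the short vectors stand in for BERT sentence embeddings. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cycle_weights(original_embs, back_translated_embs):
    """One weight per sample; clamping at zero is an assumption of this sketch."""
    return [max(0.0, cosine(o, b)) for o, b in zip(original_embs, back_translated_embs)]


def weighted_loss(losses, weights):
    """Weighted mean of per-sample losses (normalization is an assumption)."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total if total else 0.0


# Toy embeddings: sample 0 survives the translation round trip, sample 1 does not.
original = [[1.0, 0.0], [1.0, 0.0]]
back_translated = [[2.0, 0.0], [0.0, 1.0]]
weights = cycle_weights(original, back_translated)
loss = weighted_loss([2.0, 4.0], weights)
print(weights, loss)
```

Here the second sample receives zero weight, so only the faithfully translated sample contributes to the training signal.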

122. 【2606.05443】MIRAI: Prediction and Generation of High-Impact Academic Research

Link: https://arxiv.org/abs/2606.05443

Authors: Alex Li,Joseph Jacobson

Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)

Keywords: increasingly urgent challenge, urgent challenge, rapid pace, pace of scientific, scientific publishing

Abstract:The rapid pace of scientific publishing has made the identification and synthesis of high-impact work an increasingly urgent challenge. We introduce MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework that predicts paper impact using only its title, abstract, and publication date. We train MIRAI on the arXiv academic graph to predict 5-year PageRank and citation counts, achieving Spearman's $\rho$ of 0.4686 on PageRank prediction and 0.6192 on citation prediction for papers published in 2021. We propose a research ideation pipeline built on top of MIRAI that produces research ideas oriented towards high impact. These ideas were judged as more impactful than a baseline without MIRAI by an unbiased LLM judge at a 4:3 ratio. We make the 5-year citation prediction model publicly available at this https URL.
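
For readers unfamiliar with the metric reported in this abstract, Spearman's rank correlation compares the ordering of predictions with the ordering of true values rather than their magnitudes. The sketch below is a generic implementation (Pearson correlation of average ranks), and the predicted scores and citation counts are made-up toy values, not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy example: predicted impact scores vs. hypothetical 5-year citation counts.
predicted = [0.10, 0.40, 0.35, 0.80]
citations = [3, 10, 12, 50]
rho = spearman_rho(predicted, citations)
print(round(rho, 4))
```

Only the middle two papers are ordered differently in the two lists, which is why the correlation in this toy case stays high but falls short of 1.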

123. 【2606.05436】n Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

Link: https://arxiv.org/abs/2606.05436

Authors: Alejandro Lozano,Keiko Ihara,Ping-Hao Yang,Carrie E. Robertson,Jennifer Stern,Allan Purdy,Hsiangkuo Yuan,Pengfei Zhang,Yulia Orlova,Olga Fermo,Jennifer Hranilovich,Fred Cohen,Todd J. Schwedt,Jenelle A. Jindal,Serena Yeung-Levy,Chia-Chun Chiang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: high-quality patient care, Summarizing the latest, latest medical literature, latest medical, decision-making is essential

Abstract:Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.

124. 【2606.05421】ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation

Link: https://arxiv.org/abs/2606.05421

Authors: Joseph Marvin Imperial,Junhong Liang,Belal Shoer,Abdullah Barayan,Rodrigo Wilkens,Omar Mussa,Dawn Knight,Eugénio Ribeiro,Ekaterina Kochmar,Sowmya Vajjala,Fernando Alva-Manchego,Harish Tayyar Madabushi

Subjects: Computation and Language (cs.CL)

Keywords: Common European Framework, machine translation, CEFR levels, CEFR, Common European

Abstract:When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other, using the Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity. Across six languages, including Arabic, Dutch, English, French, Hindi, and Russian, we evaluate three open-weight models, one closed model, and a commercial machine translation system on two tasks: i) correlation of CEFR with translation difficulty, and ii) shifts in CEFR levels of the source texts. Our experiments show that higher CEFR levels make texts more difficult to translate, and that machine translation shifts the CEFR level of the target text compared to the original source, for most languages. These findings provide new insights for researchers and practitioners working on multilingual pedagogical content generation and machine translation difficulty estimation.

125. 【2606.05415】Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval

Link: https://arxiv.org/abs/2606.05415

Authors: Padmaja Jonnalagedda,Yuguang Yao,Xiang Gao,Hilaf Hasson,Kamalika Das

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Real-world data spans, data spans tables, Real-world data, spans tables, implicit semantics

Comments: 9 pages, 4 figures, plus supplementary appendix

Abstract:Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema -- optionally extended via a monotonic protocol -- conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.

126. 【2606.05414】When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

Link: https://arxiv.org/abs/2606.05414

Authors: Avinash Baidya,Xinran Liang,Ruocheng Guo,Xiang Gao,Kamalika Das

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: alerting requires deciding, requires deciding, failure, failure alerting requires, evidence

Comments: 9 pages, 14 figures, and appendix

Abstract:Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with $\alpha$-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision, and that the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.

127. 【2606.05409】Would you still call this Dax? Novel Visual References in VLMs and Humans

Link: https://arxiv.org/abs/2606.05409

Authors: Ada Defne Tür,Gaurav Kamath,Joyce Chai,Siva Reddy,Benno Krojer

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: remains largely underexplored, exposure remains largely, Visual References Dataset, Vision-language models, largely underexplored

Abstract:Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

128. 【2606.05405】Agents' Last Exam

Link: https://arxiv.org/abs/2606.05405

Authors: Yiyou Sun,Xinyang Han,Weichen Zhang,Yuanbo Pang,Tianyu Wang,Yuhan Cao,Yixiao Huang,Chris Duroiu,Haoyun Zhang,Jeffrey Lin,Weishu Zhang,Tyler Zeng,Ying Yan,Bo Liu,Hanson Wen,Mingyang Xu,Xiaoyuan Liu,Zimeng Chen,Weiyan Shi,Amanda Dsouza,Vincent Sunn Chen,Patrick Bryant,Carl Boettiger,Yamini Rangan,Bradley Rothenberg,Kyle Steinfeld,Arvind Rao,Tapio Schneider,Georgios Yannakakis,Laure Zanna,Kaan Ozbay,Ida Sim,Tarek Zohdi,George Em Karniadakis,Jack Gallant,Teresa Head-gordon,Yushan Li,Wenxi Deng,Tao Sun,Huiqi Wang,Zhun Wang,Justin Xu,Chris Yuhao Liu,Yafei Cheng,Rongwang Hu,Aras Bacho,Shengcao Cao,Zengyi Qin,Yixiong Chen,Hengduan Fan,Hao Liu,Lin Zeng,Shashank Muralidhar Bharadwaj,Litian Gong,Yingxuan Yang,Maojia Song,Ruheng Wang,Zongzheng Zhang,Honglin Bao,Shuo Lu,Jianhong Tu,Zhonghua Wang,Zheng Zhang,Zijiao Chen,yanqiong Jiang,Zhendong Li,Bohan Lyu,Chang Ma,Peiran Xu,Benran Zhang,Shangding Gu,Haoyue Hua,Haoyang Li,Wanzhe Liao,Chengzhi Liu,Junbo Peng,Haoran Sun,Zechen Xu,Bo Chen,Jiayi Cheng,Yi Jiang,Keying Kuang,Yuan Li,Youbang Pan,Ziyan Rao,Alexander Schubert,Yifan Shen,Vincent Siu,Xiatao Sun,Kangqi Zhang,Xiaopan Zhang,Yuchen Zhu,Ishaan Singh Chandok,Lei Ding,Jingxuan Fan,Andrew Glover,Jiaming Hu,Yiran Hu,Wenbo Huang,Zixin Jiang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Recent AI systems, economically meaningful deployment, achieved strong results, professional domains, systems have achieved

Comments: Project website: [this https URL](https://agents-last-exam.org) Code: [this https URL](https://github.com/rdi-berkeley/agents-last-exam)

Abstract:Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

129. 【2606.05404】Harnessing Generalist Agents for Contextualized Time Series

Link: https://arxiv.org/abs/2606.05404

Authors: Zihao Li,Kaifeng Jin,Yuanchen Bei,Jiaru Zou,Avaneesh Kumar,Xuying Ning,Yanjun Zhao,Mengting Ai,Baoyu Jing,Hanghang Tong,Jingrui He

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: holistic modeling, embedded in rich, essential for holistic, Time series, rich contexts

Comments: Preprint. 38 Pages

Abstract:Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at this https URL.

130. 【2606.05402】ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Link: https://arxiv.org/abs/2606.05402

Authors: Jinu Lee,Shivam Agarwal,Amruta Parulekar,Siddarth Madala,Dilek Hakkani-Tur,Julia Hockenmaier

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large reasoning models, Large reasoning, backtracking and self-correction, produce reasoning traces, complicate the evaluation

Comments

Click to view abstract

Abstract:Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: this https URL.

131. 【2606.05400】LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Link: https://arxiv.org/abs/2606.05400

Authors: Yuanhe Zhang,Yuekai Sun,Taiji Suzuki,Jason D. Lee,Fanghui Liu

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: corrupt distant work, Long-horizon autoformalization, repairs corrupt distant, statements drift, dependencies tangle

Comments: 26 pages, 9 figures. Comments are welcome

Click to view abstract

Abstract:Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at this https URL.

132. 【2606.05384】Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

Link: https://arxiv.org/abs/2606.05384

Authors: Srimonti Dutta,Akshata Kishore Moharir

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: automated evaluators, model outputs, outputs are compared, compared and ranked, ranked using automated

Comments: Accepted at ACL 2026 GEM (Generation, Evaluation and Metrics) Workshop

Click to view abstract

Abstract:LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.

133. 【2606.05346】Trajectory Dynamics in Language Model Hidden States Predict Human Processing Costs Beyond Surprisal

Link: https://arxiv.org/abs/2606.05346

Authors: Elan Barenholtz

Subjects: Computation and Language (cs.CL)

Keywords: Human language comprehension, comprehension unfolds sequentially, interpretation builds incrementally, language comprehension unfolds, Human language

Comments: 17 pages, 3 figures, 6 tables

Click to view abstract

Abstract:Human language comprehension unfolds sequentially: each word is processed in the context of those that came before, and the interpretation builds incrementally over time. Surprisal, the negative log probability of a word given its context, has been the dominant predictor of incremental processing cost. But surprisal reduces rich sequential representations to a single scalar at each word, discarding information about the direction in which the interpretation has been evolving. Dynamical-systems approaches suggest that the trajectory of the evolving interpretive state, not just its position at each moment, should shape processing, and language itself may have local momentum, since speakers plan utterances a few words at a time. We introduce trajectory extrapolation error: at each word, we fit a linear trajectory to the preceding hidden states of a transformer language model and measure deviation from the extrapolated path. On the Natural Stories corpus, this measure is nearly orthogonal to surprisal (r = .044) and independently predicts self-paced reading times. The effect is especially pronounced in garden-path sentences, strengthens with model scale (GPT-2 Small to Large), and replicates across architectures with different positional encoding schemes (GPT-2 vs. Pythia/RoPE). A displacement control shows the effect is not reducible to representational change magnitude: displacement and extrapolation error predict in opposite directions. These findings reveal two dissociable components of processing cost: word-level prediction error (surprisal) and sensitivity to the local momentum of the unfolding interpretation (trajectory extrapolation error).
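
The measure described in this abstract (fit a line to the preceding hidden states, extrapolate one step, measure the deviation) can be sketched roughly as follows. This is only an illustrative reading of the abstract, not the paper's code: the function name, the window size of 3, the least-squares fit, and the Euclidean distance are all assumptions.

```python
import numpy as np

def trajectory_extrapolation_error(hidden_states, window=3):
    """Illustrative sketch (assumed details, not the paper's implementation).

    hidden_states: array of shape (T, D), one hidden state per word.
    Returns an array of shape (T,); positions with fewer than `window`
    preceding states are NaN.
    """
    h = np.asarray(hidden_states, dtype=float)
    n_words = h.shape[0]
    errors = np.full(n_words, np.nan)
    steps = np.arange(window)
    # Design matrix for a per-dimension linear fit: h ≈ slope * step + intercept.
    design = np.stack([steps, np.ones(window)], axis=1)
    for t in range(window, n_words):
        past = h[t - window:t]                      # the preceding states
        coef, *_ = np.linalg.lstsq(design, past, rcond=None)
        predicted = coef[0] * window + coef[1]      # extrapolate one step ahead
        errors[t] = np.linalg.norm(h[t] - predicted)
    return errors
```

On a perfectly linear sequence of states the error is zero at every position; a state that departs from the extrapolated path by a vector of length 5 yields an error of 5.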

134. 【2606.05336】Self-supervised User Profile Generation for Personalization

Link: https://arxiv.org/abs/2606.05336

Authors: Clark Mingxuan Ju,Yuwei Qiu,Tong Zhao,Neil Shah

Subjects: Computation and Language (cs.CL)

Keywords: Personalizing large language, large language models, Personalizing large, language models, deployed across recommendation

Comments

Click to view abstract

Abstract:Personalizing large language models (LLMs) has become a central challenge as LLMs are deployed across recommendation, search, dialogue, and content generation -- settings where the same query should yield different answers given different users. A promising route is to summarize each user's interaction history into a natural-language memory or profile and prepend it to the prompt to facilitate personalization. Existing methods learn such profile generators with explicit rewards derived from labeled downstream tasks, which are expensive and sparse as they require annotated supervision for every target task. In light of this challenge, we introduce Bidirectional User Modeling via Profiles (BUMP), a self-supervised framework that trains a profile generator without any downstream labels. Specifically, given a user's interaction history, we use GRPO to train an LLM to emit a free-form textual profile under a bidirectional in-batch ranking objective: a small LLM judge measures (i) how well the generated profile, used as a query, ranks the user's own held-out interactions above interactions from other users in the batch, and (ii) how well a held-out interaction, used as a query, ranks the user's own profile above profiles of other users. Both directions are scored with multi-positive NDCG and combined into a dense reward per rollout; other users in the batch supply free negatives, so every training example yields supervision from raw interaction logs alone. Evaluated on the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods relying on labeled rewards, while requiring no task label at training.
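
For readers unfamiliar with the ranking metric mentioned in this abstract, a minimal sketch of NDCG with several positives and binary relevance is given below. The abstract does not state BUMP's exact reward formula, so the function name, the binary relevance labels, and the log2 discount are assumptions based on the standard NDCG definition.

```python
import math

def ndcg_binary(ranked_relevance):
    """Standard NDCG for a ranked list of 0/1 relevance labels (assumed form).

    ranked_relevance[i] is 1 if the item at rank i (0-based) is a positive,
    e.g. one of the user's own held-out interactions, else 0.
    """
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(ranked_relevance))
    n_pos = int(sum(ranked_relevance))
    if n_pos == 0:
        return 0.0  # no positives in the list: define the score as 0
    # Ideal ranking places every positive at the top.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(n_pos))
    return dcg / idcg
```

A ranking that places every positive first scores 1.0, and each positive pushed below a negative lowers the score.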

135. 【2606.05330】A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

Link: https://arxiv.org/abs/2606.05330

Authors: Jared Moore,Noah Goodman,Nick Haber,Max Kleiman-Weiner

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large language models, Large language, post belief change, high-stakes domains, rely on pre

Comments

Click to view abstract

Abstract:Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.

136. 【2606.05315】LoRi: Low-Rank Distillation for Implicit Reasoning

Link: https://arxiv.org/abs/2606.05315

Authors: Ryan Solgi,Jiayi Tian,Zheng Zhang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, aim to internalize, large language, explicit CoT prompting, Implicit

Comments

Click to view abstract

Abstract:Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.

137. 【2606.05308】Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Link: https://arxiv.org/abs/2606.05308

Authors: Abhishek Divekar

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP)

Keywords: extended Prediction-Powered Inference, small human-labeled set, large LLM-judged set, Prediction-Powered Inference, Inference to produce

Comments: Accepted at ACL 2026 - GEM Workshop

Click to view abstract

Abstract:With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
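
The basic idea of combining a small human-labeled set with a large LLM-judged set can be illustrated with the simplest prediction-powered estimator, the one for a mean. This sketch is not the PRECISE method: it omits the hierarchical per-query treatment of Precision@K that the abstract describes, and the function and argument names are hypothetical.

```python
def ppi_mean(labeled_truth, labeled_pred, unlabeled_pred):
    """Basic prediction-powered estimate of a mean (illustrative sketch).

    labeled_truth:  human labels on the small labeled set
    labeled_pred:   LLM-judge labels on the same small set
    unlabeled_pred: LLM-judge labels on the large unlabeled set
    """
    n = len(labeled_truth)
    if n == 0 or n != len(labeled_pred) or len(unlabeled_pred) == 0:
        raise ValueError("need paired labeled data and non-empty unlabeled data")
    # Mean of the judge's labels on the large set (possibly biased).
    judge_mean = sum(unlabeled_pred) / len(unlabeled_pred)
    # Rectifier: average judge error measured on the human-labeled set.
    rectifier = sum(y - f for y, f in zip(labeled_truth, labeled_pred)) / n
    return judge_mean + rectifier
```

If the judge over-reports relevance on the labeled set, the rectifier is negative and pulls the large-set mean down by the measured average error.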

138. 【2606.05233】Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Link: https://arxiv.org/abs/2606.05233

Authors: Nicholas Saban

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: red-teaming papers report, papers report prompt-injection, report prompt-injection attack, headline numbers cluster, red-teaming papers

Comments

Click to view abstract

Abstract:Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature's high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.
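
The 2.60% upper bound quoted for 0/140 successes can be reproduced with the closed form of the Clopper-Pearson interval for zero observed successes. The sketch below assumes a two-sided 95% interval, which is the assumption that matches the reported figure; the function name is hypothetical.

```python
def clopper_pearson_upper_zero_successes(n_trials, confidence=0.95):
    """Upper limit of the two-sided Clopper-Pearson interval when 0 of
    n_trials succeed. In that case the exact bound has the closed form
    1 - (alpha / 2) ** (1 / n_trials)."""
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n_trials)

upper = clopper_pearson_upper_zero_successes(140)
print(f"{100 * upper:.2f}%")  # prints "2.60%"
```

A one-sided 95% bound would instead use alpha in place of alpha / 2 and give a smaller value.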

139. 【2606.05194】Temporal Preference Concepts and their Functions in a Large Language Model

Link: https://arxiv.org/abs/2606.05194

Authors: Ian Rios-Sialer,Shantanu Darveshi,Shuai Jiang,Avigya Paudel,Anastasiia Pronina,Ipshita Bandyopadhyay,Justin Shenk

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, long-term consequences, resolve these tradeoffs

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason.

140. 【2606.05186】Staged Factorial Screening for Budget-Constrained Micro-Pretraining

Link: https://arxiv.org/abs/2606.05186

Authors: Felipe Chavarro Polania

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Budget-constrained micro-pretraining, micro-pretraining often requires, requires triaging, triaging many candidate, candidate recipes

Comments: 23 pages, 4 figures

Click to view abstract

Abstract:Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

141. 【2606.05183】The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

Link: https://arxiv.org/abs/2606.05183

Authors: Patrick Keough

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large language models, Large language, alignment benchmarks treat, high-stakes advisors, benchmarks treat sycophancy

Comments: 16 pages, 9 figures

Click to view abstract

Abstract:Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert ≥ 2.0) and 22.7 percent reach moderate or severe levels (≥ 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.

142. 【2606.05182】LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

Link: https://arxiv.org/abs/2606.05182

Authors: Rahul Subramani

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, discard critical details, Large language, Episodic Retrieval Network, finite context windows

Comments

Click to view abstract

Abstract:Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p < 0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p < 0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.

143. 【2606.05181】Multi-Granularity Reasoning for Natural Language Inference

Link: https://arxiv.org/abs/2606.05181

Authors: Chunling Xi,Di Liang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language Inference, Language Inference, fundamental task, requires determining, Natural Language

Comments

Click to view abstract

Abstract:Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.

144. 【2606.05180】From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

Link: https://arxiv.org/abs/2606.05180

Authors: Ivo Bueno,Babette Bühler,Philipp Stark,Tim Fütterer,Ulrich Trautwein,Dorottya Demszky,Heather Hill,Enkelejda Kasneci

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automated scoring models, including classroom transcripts, Automated scoring, assign rubric-based quality, rubric-based quality ratings

Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.

145. 【2606.05179】Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

Link: https://arxiv.org/abs/2606.05179

Authors: Sungmook Woo,Hyungu Kang,Chanwoo Kim

Subjects: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Automatic Speech, Speech Recognition, restoration improves ASR, Punctuation restoration improves

Comments: Accepted for presentation at The International Joint Conference on Neural Networks (IJCNN) 2026

Click to view abstract

Abstract:Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight α and a validation-calibrated threshold τ (no parameter updates during inference). On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.

146. 【2606.05177】MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

Link: https://arxiv.org/abs/2606.05177

Authors: Manh Luong,Tamas Abraham,Junae Kim,Amar Kaur,Rollin Omari,Gholamreza Haffari,Trang Vu,Lizhen Qu,Dinh Phung

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Keywords: Omni Large Language, Large Language Models, Large Language, assess Omni Large, benchmarks focus solely

Comments

Click to view abstract

Abstract:Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

147. 【2606.05176】PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

Link: https://arxiv.org/abs/2606.05176

Authors: Lucas Tamic,Ilan Jaffeux-Cheniout,Xavier Marjou

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: telecommunications customer support, support remain limited, customer support remain, natural language understanding

Comments

Click to view abstract

Abstract:While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.

Cite as: arXiv:2606.05176 [cs.CL] (or arXiv:2606.05176v1 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2606.05176 (arXiv-issued DOI via DataCite)

Submission history: From Lucas Tamic, [v1] Fri, 17 Apr 2026 09:56:18 UTC (171 KB)

148. 【2606.05175】Generic Triple-Latent Compression with Gated Associative Retrieval

Link: https://arxiv.org/abs/2606.05175

Authors: Liu Xiao

Categories: Computation and Language (cs.CL)

Keywords: compressed pair-memory pathway, running token state, capture higher-order token, higher-order token interactions, study generic triple-latent

Comments:

Click to view abstract

Abstract:We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves a small Transformer baseline on byte-level WikiText-2 and on a tokenizer-based MiniMind language-model benchmark, while a recall-focused gated key-value retrieval extension improves associative recall but remains seed-sensitive and much slower in the current reference implementation.

149. 【2606.05174】Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

Link: https://arxiv.org/abs/2606.05174

Authors: Arash Ahmadi,Parisa Masnadi,Sarah Sharif,Charles Nicholson,David Ebert,Mike Banad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, shown strong promise, Language Models, healthcare applications

Comments: 27 pages

Click to view abstract

Abstract:Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
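
To see why continuous rewards can give a richer signal than binary ones under GRPO, recall that GRPO scores each sampled response relative to the other responses in its group. A minimal sketch of that group-relative normalization follows. The reward values are made up, and the use of the population standard deviation is an assumption here (implementations differ); this is not the paper's Variance-Aware Reward Framework itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses:
    advantage_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Binary rewards where every response passes: no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
    # Hypothetical continuous rubric scores: responses are still ranked.
    print(group_relative_advantages([0.90, 0.75, 0.80, 0.95]))
```

When every response in a group receives the same reward, all advantages collapse to zero, so a graded reward that separates near-identical responses keeps the gradient informative.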

150. 【2606.05173】Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

Link: https://arxiv.org/abs/2606.05173

Authors: Aimen Boukhari

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Masked language modelling, deeper semantic structure, surface-form token identity, Masked language, dominant pre-training objective

Comments: 12 pages, 10 figures, 11 tables. Preprint. Code available at: [this https URL](https://github.com/aymen-000/predict-reconstruct-language-models)

Click to view abstract

Abstract:Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.

151. 【2606.05168】Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Link: https://arxiv.org/abs/2606.05168

Authors: Xiangyu Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: existing analyses treat, existing analyses, analyses treat, synthetic data, Training on synthetic

Comments: 24 pages, 15 figures

Click to view abstract

Abstract:Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{\beta_D \beta_M / [(\gamma_D+\mu_D)(\gamma_M+\mu_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
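
The reproduction number quoted in the abstract is easy to evaluate directly. The sketch below implements that formula as written, with subscript D for the data-corpus layer and M for the model layer, and checks which side of the epidemic threshold a parameter set falls on. The numeric values are hypothetical and are not the paper's calibrated scenarios.

```python
import math


def bilayer_r0(beta_d, beta_m, gamma_d, mu_d, gamma_m, mu_m):
    """R_0 = sqrt(beta_D * beta_M / ((gamma_D + mu_D) * (gamma_M + mu_M)))."""
    return math.sqrt(beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m)))


def regime(r0):
    """Standard epidemic threshold: contamination spreads when R_0 > 1."""
    return "supercritical" if r0 > 1 else "subcritical or critical"


if __name__ == "__main__":
    # Hypothetical rates, for illustration only.
    r0 = bilayer_r0(beta_d=0.4, beta_m=0.4,
                    gamma_d=0.1, mu_d=0.1, gamma_m=0.1, mu_m=0.1)
    print(r0, regime(r0))
```

Because the removal rates enter through the denominator, raising gamma_D (for example, through better synthetic-text filtering of corpora) lowers R_0, which is in line with detection being flagged as a high-leverage parameter.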

152. 【2606.06444】USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

Link: https://arxiv.org/abs/2606.06444

Authors: Heng-Jui Chang,Alexander H. Liu,Saurabhchand Bhati,Mrudula Athi,Anton Ratnarajah,Amit Chhetri,James Glass

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: modern audio applications, large language models, increasingly rely, diverse inputs, critical to modern

Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

153. 【2606.06183】Revisiting Lexicon Evaluation in Unsupervised Word Discovery

Link: https://arxiv.org/abs/2606.06183

Authors: Simon Malan,Danel Slabbert,Herman Kamper

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: zero-resource speech processing, speech processing, central goal, goal in zero-resource, zero-resource speech

Comments: 6 figures

Click to view abstract

Abstract:Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
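
As a rough illustration of the size bias mentioned above, the sketch below computes normalized edit distance (NED) over within-cluster pairs in two ways: pooling every pair across the lexicon, and averaging per-cluster scores. The pooled variant is only an assumed form of the common metric, and neither function is one of the paper's proposed metrics. The toy clusters use characters as stand-ins for phonemes.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or phoneme lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance divided by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def pooled_ned(clusters):
    """Average over all within-cluster pairs; large clusters supply most pairs."""
    pairs = [ned(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0


def per_cluster_ned(clusters):
    """Average of per-cluster mean NED; each cluster counts once."""
    scores = [pooled_ned([c]) for c in clusters if len(c) > 1]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    toy = [["kat"] * 4, ["dog", "bog"]]  # one large pure cluster, one small noisy one
    print(pooled_ned(toy), per_cluster_ned(toy))
```

In the toy lexicon the large pure cluster contributes six of the seven pairs, so the pooled score is far lower than the per-cluster average, even though half of the clusters contain an error.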

Information Retrieval

1. 【2606.06407】A Vision-language Framework for Comparative Reasoning in Radiology

Link: https://arxiv.org/abs/2606.06407

Authors: Tengfei Zhang,Ziheng Zhao,Lisong Dai,Xiaoman Zhang,Pengcheng Qiu,Ya Zhang,Yanfeng Wang,Weidi Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: imaging artificial intelligence, remains poorly aligned, isolated image interpretation, Medical imaging artificial, artificial intelligence

Comments:

Click to view abstract

Abstract:Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

2. 【2606.06260】OneReason Technical Report

Link: https://arxiv.org/abs/2606.06260

Authors: OneRec Team,Biao Yang,Boyang Ding,Chenglong Chu,Dunju Zang,Fei Pan,Han Li,Hao Jiang,Honghui Bao,Huanjie Wang,Jian Liang,Jiangxia Cao,Jiao Ou,Jiaxin Deng,Jinghao Zhang,Kun Gai,Lu Ren,Peiru Du,Pengfei Zheng,Rongzhou Zhang,Ruiming Tang,Shiyao Wang,Siyang Mao,Siyuan Lou,Teng Shi,Wei Yuan,Wenlong Xu,Xingchen Liu,Xingmei Wang,Xinqi Jin,Yan Sun,Yan Wang,Yifei Hu,Yingzhi He,Yufei Ye,Yuhao Wang,Yunhao Zhou,Yuqin Dai,Zhao Liu,Zhipeng Wei,Zhixin Ling,Ziming Li,Zixing Zhang,Ziyuan Liu,An Zhang,Changxin Lao,Chaoyi Ma,Chengru Song,Defu Lian,Fan Yang,Guowang Zhang,Hao Peng,Jiayao Shen,Jie Chen,Jun Xu,Junmin Chen,Kun Zhang,Kuo Cai,Mingxing Wen,Minmao Wang,Minxuan Lv,Qi Zhang,Qiang Luo,Sheng Yu,Shijie Li,Shijie Yi,Shuang Yang,Shugui Liu,Shuni Chen,Tinghai Zhang,Tingting Gao,Xiang Wang,Xiangyu Wu,Xiangyu Zhao,Xiao Lv,Xiaoyou Zhou,Xuming Wang,Yong Du,Zejian Zhang,Zhaojie Liu,Zhiyang Zhang,Zhuang Zhuang,Ziqi Wang,Ziyi Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: real-world services, OneRec family, widely deployed, Generative recommendation models, Generative recommendation

Comments: Work in progress

Click to view abstract

Abstract:Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.

3. 【2606.06242】Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

Link: https://arxiv.org/abs/2606.06242

Authors: AJ Carl P. Dy,Aivin V. Solatorio

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: figures and tables, Institutional documents, analytical information embedded, substantial amounts, generic document layout

Comments: 23 pages, 8 figures

Click to view abstract

Abstract:Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at this https URL and the source code is available at this https URL.

4. 【2606.06225】Bridging the Semantic-Collaborative Gap: An Asymmetric Graph Architecture for Cold-Start Item Recommendation

Link: https://arxiv.org/abs/2606.06225

Authors: Anh Truong,John Trenkle,Yuanbo Chen,Honghong Zhao,Abdullah Alchihabi,Effy Fang,Michael Tamir

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: observed user interactions, leverage observed user, graph-based recommendation models, interaction history, user interactions

Comments:

Click to view abstract

Abstract:Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi's production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.

5. 【2606.06106】WebKnoGraph: GNN-Powered Internal Linking

Link: https://arxiv.org/abs/2606.06106

Authors: Emilija Gjorgjevska,Georgina Mirceva,Miroslav Mirchev

Categories: Information Retrieval (cs.IR)

Keywords: search engine optimization, generic tool recommendations, fixed page templates, Internal link optimization, engine optimization

Comments:

Click to view abstract

Abstract:Internal link optimization is a recurring task in search engine optimization, yet many production workflows rely on manual judgment, fixed page templates, or generic tool recommendations. Practitioners need ways to evaluate candidate links before deployment because link changes can redistribute authority and affect semantic coherence in ways that are difficult to isolate after release. We present WebKnoGraph, an open-source framework for evaluating internal linking strategies on website crawls. The framework models a website as a directed graph, represents pages by embeddings, scores candidate links with GraphSAGE, and evaluates interventions by embedding the site into larger host environments. We instantiate WebKnoGraph on a production crawl of this http URL and compare automatic with expert-assisted link selection in an empirical FineWeb-based host graph and a synthetic Barabási-Albert host graph, using PageRank-based authority metrics and semantic coherence. The results show that automatic selection generally produces stronger authority redistribution, with higher Authority Yield, but also larger semantic coherence costs. Expert-assisted selection better preserves semantic coherence and, when targeting low-PageRank pages, achieves the highest Authority Yield, although with the least favorable loss-gain balance. Authority Volatility provides an additional stability perspective, but is interpreted cautiously because the two regimes use different numbers of intervention sets. These findings support a practical workflow in which candidate intervention sets are generated at scale, evaluated jointly across authority gain, volatility, loss-gain balance, and semantic coherence, and then reviewed for editorial deployability before implementation.
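
To illustrate how a single internal link can redistribute authority, here is a minimal sketch that models a site as a directed graph and compares plain PageRank before and after adding one link. The three-page site is hypothetical. The paper's GraphSAGE link scoring, host-graph embedding, and metrics such as Authority Yield are not defined in the abstract and are not reproduced here.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by pages without outlinks is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank


if __name__ == "__main__":
    # Hypothetical site: "deep" has no inbound links.
    site = {"home": ["blog"], "blog": ["home"], "deep": ["home"]}
    before = pagerank(site)

    # Candidate intervention: add an internal link home -> deep.
    candidate = dict(site, home=["blog", "deep"])
    after = pagerank(candidate)

    for page in sorted(before):
        print(page, round(before[page], 4), round(after[page], 4))
```

The gain for `deep` comes at the expense of `blog`, which now shares the outgoing authority of `home`; that kind of trade-off is why candidate links are worth evaluating before deployment.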

6. 【2606.06073】Edge-Aware Curvature Modeling for Graph Understanding in Large Language Models

Link: https://arxiv.org/abs/2606.06073

Authors: Zhenghong Lin,Zhibin Shi,Hongyang Dong,Xinjie Ye,Yuhong Chen,Shiping Wang

Categories: Information Retrieval (cs.IR)

Keywords: shown promising capabilities, jointly modeling graph-structured, modeling graph-structured data, graph-aware Large Language, Large Language Models

Comments:

Click to view abstract

Abstract:Recently, graph-aware Large Language Models (LLMs) have shown promising capabilities in jointly modeling graph-structured data and textual information. Existing approaches typically employ a graph encoder and a frozen LLM to obtain node representations from graph and textual views, followed by node-level alignment to bridge the two modalities. However, such alignment mechanisms primarily focus on node information while overlooking edge-level structures, leading to suboptimal information propagation across views. In this work, we conduct a comprehensive theoretical analysis to uncover why node-level alignment is insufficient for aligning textual and graph representations. Specifically, we prove theoretically for the first time that neglecting edge information leads to suboptimal solutions and negatively curved edges induce bottlenecked information flow, giving rise to the over-squashing phenomenon between graph and textual views. To address the two challenges, we innovatively proposed a CureLLM framework of Curvature-enhanced Graph Representations for Large Language Model whose goal is to inject the signals of edge information into the existing LLMs. Specifically, CureLLM first introduces the training-free textual prompt mechanism to make the LLM model generate the output directly based on the edge-aware prompt without learnable parameter costs. Furthermore, a novel curvature-aware graph representation learning is designed to capture the edge structure information to enhance the downstream tasks, where the message passing between text and graph representations only depends on edges with positive curvature. Finally, we conduct evaluations with 20 different compared methods on 11 real world datasets from various domains and the experiment results demonstrate the superiority of our proposed CureLLM framework.

7. 【2606.06036】Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

Link: https://arxiv.org/abs/2606.06036

Authors: Shuo Ji,Yibo Li,Bryan Hooi

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: long interaction histories, recent progress, interaction histories, long interaction, memory

Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.

8. 【2606.05931】To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Link: https://arxiv.org/abs/2606.05931

Authors: Erfan Loweimi,Mengjie Qian,Kate Knill,Guanfeng Wu,Chi-Ho Chan,Abbas Haider,Muhammad Awan,Josef Kittler,Hui Wang,Mark Gales

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Keywords: voice and face, retrieving a person, unlike curated benchmarks, real-world broadcast archives, Abstract

Comments: INTERSPEECH 2026

Click to view abstract

Abstract:When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).

9. 【2606.05907】Knowledge Manifold: A Riemannian Geometric Framework for Semantic Mapping and Geodesic Analysis of Scientific Literature

Link: https://arxiv.org/abs/2606.05907

Authors: Tomonaga Okabe,Kazuhiko Komatsu

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: n-gram TF-IDF representations, character n-gram TF-IDF, positional relationships derived, semantic positional relationships, Riemannian geometric space

Comments:

Click to view abstract

Abstract:We present the knowledge manifold: a Riemannian geometric space in which a corpus of documents is arranged according to semantic positional relationships derived from character n-gram TF-IDF representations. The framework proceeds in five tightly coupled stages. First, each document is converted to a character-level n-gram TF-IDF vector (4-7 grams, up to 250,000 features, L2-normalized) and embedded in a two-dimensional knowledge map via constrained stress minimization with repulsion, variance, and centering regularizers. Second, knowledge at an arbitrary query point is estimated through Smoothed Particle Hydrodynamics (SPH) interpolation using a cubic-spline kernel, yielding an interpolated TF-IDF feature vector that can be linguistically characterized. Third, directional knowledge gradients at 0, 45, and 90 degrees are computed from the SPH interpolation map, and pairwise directional similarity is quantified via inner product and cosine similarity. Fourth, a Gaussian Process Regression (GPR) model, with a Constant x RBF + White kernel fitted on a 10-dimensional SVD projection, provides a Bayesian posterior mean, uncertainty estimate, and per-document contribution rate at the query point. Fifth, geodesics in the knowledge space are obtained by minimizing a discrete Riemannian path energy derived from the SPH-induced metric tensor, using L-BFGS-B with seven deterministic initial-path candidates. We apply the formulation to a corpus of 20 papers in fiber-reinforced composite materials and aerospace structural mechanics, showing that the semantic map recovers meaningful research clusters, geodesic paths reveal natural conceptual bridges between distant topics, and SPH/GPR interpolation enables the generation of virtual knowledge: hypothetical paper abstracts describing unstudied but geometrically predicted research directions.
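
The first stage of the pipeline above, character n-gram TF-IDF with L2 normalization, can be sketched in a few lines of standard-library Python. The smoothed IDF weighting is an assumption (the abstract does not say which variant is used), the 250,000-feature cap is omitted, and the three short documents are made up for illustration. The later stages (stress-minimized 2D map, SPH interpolation, GPR, geodesics) are not covered.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=4, n_max=7):
    """All character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_min=4, n_max=7):
    """Sparse, L2-normalized TF-IDF vectors stored as {ngram: weight} dicts."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1 (an assumption of this sketch).
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity; inputs are already L2-normalized."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


if __name__ == "__main__":
    docs = [
        "fiber reinforced composite laminate",
        "fiber reinforced composite plate",
        "aerospace structural mechanics",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Pairwise dissimilarities derived from these cosine scores are the kind of input a stress-minimization layout would then consume to place documents on a 2D map.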

10. 【2606.05693】MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry

Link: https://arxiv.org/abs/2606.05693

Authors: Joey Chan, Wonbin Kweon, Ashley Shin, Niharika Bhattacharjee, Pengcheng Jiang, Yue Guo, Jiawei Han

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: SMILES differ substantially, structures remains limited, Large language models, Large language, SMILES differ

Abstract:Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: retrieved chemistry literature, molecule-specific information including compound synonyms, identifiers, functional group annotations, and physicochemical descriptors, and structurally similar molecules retrieved from the training set. We evaluate MolE-RAG across nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.

11. 【2606.05658】Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval

Link: https://arxiv.org/abs/2606.05658

Authors: Anuj Maharjan, Devinder Kaur, Richard Molyet

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, enhances Large Language, Language Models, Large Language, Retrieval-Augmented Generation

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding their responses in external knowledge, but conventional pipelines rely on static, single-step retrieval that limits performance on complex queries. This paper presents an Agent-Orchestrated Adaptive RAG framework that introduces dynamic query decomposition, iterative retrieval, and a bounded self-reflective evaluation loop. We evaluate the system across two complementary datasets: a domain-specific DevOps knowledge base and the multi-hop reasoning benchmark MuSiQue. Using metrics that include overall score, citation accuracy, mean reciprocal rank, and topic coverage, we find that query decomposition yields consistent gains in the structured domain (overall score $+0.04$, MRR $+0.17$ on DevOps) but degrades ranking precision on the multi-hop benchmark, while the reflection mechanism improves citation accuracy at a substantial latency cost. These contrasting results show that agentic enhancements are not universally beneficial and must be applied selectively according to query and domain characteristics. Our findings argue for adaptive, cost-aware orchestration rather than uniformly aggressive reasoning pipelines.
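
Mean reciprocal rank (MRR), one of the metrics reported above, can be computed as in the following sketch. The function name and the toy document IDs are illustrative assumptions; the paper's own evaluation code is not shown in the abstract.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Hypothetical retrieval results for three queries.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"b"}, {"x"}, {"missing"}]
print(mean_reciprocal_rank(rankings, relevant))
```

Under this definition, an MRR gain of 0.17 means the first relevant passage tends to appear noticeably closer to the top of the ranked list.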

12. 【2606.05621】ANCHOR: Agentic Noise Creation Framework for Human Simulation and Denoising Recommendation

Link: https://arxiv.org/abs/2606.05621

Authors: Xiangming Li, Hua Chu, Chengyu Feng, Jianan Li, Yangtao Zhou

Categories: Information Retrieval (cs.IR)

Keywords: Distilling accurate user, Distilling accurate, accurate user preferences, noise, remains a fundamental

Abstract:Distilling accurate user preferences from noisy implicit feedback remains a fundamental bottleneck in recommendation systems, highlighting the need for recommendation denoising. However, real-world data lack explicit noise annotations, forcing existing methods to rely on unsupervised side information or handcrafted heuristics. These approaches often incur high external costs, generalize poorly, or depend on unreliable priors, causing noise misidentification and corrupting true user preference representations. To address these limitations, we propose a paradigm-level reformulation of recommendation denoising. Instead of indirectly inferring noisy interactions through heuristics, our Creation-Recognition paradigm proactively creates labeled noisy interactions and trains a dedicated recognizer to identify them, transforming denoising from heuristic filtering into supervised learning. Based on this paradigm, we present ANCHOR, an agent-based framework inspired by recent LLM-as-User research. ANCHOR simulates user behaviors to generate realistic noise labels and enables supervised denoising through two stages: noise creation and noise recognition. In the noise creation stage, ANCHOR adopts a recommender-in-the-loop agentic architecture to synthesize both diverse out-of-preference noise and informative boundary-adjacent noise. For out-of-preference noise, it implements five extensible simulation mechanisms to approximate major sources of noisy implicit feedback. For boundary-adjacent noise, an adversarial boundary refinement mechanism generates ambiguous interactions that challenge the recognizer and target the decision boundary. In the noise recognition stage, ANCHOR leverages the generated labels to train a reusable parametric recognizer that integrates collaborative signals and semantic representations to detect noise patterns in real interaction data.

13. 【2606.05568】ColBERTSaR: Sparsified ColBERT Index via Product Quantization

Link: https://arxiv.org/abs/2606.05568

Authors: Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, Saron Samuel, Rohan Jha

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: support candidate set, neural retrieval architecture, heavy index structure, set retrieval based, effective neural retrieval

Comments: 6 pages, 1 figure, accepted at SIGIR 2026 as a short paper

Abstract:While ColBERT is an effective neural retrieval architecture, it requires a heavy index structure to support candidate set retrieval based on approximated token embeddings, gathering and decompressing document token embeddings, and applying the MaxSim operation. Indexes in PLAID and similar ColBERT implementations require five to ten times the disk storage of the original raw text, which limits their scalability. Furthermore, prior work has identified that the gathering and decompression stages are the primary inefficiencies at query time. Limiting the number of document tokens that must be gathered by thresholding and score approximation does not eliminate the need for the entire index to support ad hoc queries. In this work, we propose an embedding quantization approach that turns a ColBERT index into a true inverted index. We show that, theoretically, ColBERT with embedding quantization is equivalent to learned-sparse retrieval except for the scoring mechanism. Empirically, we demonstrate that our index is 50-70% smaller than a one-bit PLAID index while retaining retrieval effectiveness.
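
The MaxSim operation mentioned above scores a document by summing, over query tokens, the best dot product against any document token. A minimal sketch with toy two-dimensional embeddings (the vectors and the function name are illustrative assumptions; the quantized index proposed in the paper is not reproduced):

```python
def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the
    maximum dot product with any document token embedding."""
    score = 0.0
    for q in query_vecs:
        score += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return score


# Toy token embeddings, assumed to be unit length.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.6, 0.8]]
print(maxsim(query, document))
```

Every document token embedding has to be gathered and decompressed before the maximum can be taken, which is why the abstract identifies those stages as the main query-time cost.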

14. 【2606.05537】PHKT: Personalized Dynamic Hypergraph-enhanced KAN-Transformer for Multi-behavior Sequential Recommendation

Link: https://arxiv.org/abs/2606.05537

Authors: Ruijie Du, Hao Chen, Xin Zhang, Dongjing Wang, Ze Zhang, Xudong Shen, Runze Wu, Dongjin Yu

Categories: Information Retrieval (cs.IR)

Keywords: provide richer supervisory, richer supervisory information, purchases can provide, provide richer, richer supervisory

Comments: 14 pages, 6 figures, 6 tables

Abstract:In multi-behavior recommendation, auxiliary behaviors such as clicks, add-to-cart, and purchases can provide richer supervisory information for predicting target behaviors. Although existing graph and hypergraph methods are capable of modeling high-order relationships among users, items, and behaviors, they still have limitations in heterogeneous semantics, user-specific weighting, and sequence dependency modeling. While standard Transformers excel at sequence modeling, their shared feedforward mapping struggles to accommodate the differentiated requirements of heterogeneous latent patterns in multi-behavior scenarios. To address this, this paper proposes the Personalized Hypergraph-enhanced Kolmogorov-Arnold Network Transformer (PHKT). Specifically, we design a personalized dynamic hypergraph module that performs behavior-aware weighting of item similarities based on users' historical behavior sequences to capture user-specific heterogeneous high-order relationships. Meanwhile, a Transformer is used as the temporal backbone to model the evolution of short- and long-term preferences, and KAN is introduced to replace the traditional MLP in the feedforward network to enhance fine-grained modeling capability for nonlinear responses to different latent patterns. Experiments on three real datasets, Tmall, RetailRocket, and IJCAI, show that PHKT consistently outperforms nine strong baseline models across multiple evaluation metrics, demonstrating its effectiveness in multi-behavior preference modeling and target behavior prediction.

15. 【2606.05436】Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

Link: https://arxiv.org/abs/2606.05436

Authors: Alejandro Lozano, Keiko Ihara, Ping-Hao Yang, Carrie E. Robertson, Jennifer Stern, Allan Purdy, Hsiangkuo Yuan, Pengfei Zhang, Yulia Orlova, Olga Fermo, Jennifer Hranilovich, Fred Cohen, Todd J. Schwedt, Jenelle A. Jindal, Serena Yeung-Levy, Chia-Chun Chiang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: high-quality patient care, Summarizing the latest, latest medical literature, latest medical, decision-making is essential

Abstract:Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.

16. 【2606.05308】Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Link: https://arxiv.org/abs/2606.05308

Authors: Abhishek Divekar

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP)

Keywords: extended Prediction-Powered Inference, small human-labeled set, large LLM-judged set, Prediction-Powered Inference, Inference to produce

Comments: Accepted at ACL 2026 - GEM Workshop

Abstract:With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
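
The core idea of combining a small human-labeled set with a large LLM-judged set can be illustrated with the basic prediction-powered estimator of a mean: the average judge score on the large set, plus a correction measured on the small set. This is a sketch of that basic estimator only; the paper's extension to per-query metrics such as Precision@K is not reproduced, and the data and function name below are hypothetical.

```python
import math
from statistics import mean, variance


def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Prediction-powered estimate of a mean and its standard error.

    human_labels / judge_on_labeled: paired scores on the small labeled set.
    judge_on_unlabeled: judge scores on the large unlabeled set.
    Each list needs at least two entries for the variance terms.
    """
    # Rectifier: how far the judge is from the human label on the same item.
    rectifier = [y - f for y, f in zip(human_labels, judge_on_labeled)]
    estimate = mean(judge_on_unlabeled) + mean(rectifier)
    std_err = math.sqrt(
        variance(judge_on_unlabeled) / len(judge_on_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    return estimate, std_err


# Hypothetical binary relevance judgments.
human = [1, 0, 1, 0]
judge_labeled = [1, 1, 1, 0]          # the judge over-predicts once
judge_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1]
print(ppi_mean(human, judge_labeled, judge_unlabeled))
```

The correction term removes the judge's average bias, which is why the estimate stays unbiased whatever the judge's error profile; a more accurate judge simply shrinks the rectifier variance and hence the standard error.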

17. 【2606.05257】Scaling Laws for Behavioral Foundation Models over User Event Sequences

Link: https://arxiv.org/abs/2606.05257

Authors: Rickard Brüel Gabrielsson

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: actions in recommendation, user actions, lack the kind, provide for language, critical batch size

Abstract:Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.

18. 【2606.05182】LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

Link: https://arxiv.org/abs/2606.05182

Authors: Rahul Subramani

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, discard critical details, Large language, Episodic Retrieval Network, finite context windows

Abstract:Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25 ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.

19. 【2505.03336】Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View

Link: https://arxiv.org/abs/2505.03336

Authors: Hao Liao, Jiwei Zhang, Jianxun Lian, Wensheng Lu, Mingqi Wu, Shuo Wang, Yong Zhang, Yitian Huang, Mingyang Zhou, Rui Mao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Keywords: Large Language Models, Recommender systems based, Language Models, Large Language, Recommender systems

Comments: 20 pages

Abstract:Recommender systems based on Large Language Models (LLMs) are often plagued by hallucinations of out-of-domain (OOD) items. To address this, we propose RecLM, a unified framework that bridges the gap between retrieval and generation by instantiating three grounding paradigms under a single architecture: embedding-based retrieval, constrained generation over rewritten item titles, and discrete item-tokenizer generation. Using the same backbone LLM and prompts, we systematically compare these three views on public benchmarks. RecLM strictly eradicates OOD recommendations (OOD@10 = 0) across all variants, and the constrained generation variants RecLM-cgen and RecLM-token achieve overall state-of-the-art accuracy compared to both strong ID-based and LLM-based baselines. Our unified view provides a systematic basis for comparing three distinct paradigms to reduce item hallucinations, offering a practical framework to facilitate the application of LLMs to recommendation tasks. Source code is at this https URL.

Computer Vision

1. 【2606.06485】PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Link: https://arxiv.org/abs/2606.06485

Authors: Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, multimodal large language, enabled unified solutions, multimodal large, large language models

Comments: Project page: [this https URL](https://atrovast.github.io/PAR3D/)

Abstract:Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

2. 【2606.06477】Complexity-Balanced Diffusion Splitting

Link: https://arxiv.org/abs/2606.06477

Authors: Noam Issachar, Dani Lischinski, Raanan Fattal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Standard continuous-time generative, intricate data distributions, Standard continuous-time, signal regimes, data distributions

Abstract:Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deploying a massive network uniformly across the entire generative timeline is inherently inefficient. In this work, we propose Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor's equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are more difficult to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow's Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. Using a lightweight auxiliary model to estimate these complexity profiles, our approach eliminates the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets demonstrates that CBS consistently improves synthesis quality without increasing per-step inference cost. In particular, CBS improves FID by ~35% on SiT-XL with CFG relative to naive temporal partitioning. Project page is available at this https URL.
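
The equidistribution principle referred to above places breakpoints so that the integral of a non-negative monitor function is the same on every segment. A minimal sketch on a sampled monitor follows; the monitor values are made up for illustration, and the paper's actual monitors (Dirichlet energy and trajectory acceleration) are not computed here.

```python
def equidistribute(ts, monitor, k):
    """Split [ts[0], ts[-1]] into k segments carrying equal monitor mass.

    ts: increasing grid of time points; monitor: non-negative samples on ts
    with a positive total integral. Returns the k + 1 breakpoints.
    """
    # Cumulative trapezoidal integral of the monitor on the grid.
    cum = [0.0]
    for i in range(1, len(ts)):
        cum.append(cum[-1] + 0.5 * (monitor[i] + monitor[i - 1]) * (ts[i] - ts[i - 1]))
    total = cum[-1]

    breaks = [ts[0]]
    j = 1
    for s in range(1, k):
        target = total * s / k
        while cum[j] < target:
            j += 1
        # Linear interpolation inside the grid cell; exact only when the
        # monitor is constant on that cell, otherwise an approximation.
        frac = (target - cum[j - 1]) / (cum[j] - cum[j - 1])
        breaks.append(ts[j - 1] + frac * (ts[j] - ts[j - 1]))
    breaks.append(ts[-1])
    return breaks


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
# Hypothetical complexity profile: harder dynamics early in the timeline.
print(equidistribute(grid, [3.0, 3.0, 1.0, 1.0, 1.0], 2))
```

With a monitor that is larger early on, the first segment comes out shorter than half the timeline, so a sub-network assigned to it covers less time but the same estimated approximation burden.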

3. 【2606.06476】Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Link: https://arxiv.org/abs/2606.06476

Authors: Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: abilities remain largely, remain largely constrained, reasoning abilities remain, shown strong visual, shown strong

Comments: Project page: [this https URL](https://zcmax.github.io/projects/Thinking-With-Imagination)

Abstract:While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.

4. 【2606.06458】In-Context Multiple Instance Learning

Link: https://arxiv.org/abs/2606.06458

Authors: Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Instance Learning, Multiple Instance, Instance Learning, addresses problems, satellite imagery

Abstract:Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

5. 【2606.06407】A Vision-language Framework for Comparative Reasoning in Radiology

Link: https://arxiv.org/abs/2606.06407

Authors: Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: imaging artificial intelligence, remains poorly aligned, isolated image interpretation, Medical imaging artificial, artificial intelligence

Abstract:Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision--language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

6. 【2606.06390】HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Link: https://arxiv.org/abs/2606.06390

Authors: Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: modern interior design, crucial for robot, modern interior, whole-home floorplan generation, Indoor scene

Abstract:Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: this https URL

7. 【2606.06379】EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

Link: https://arxiv.org/abs/2606.06379

Authors: Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Medical vision-language models, clinical image interpretation, shown increasing potential, vision-language models, report generation

Abstract:Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.

8. 【2606.06369】Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation

链接https://arxiv.org/abs/2606.06369

作者:Maëlic Neau,Salim Baloch,Jakob Suchan,Zoe Falomir,Mehul Bhatt

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Learning-driven Scene Graph, frequent relation types, Scene Graph Generation, capture reliable visual, reliable visual commonsense

备注

Abstract:Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.

9. 【2606.06363】GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery

链接https://arxiv.org/abs/2606.06363

作者:Hao Lei,Xi Cheng,Chenlu Shu,Zhiheng Chen,Zhengjie Duan,Haoyu Wang,Zhanfeng Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Urban green-space extraction, limits semantic reuse, similar vegetation patterns, visually similar vegetation, Urban green-space

备注: 34 pages, 5 figures

Abstract:Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 x 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.
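The NDVI gate and momentum update described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: NDVI uses its standard definition, (NIR - Red) / (NIR + Red), while the function names, the threshold of 0.5, the momentum of 0.9, and the single-prototype memory are all assumptions made for the example.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index for one pixel: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)


def momentum_update(prototype, descriptor, momentum=0.9):
    """Exponential moving average that pulls a stored prototype toward a new descriptor."""
    return [momentum * p + (1.0 - momentum) * d for p, d in zip(prototype, descriptor)]


def admit_if_vegetation(prototype, descriptor, nir, red, threshold=0.5, momentum=0.9):
    """Update the prototype only when NDVI marks the sample as confident vegetation.

    The threshold and momentum values are hypothetical, not taken from the paper.
    """
    if ndvi(nir, red) >= threshold:
        return momentum_update(prototype, descriptor, momentum)
    return prototype


proto = [0.0, 0.0]
# High NIR and low red reflectance give a high NDVI, so this descriptor is admitted.
proto = admit_if_vegetation(proto, [1.0, 2.0], nir=0.8, red=0.1)
# Equal NIR and red reflectance give NDVI = 0, so this descriptor is rejected.
proto = admit_if_vegetation(proto, [5.0, 5.0], nir=0.3, red=0.3)
```

In this sketch NDVI never enters the feature vector itself; it only decides whether a descriptor may touch the memory, which mirrors the decoupling the abstract describes.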

10. 【2606.06361】Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

链接https://arxiv.org/abs/2606.06361

作者:Woojung Han,Seil Kang,Youngjun Jun,Min-Hung Chen,Fu-En Yang,Seong Jae Hwang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visually stunning content, leverage input images, generate visually stunning, diffusion models leverage, violates physical laws

备注: ICML 2026

Abstract:Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time).
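To make the magnitude/phase distinction concrete, the sketch below splits a 1-D signal's Fourier spectrum into magnitude and phase and shows that a pure shift (a toy stand-in for motion) leaves the magnitude unchanged while altering the phase. This follows from the Fourier shift theorem; it is not the paper's spectral analysis, and the signals and function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


frame_a = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
# The same pattern, circularly shifted by two samples (toy "motion").
frame_b = frame_a[-2:] + frame_a[:-2]

mag_a, phase_a = magnitude_and_phase(frame_a)
mag_b, phase_b = magnitude_and_phase(frame_b)

# Largest magnitude difference over all frequencies: zero up to rounding error.
magnitude_gap = max(abs(a - b) for a, b in zip(mag_a, mag_b))
# Phase difference at the first non-zero frequency: clearly non-zero.
phase_gap = abs(phase_a[1] - phase_b[1])
```

Because displacement lives in the phase term, degrading phase while preserving magnitude can plausibly corrupt motion without visibly changing appearance statistics, which is consistent with the finding reported in the abstract.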

11. 【2606.06359】Comparison of Deep Learning Frameworks For Rice Disease Mapping From UAV Multispectral Imaging

链接https://arxiv.org/abs/2606.06359

作者:Yadav Raj Ghimire,Jagrati Talreja,Tewodros Syum Gebre,Timothy Agboada,Shikha V. Chandel,Leila Hashemi Beni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:bacterial leaf blight, convolutional neural networks, UAV multispectral imagery, leaf blight, neural networks

备注: This paper has been accepted in IGARSS 2026. Copyright 2026 IEEE

Abstract:In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and transformer-based models. The evaluated architectures include U-Net with a ResNet-101 encoder, U-Net++ with EfficientNet-B3 and EfficientNet-B7, DeepLabV3+, and SegFormer, all trained under a common pipeline with three input configurations (multispectral only, multispectral+NDVI, and multispectral+NDRE). Experiments are conducted using the publicly available BLB dataset with performance reported using mean IoU (mIoU), mean F1 (mF1), mean accuracy (mAcc), precision, and recall. U-Net++ with EfficientNet-B3 achieved the highest performance, with an mIoU of 97.62%. SegFormer obtained lower segmentation accuracy but comparable inference speed. Overall, the results indicate that lightweight CNN backbones remain more reliable for operational BLB monitoring while integration of vegetation indices provides small and consistent improvements. The study also highlights the value of standardised UAV datasets to compare disease mapping methods and encourages the use of CNN architectures for field implementation.
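For readers unfamiliar with the reported metrics, here is a minimal sketch of how mIoU and mF1 are commonly computed from a pixel-level confusion matrix. The two-class matrix is made-up toy data, and scoring an absent class as 0.0 is an assumption of this sketch; the paper's exact evaluation protocol may differ.

```python
def per_class_iou_and_f1(confusion):
    """confusion[i][j] counts pixels with true class i that were predicted as class j."""
    n = len(confusion)
    ious, f1s = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true class differs
        union = tp + fp + fn
        # A class that never occurs and is never predicted scores 0.0 here (an assumption).
        ious.append(tp / union if union else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if union else 0.0)
    return ious, f1s


def mean(values):
    return sum(values) / len(values)


# Toy two-class confusion matrix (healthy vs. diseased), invented for illustration.
confusion = [
    [50, 10],
    [5, 35],
]
ious, f1s = per_class_iou_and_f1(confusion)
miou = mean(ious)
mf1 = mean(f1s)
```

Per class, F1 is never lower than IoU (F1 = 2·IoU / (1 + IoU)), so mF1 figures read higher than mIoU figures for the same predictions.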

12. 【2606.06338】StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

链接https://arxiv.org/abs/2606.06338

作者:Zhengqian Wu,Zhixian Liu,Aodong Chen,Jingyang Zhang,Ruizhe Li,Hanlin Ge,Zhongyuan Wang,Chunxia Xiao,Chao Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:DVU, aims to answer, Video question answering, DVU dataset, DVU datasets

备注: Accepted by IJCV 2026

Abstract:Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these difficulties, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re-organizing long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: this https URL

13. 【2606.06329】Efficient Mean Curvature Computation on High-Dimensional Data Manifolds

链接https://arxiv.org/abs/2606.06329

作者:Alexandre L. M. Levada

类目:Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:Curvature Boundary Points, Boundary Points, Curvature Boundary, Estimating local, key ingredient

备注: 31 pages, 2 figures and 5 tables

Abstract:Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix $H$ whose trace form yields an $O(m^4)$ cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates $H$ entirely and reduces the per-point cost to $O(m^2)$ after the eigendecomposition. The second contribution addresses the remaining $O(m^3)$ bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most $k-1 \ll m$, we replace it with a truncated SVD of the $k \times m$ centered data matrix, an $O(k^2 m)$ operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost $O(k^2 m + k m p^2)$, where $p = k-1$. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.

14. 【2606.06309】RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

链接https://arxiv.org/abs/2606.06309

作者:Chensheng Dai,Shengjun Zhang,Yifan Li,Zhang Zhang,Zheng Zhu,Yueqi Duan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable performance, Diffusion Transformers, high inference latency, achieved remarkable, remarkable performance

备注: Project Page: [this https URL](https://simon-dcs.github.io/Website-of-RhymeFlow/) , Code: [this https URL](https://github.com/Simon-Dcs/RhymeFlow)

Abstract:Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that due to the corresponding contents and motions among adjacent frames, when keyframes with critical semantic transitions are anchored, the intermediate states of others often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate our method outperforms existing baselines with higher inference speed and better visual quality.

15. 【2606.06294】Towards One-to-Many Temporal Grounding

链接https://arxiv.org/abs/2606.06294

作者:Qi Xu,Yue Tan,Shihao Chen,Jiahao Meng,Anna Wang,Shunping Ji,Hao Fei,Jason Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Temporal Grounding, aims to localize, localize video segments, textual query, Grounding

备注: Accepted to ICML'26

Abstract:Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

16. 【2606.06292】Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

链接https://arxiv.org/abs/2606.06292

作者:Ariel Herrera,Xueyang Kang,Atal Anil Kumar

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:textiles remains challenging, Robotic manipulation, robust visual perception, visual perception required, manipulation of textiles

备注

Abstract:Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

17. 【2606.06278】Geodesic Flow Matching on a Riemannian Degradation Manifold for Blind Image Restoration

链接https://arxiv.org/abs/2606.06278

作者:Akshay Janardan Bankar,Ankita Chatterjee,Sayan Banerjee,Shreyas Pandith,Kalakonda Sai Shashank,Amit Satish Unde

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Blind image restoration, restoration requires recovering, requires recovering clean, Blind image, image restoration requires

备注: Submitted to ECCV 2026

Abstract:Blind image restoration requires recovering clean images from observations corrupted by unknown and potentially mixed degradations. While recent deterministic flow-based methods model restoration as transport processes that map degraded images to clean ones, they typically rely on Euclidean interpolation, implicitly assuming linear degradation geometry. In this paper, we explicitly model degradations as points on a low-dimensional Riemannian manifold and formulate restoration as geodesic transport on the joint image-manifold space. Using a geodesic flow matching objective, we learn intrinsic transport dynamics that respect the curvature of degradation space. This framework generalizes linear flow matching, provides a principled treatment of mixed degradations as geodesic compositions, and yields a clean theoretical interpretation for generalization beyond observed degradations.

18. 【2606.06255】RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning

链接https://arxiv.org/abs/2606.06255

作者:Ziyang Yu,Xiang Li,Qiong Chang,Jun Miyazaki

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:underpinning LiDAR-based autonomous, LiDAR-based autonomous driving, primary sensory representation, Farthest Point Sampling, underpinning LiDAR-based

备注: 28 pages, 15 figures

Abstract:Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the most well-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the large time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest End-to-End inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
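For reference, the classical FPS update rule that RadiusFPS is said to preserve can be sketched as below. This is the textbook baseline only, without any voxel indexing or pruning; the function name, the start index of 0, and the lowest-index tie-breaking rule are assumptions of this sketch rather than details from the paper.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Classical FPS: repeatedly pick the point farthest from the already selected set."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start_index]
    # min_dist[i] is the squared distance from point i to its nearest selected point.
    min_dist = [sq_dist(p, points[start_index]) for p in points]

    while len(selected) < min(num_samples, len(points)):
        # max() returns the first maximum, so ties go to the lowest index.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        # This full pass over all points is the per-iteration cost that pruning targets.
        for i, p in enumerate(points):
            d = sq_dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
picked = farthest_point_sampling(points, 3)  # → [0, 3, 4]
```

Each iteration touches every point, so sampling s points from n costs O(n·s) distance evaluations; an acceleration that skips provably unaffected points can reduce that work while returning the same indices, provided initialization and tie-breaking match.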

19. 【2606.06249】GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention

链接https://arxiv.org/abs/2606.06249

作者:Giordano Cicchetti,Eleonora Grassucci,Danilo Comminiello

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Transformer-based multimodal models, Transformer-based multimodal, information across heterogeneous, Transformer-based, modalities

备注

Abstract:Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
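One standard way to compute the volume spanned by a set of vectors is the square root of the determinant of their Gram matrix. The sketch below illustrates that geometric quantity only; how VMA turns such volumes into attention scores is not specified in the abstract, and every name here is hypothetical.

```python
import math


def gram_matrix(vectors):
    """Matrix of pairwise inner products G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: the vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def spanned_volume(vectors):
    """Volume of the parallelotope spanned by the vectors: sqrt(det(Gram))."""
    return math.sqrt(max(determinant(gram_matrix(vectors)), 0.0))


query = [1.0, 0.0, 0.0]
# Mutually orthogonal unit vectors span the largest possible volume for unit vectors.
volume_orthogonal = spanned_volume([query, [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
# Perfectly aligned vectors collapse the volume to zero.
volume_aligned = spanned_volume([query, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

For unit vectors the volume lies in [0, 1] and shrinks as the query and all keys align, so a single number summarizes the joint configuration instead of a collection of pairwise dot products.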

20. 【2606.06242】Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

链接https://arxiv.org/abs/2606.06242

作者:AJ Carl P. Dy,Aivin V. Solatorio

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:figures and tables, Institutional documents, analytical information embedded, substantial amounts, generic document layout

备注: 23 pages, 8 figures

Abstract:Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at this https URL and the source code is available at this https URL.

21. 【2606.06228】SAM-Flow: Source-Anchored Masked Flow for Training-Free Image Editing

链接https://arxiv.org/abs/2606.06228

作者:Haowang Cui,Rui Chen,Tao Luo,Tao Guo,Zheng Qin,Jiaze Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently attracted increasing, Training-free image editing, modify real images, powerful pre-trained diffusion, attracted increasing attention

备注: Code is available at: [this https URL](https://github.com/chwbob/Sam-Flow)

Abstract:Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based and differential-flow-based methods usually perform global latent transport, which inevitably propagates editing effects to non-target regions and leads to background leakage. To address this problem, we propose SAM-Flow, a source-anchored masked flow framework for localized training-free image editing. Instead of updating the whole latent representation, SAM-Flow first uses a scout image and token-grounded attention maps to localize the editable semantic regions. It then applies differential velocity updates only within these regions, while anchoring the remaining areas to the source-image latent trajectory. To further improve spatial stability and boundary naturalness, we introduce a time-varying source-anchored projection mechanism with dynamic soft masks, transition regions, and temporal mask accumulation. The proposed method is plug-and-play and can be integrated with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX without any fine-tuning. Extensive qualitative and quantitative experiments demonstrate that SAM-Flow achieves accurate semantic editing while significantly improving background preservation, providing a simple and general localized editing paradigm for training-free image editing. Code is available at: this https URL.

22. 【2606.06224】Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology

链接https://arxiv.org/abs/2606.06224

作者:Yanqing Luo(1 and 2),Julius Hense(1 and 2),Niklas Prenißl(3 and 4),Andreas Mock(5 and 6 and 7),Klaus-Robert Müller(1 and 2 and 8 and 9),Thomas Schnake(10 and 11 and 12),Mina Jamshidi Idaji(1 and 2) ((1) Berlin Institute for the Foundations of Learning and Data, Berlin, Germany, (2) Machine Learning Group, Technische Universität Berlin, Berlin, Germany, (3) Institute of Pathology, Charité Universitätsmedizin, Berlin, Germany, (4) Berlin Institute of Health at Charité -- Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Digital Clinician Scientist Program, Berlin, Germany, (5) Institute of Pathology, Ludwig Maximilian University of Munich, Munich, Germany, (6) Division of Translational Medical Oncology, DKFZ, Heidelberg, Germany, NCT Heidelberg, Heidelberg, Germany, (7) German Cancer Consortium (DKTK), partner site Munich, a partnership between DKFZ and Ludwig-Maximilians-Universität München (LMU), Germany, (8) Department of Artificial Intelligence, Korea University, Seoul, Korea, (9) Max-Planck Institute for Informatics, Saarbrücken, Germany, (10) Department of Chemistry, Chemical Physics Theory Group, University of Toronto, Canada, (11) Vector Institute for Artificial Intelligence, Toronto, Canada, (12) Acceleration Consortium, University of Toronto, Canada)

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:multiple instance learning, instance learning, multiple instance, validation and discovery, discovery in digital

备注: 23 pages, 18 figures

Abstract:Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model's behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model's predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.

23. 【2606.06217】DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

链接https://arxiv.org/abs/2606.06217

作者:Tan Zhang,Quanyou Li,Lu Zhang,Jun Liu,Xiaofeng Zhu,Ping Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:noisy low-altitude UAV, low-altitude UAV views, on-site compute constraints, tight on-site compute, low-altitude UAV

备注

Abstract:When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at this https URL.

24. 【2606.06199】SC-MFJ: A Simple Haptic Quality Metric for Medical Image Segmentation

链接https://arxiv.org/abs/2606.06199

作者:Souraj Adhikary,Negar Chabi,Andre Mastmeyer

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Hausdorff distance measure, Dice and Hausdorff, Hausdorff distance, measure geometric overlap, rendering in surgical

备注: 11 pages, 5 figures, 5 tables, [this http URL](http://www.wscg.eu/)

Abstract:Standard segmentation metrics such as Dice and Hausdorff distance measure geometric overlap but say nothing about whether a segmented surface is suitable for haptic rendering in surgical simulation. We propose SC-MFJ (Surface-Constrained Mean Force Jerk), a simple, inexpensive metric that samples a segmented organ surface with many short virtual stylus walks and measures how jerky the resulting contact forces are. The metric is computed from existing segmentation outputs and uses roughly one minute of CPU time per case. We evaluate three pancreas CT segmentation approaches (binary nnU-Net output, Gaussian-smoothed output, and learned signed distance function (SDF) regression) across 80 cases in five-fold cross-validation. SC-MFJ reveals a 147x gap in haptic quality between the raw binary baseline and simple Gaussian post-processing, a difference entirely invisible to Dice and HD95. It also shows that learned SDF regression, despite requiring full model retraining, produces more variable haptic quality than Gaussian smoothing, with a case-level standard deviation of 168 N/s² compared with 22 N/s² for Gaussian. A second evaluation on the LiTS liver dataset (131 cases) confirms the generality of these findings: the binary-to-Gaussian gap widens to 189x, and Gaussian smoothing again produces consistently low force jerk across all folds. Our results suggest that for haptic simulation applications, a one-line post-processing step may be sufficient, and that a cheap metric like SC-MFJ can flag problems that geometric metrics miss.

25. 【2606.06194】ActiveMimic: Egocentric Video Pretraining with Active Perception

Link: https://arxiv.org/abs/2606.06194

Authors: Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Egocentric human video, Egocentric human, human video, human video offers, active perception

Notes: Project Page: [this https URL](https://activemimic.github.io/)

Abstract:Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

26. 【2606.06186】Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

Link: https://arxiv.org/abs/2606.06186

Authors: Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown strong zero-shot, strong zero-shot generalization, remain highly vulnerable, Vision-Language Models, posing serious risks

Notes: Accepted by ICLR2026

Abstract:Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

27. 【2606.06176】RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

Link: https://arxiv.org/abs/2606.06176

Authors: Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Underwater Image Enhancement, Underwater Image, Image Enhancement, mitigating degradations caused, water medium

Abstract:Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available upon acceptance.

28. 【2606.06158】Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

Link: https://arxiv.org/abs/2606.06158

Authors: Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu, Rajeshkumar SA

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: underlying visual complexity, video tokenisation seeks, dynamically allocate token, allocate token budgets, token budgets based

Abstract:Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers \cite{infotok, agarwal2025cosmos}, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a $31\times$ inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an $\approx 2\times$ speedup over the discrete information-theoretic baseline (InfoTok).
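
The thresholding rule described in this abstract can be illustrated with a minimal sketch. It assumes latents stored as nested lists indexed `[frame][position][channel]`, a hypothetical threshold `tau`, and that the first frame is always kept in full. None of these details are given in the abstract, and the names `redundancy_mask` and `kept_fraction` are invented for illustration.

```python
from typing import List

# Assumed layout: latents[frame][position][channel]
Latent = List[List[List[float]]]


def redundancy_mask(latents: Latent, tau: float) -> List[List[bool]]:
    """Return keep[t][p]: True if position p of frame t is retained.

    A position is dropped when the L1 distance between its latent vector
    and the same position in the previous frame is below the threshold tau.
    """
    keep = [[True] * len(latents[0])]  # assumption: the first frame is kept whole
    for t in range(1, len(latents)):
        row = []
        for prev, cur in zip(latents[t - 1], latents[t]):
            l1 = sum(abs(a - b) for a, b in zip(cur, prev))
            row.append(l1 >= tau)
        keep.append(row)
    return keep


def kept_fraction(keep: List[List[bool]]) -> float:
    """Fraction of latent positions retained across the whole clip."""
    total = sum(len(row) for row in keep)
    return sum(sum(row) for row in keep) / total


# Toy clip: 3 frames, 2 positions, 2 channels.
# Position 0 is (almost) static; position 1 changes in every frame.
clip = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 1.5]],
    [[0.0, 0.01], [3.0, 0.5]],
]
mask = redundancy_mask(clip, tau=0.1)
```

With a fixed `tau`, the retained fraction depends only on how much the content moves, which is the sense in which the compression rate "emerges" from the input rather than being set in advance.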

29. 【2606.06155】AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

Link: https://arxiv.org/abs/2606.06155

Authors: Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: rich world knowledge, pretrained vision-language models, enable instruction-following robotic, leverage the rich, rich world

Notes: Preprint. Code and project page are available. Code: [this https URL](https://github.com/Skywalker-yqz/AffordanceVLA) Project page: [this https URL](https://skywalker-yqz.github.io/AffordanceVLA/)

Abstract:Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

30. 【2606.06142】Computation-Aware Event-to-Frame Reconstruction via Selective Attention

Link: https://arxiv.org/abs/2606.06142

Authors: Jingqian Wu, Yunbo Jia, Edmund Y. Lam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frame-based vision pipelines, bridges asynchronous event, asynchronous event streams, reconstruction bridges asynchronous, vision pipelines

Abstract:Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.

31. 【2606.06120】Diff-CA: Separating Common and Salient Factors with Diffusion Models

Link: https://arxiv.org/abs/2606.06120

Authors: Michaël Soumm, Alexandre Fournier Montgieux, Yunlong He, Pietro Gori, Alasdair Newson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Contrastive Analysis aims, Analysis aims, Contrastive Analysis, aims to separate, data distributions

Abstract:Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.

32. 【2606.06113】Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Link: https://arxiv.org/abs/2606.06113

Authors: Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating increasingly photorealistic, increasingly photorealistic images, structurally complex failures, generating increasingly, increasingly photorealistic

Notes: 25 pages, 9 figures

Abstract:Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
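
The set-of-tuples representation described in this abstract can be pictured with a small data-structure sketch. The field names, the box format, the `[0, 1]` importance range, and the summary score `weighted_defect_area` are all assumptions made for illustration. In particular, the score is not the BoxFlow-GRPO reward, whose form the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class Defect:
    """One element of a predicted defect set."""
    box: Box           # where the defect occurs
    kind: str          # what type of defect it is
    reason: str        # why it is defective
    importance: float  # assumed to lie in [0, 1]


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def weighted_defect_area(defects: List[Defect], width: float, height: float) -> float:
    """Hypothetical summary: importance-weighted share of the image covered by defect boxes."""
    image_area = width * height
    return sum(d.importance * box_area(d.box) / image_area for d in defects)


# A set with two defects; a clean image would simply be an empty list.
defects = [
    Defect((0.0, 0.0, 50.0, 50.0), "anatomy", "hand has six fingers", 0.8),
    Defect((50.0, 50.0, 100.0, 100.0), "text", "sign lettering is garbled", 0.25),
]
score = weighted_defect_area(defects, width=100.0, height=100.0)
```

Because the prediction is a list, the number of defects varies per image and each reason stays attached to its own box, which is the property the abstract contrasts with heatmap-style outputs.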

33. 【2606.06103】MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models

Link: https://arxiv.org/abs/2606.06103

Authors: Tariq M. Khan, Syed Saud Naqvi, Thantrira Porntaveetus, Hamid Alinejad-Rokny, Shahzaib Iqbal, Imran Razzak, Mohammad AU Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dataset Knowledge Card, Medical Segmentation Dataset, stronger architectures, fundamental question, search for stronger

Abstract:Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate.

34. 【2606.06100】HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

Link: https://arxiv.org/abs/2606.06100

Authors: Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding inter-object relationships, Vision-Language Models, requires understanding inter-object, inter-object relationships, understanding inter-object

Abstract:Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58pp over Euclidean), with entailment loss $\sim 6\times$ higher in Euclidean training. Codes are available at TBA.
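
For readers unfamiliar with the Lorentz model, the following sketch shows the standard textbook construction of a point on the hyperboloid and the geodesic distance between two such points. It assumes the convention that curvature is $-\kappa$ with $\kappa > 0$, so points satisfy $\langle p, p \rangle_L = -1/\kappa$. The abstract does not state which projection or convention HyperVis uses, so the function names and the lifting step here are illustrative only.

```python
import math
from typing import List


def lift_to_hyperboloid(x: List[float], kappa: float) -> List[float]:
    """Lift a Euclidean vector x to the hyperboloid <p, p>_L = -1/kappa.

    The first ("time") coordinate is chosen so that the constraint holds.
    """
    time = math.sqrt(1.0 / kappa + sum(v * v for v in x))
    return [time] + list(x)


def lorentz_inner(p: List[float], q: List[float]) -> float:
    """Lorentzian inner product: -p0*q0 + sum_i p_i*q_i."""
    return -p[0] * q[0] + sum(a * b for a, b in zip(p[1:], q[1:]))


def lorentz_distance(p: List[float], q: List[float], kappa: float) -> float:
    """Geodesic distance on the hyperboloid of curvature -kappa."""
    arg = max(1.0, -kappa * lorentz_inner(p, q))  # clamp against rounding below 1
    return math.acosh(arg) / math.sqrt(kappa)


kappa = 4.0  # the curvature value reported in the abstract, under the assumed convention
origin = lift_to_hyperboloid([0.0, 0.0], kappa)
point = lift_to_hyperboloid([0.3, -0.4], kappa)
d = lorentz_distance(origin, point, kappa)
```

Distances grow logarithmically in the Euclidean norm far from the origin, which is why volume in this space grows exponentially with radius, the property the abstract appeals to.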

35. 【2606.06078】Knowledge Distillation for Visual Autoregressive Models

Link: https://arxiv.org/abs/2606.06078

Authors: Elia Peruzzo, Aritra Bhowmik, Guillaume Sautiere, Yuki M Asano, Amirhossein Habibian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: motivating effective model, computationally intensive, motivating effective, effective model compression, highly expressive

Abstract:Autoregressive (AR) image generation models are highly expressive but computationally intensive, motivating effective model compression. Knowledge distillation (KD) is a natural approach for model compression and has been widely studied in language modeling, yet its behavior in visual AR generation remains underexplored. In this work, we present the first systematic study of distillation strategies for AR image models. Our analysis shows that while standard distillation can yield meaningful gains, recent methods developed for language do not directly transfer to images: long decoding horizons and visual token ambiguity make teacher supervision unreliable especially under student-conditioned contexts. To address this, we propose VarKD, a distillation framework for visual autoregressive models that distills on student samples while selectively applying teacher supervision and reducing token-level ambiguity. Experiments on ImageNet across multiple AR backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.

36. 【2606.06076】Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Link: https://arxiv.org/abs/2606.06076

Authors: Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: general multimodal understanding, vision-language models excel, multimodal understanding, excel at general, general multimodal

Notes: 17 pages, preprint

Abstract:While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at this https URL.

37. 【2606.06074】VZCrash: A Large-Scale IMU Dataset of Ego-Vehicle Crashes

Link: https://arxiv.org/abs/2606.06074

Authors: Tommaso Bianconcini, Henrique Piñeiro Monteagudo, Aurel Pjetri, Tomaso Trinci, Leonardo Taccari

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Inertial Measurement Unit, featuring Inertial Measurement, Measurement Unit, Inertial Measurement, data featuring Inertial

Notes: Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). VZCrash is publicly available at this URL: [this https URL](https://huggingface.co/datasets/vzc-research-chapter/VZCrash)

Abstract:We introduce VZCrash, the largest publicly available dataset of real-world vehicle collision data featuring Inertial Measurement Unit (IMU) telemetry. The dataset contains more than 31,000 validated crashes and 158,000 negative samples, including hard cases and distractors. Each sample includes acceleration and angular velocity at 100 Hz, and GPS speed at 1 Hz. Events in VZCrash were captured by devices installed on a fleet of 73,010 commercial vehicles of different sizes driving in the United States over the span of several years. We also present an extensive experimental study enabled by the volume of the dataset. We first benchmark several different approaches, from a simple threshold-based heuristic to state-of-the-art deep learning models. Then, we present an experiment demonstrating the importance of scaling data to train high-quality crash detection models, and we show that scale is especially important when these models need to be deployed into a real-world environment.

38. 【2606.06066】FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning

Link: https://arxiv.org/abs/2606.06066

Authors: Marian Lupascu, Nipun Jindal, Ionut Mironica, Zhaowen Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: degrades text legibility, sacrifices typographic fidelity, control typically degrades, typically degrades text, enabling precise font

Notes: 12 pages, 8 figures, accepted at ICANN 2026

Abstract:Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains exceeding approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.

39. 【2606.06060】ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE

Link: https://arxiv.org/abs/2606.06060

Authors: Mishan Aliev, Eva Neudachina, Ilya Bykov, Aleksandr Oganov, Kirill Struminsky, Aibek Alanov, Denis Rakitin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern diffusion models, models generate high-quality, generate high-quality images, iterative denoising process, denoising process makes

Abstract:Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a $\times5.04$ FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a $\sim \times2.6$ speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at this https URL.

40. 【2606.06048】LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations

Link: https://arxiv.org/abs/2606.06048

Authors: Mritula Chandrasekaran, Sanket Kachole, Jarik Francik, Dimitrios Makris

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: datasets remain scarce, remain scarce due, gait datasets remain, due to privacy, movement variability

Notes: Accepted at CVPR MOMA Workshop 2026 and selected for spotlight presentation at the workshop

Abstract:Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.

41. 【2606.06042】LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Link: https://arxiv.org/abs/2606.06042

Authors: Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Hao Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenging frontier field, Developing unified video, interpreting interleaved multimodal, interleaved multimodal inputs, Developing unified

Abstract:Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
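
The cost argument in this abstract (doubling the sequence length quadruples self-attention cost) and the Scale-and-Add idea can be made concrete with a short sketch. The purely quadratic cost model, the scale factor `alpha`, the flat-list latent layout, and both function names are assumptions for illustration. The abstract does not give the actual scaling rule.

```python
from typing import List


def attention_cost(seq_len: int) -> int:
    """Number of pairwise token interactions in self-attention, proportional to seq_len**2."""
    return seq_len * seq_len


def scale_and_add(noised_target: List[float], source: List[float], alpha: float) -> List[float]:
    """Add a scaled clean source latent to the noised target latent, elementwise.

    The result has the same length as the target, so the token count is unchanged.
    alpha is a hypothetical scale factor.
    """
    if len(noised_target) != len(source):
        raise ValueError("source and target latents must have the same shape")
    return [z + alpha * s for z, s in zip(noised_target, source)]


n = 1024  # illustrative number of target tokens
concat_ratio = attention_cost(2 * n) / attention_cost(n)  # concatenation uses 2n tokens

added = scale_and_add([0.5, -1.0, 2.0], [1.0, 1.0, -2.0], alpha=0.5)
```

Under this cost model, concatenating source and target tokens gives a ratio of 4, whereas the additive form keeps the sequence at `n` tokens and so leaves the attention cost unchanged.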

42. 【2606.06039】Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction

Link: https://arxiv.org/abs/2606.06039

Authors: Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu, Fenglin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cone-beam computed tomography, truncated cone-beam computed, Cone-beam computed, computed tomography, field of view

Comments

Click to view abstract

Abstract:Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach bypasses traditional filtering and backprojection operations, thereby eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.

43. 【2606.06020】ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition

Link: https://arxiv.org/abs/2606.06020

Authors: Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pedestrian Attribute Recognition, explore image synthesis, Attribute Recognition, diffusion models guided, address the limited

Comments: Under review at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Click to view abstract

Abstract:To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements, achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This demonstrates its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at this http URL.
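
As a rough illustration of the final step, the sketch below turns a continuous alignment score into a binary pseudo-label with Bayes' rule. The abstract does not specify the form of the classifier, so everything here is an assumption made for the example: Gaussian class-conditional densities, the means and standard deviations, the prior, the 0.5 decision threshold, and the function names.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_present(score, prior_present, pos=(0.30, 0.05), neg=(0.20, 0.05)):
    """P(attribute present | score), with assumed (mean, std) per class."""
    like_pos = gaussian_pdf(score, *pos) * prior_present
    like_neg = gaussian_pdf(score, *neg) * (1.0 - prior_present)
    return like_pos / (like_pos + like_neg)


def pseudo_label(score, prior_present=0.5, threshold=0.5):
    """Binary pseudo-label: 1 if the posterior reaches the threshold, else 0."""
    return int(posterior_present(score, prior_present) >= threshold)


print(pseudo_label(0.32))  # 1
print(pseudo_label(0.18))  # 0
```

In practice the class-conditional parameters and the prior would be estimated from labelled data rather than fixed by hand as they are here.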

44. 【2606.06002】Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

Link: https://arxiv.org/abs/2606.06002

Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models

Comments

Click to view abstract

Abstract:Large Vision-Language Models have achieved significant reasoning performance in various [...]. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error [...]. In this paper, we consider the task as a planning problem constrained by spatial and layout constraints. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making [...]. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer [...]. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement [...]. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. Since existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.
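
The abstract describes pruning branches with a PRM and using MCTS to balance exploration and exploitation. A minimal sketch of one selection step follows, assuming the standard UCT rule and a simple score threshold for pruning. The threshold, the exploration constant `c`, the dictionary fields, and the candidate placements are all hypothetical; the paper's actual search procedure may differ.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, prune_below=0.2):
    """Drop children whose (assumed) PRM score is too low, then pick by UCT."""
    kept = [ch for ch in children if ch["prm"] >= prune_below]
    if not kept:
        return None
    return max(kept, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))


# Hypothetical candidate placements for one object.
children = [
    {"name": "sofa_by_wall", "prm": 0.9, "value": 6.0, "visits": 10},
    {"name": "sofa_in_doorway", "prm": 0.1, "value": 0.0, "visits": 0},
    {"name": "sofa_by_window", "prm": 0.7, "value": 2.0, "visits": 2},
]
best = select_child(children, parent_visits=12)
print(best["name"])  # sofa_by_window
```

Without pruning, the unvisited `sofa_in_doorway` candidate would be selected first because its UCT score is infinite; the threshold removes it before any rollout is spent on it.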

45. 【2606.05999】ATT-CR: Adaptive Triangular Transformer for Cloud Removal

Link: https://arxiv.org/abs/2606.05999

Authors: Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ground objects obscured, remote sensing images, Cloud removal aims, aims to accurately, accurately reconstruct

Comments

Click to view abstract

Abstract:Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

46. 【2606.05998】Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

Link: https://arxiv.org/abs/2606.05998

Authors: Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: stages in dentistry, impression taking, essential stages, notable limitations, patient oral cavity

Comments: 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

Click to view abstract

Abstract:Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
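
The accuracy figure above is described only as nearest-neighbor matching with a distance threshold of 0.035. One plausible reading, sketched below, counts the fraction of predicted vertices whose nearest ground-truth vertex lies within the threshold. The direction of matching (prediction to ground truth), the toy coordinates, and the function name `nn_accuracy` are assumptions; the paper may define the metric differently.

```python
import math


def nn_accuracy(predicted, ground_truth, threshold=0.035):
    """Fraction of predicted points within `threshold` of their nearest ground-truth point."""
    hits = 0
    for p in predicted:
        nearest = min(math.dist(p, g) for g in ground_truth)
        if nearest <= threshold:
            hits += 1
    return hits / len(predicted)


# Toy point sets, not data from the paper.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.01, 0.0, 0.0), (1.0, 0.02, 0.0), (0.5, 0.5, 0.5), (0.0, 0.03, 0.0)]
acc = nn_accuracy(pred, gt)
print(acc)  # 0.75
```

A one-directional measure like this does not penalise predictions that pile up around a few ground-truth vertices, which may be relevant to the uneven point distribution the abstract reports.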

47. 【2606.05997】Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

Link: https://arxiv.org/abs/2606.05997

Authors: Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Lab at CLEF, present the AILS-NTUA, AILS-NTUA submission, addressing multimodal sexism, Task

Comments

Click to view abstract

Abstract:We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.

48. 【2606.05981】Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

Link: https://arxiv.org/abs/2606.05981

Authors: Yoshiyuki Ootani

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: diffusion U-Net inverts, Aggressive distillation, MLLM text encoder, bottleneck of real-time, critical path

Comments: 12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at [this https URL](https://github.com/otanl/dreamlite-stream) (also mirrored to Hugging Face and Zenodo)

Click to view abstract

Abstract:Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.

49. 【2606.05975】T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

Link: https://arxiv.org/abs/2606.05975

Authors: Jingkun Feng, Reza Sabzevari

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: segmentation enables robots, localize functional object, functionality segmentation enables, enables robots, robots to localize

Comments

Click to view abstract

Abstract:Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components using a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to the state of the art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

50. 【2606.05949】Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

Link: https://arxiv.org/abs/2606.05949

Authors: Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visualize complex concepts, communicating research findings, scientific illustration generation, natural science, concepts and processes

Comments

Click to view abstract

Abstract:Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released.

51. 【2606.05931】To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Link: https://arxiv.org/abs/2606.05931

Authors: Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Keywords: voice and face, retrieving a person, unlike curated benchmarks, real-world broadcast archives

Comments: INTERSPEECH 2026

Click to view abstract

Abstract:When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
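
The "64% of the gap" figure can be reproduced from the P@1 numbers quoted above. The abstract does not state which system the gap is measured from; the sketch assumes the baseline is fixed fusion (90.0%), which is the choice that matches the reported value. The function name `gap_recovered` is illustrative.

```python
def gap_recovered(adaptive, baseline, oracle):
    """Share of the baseline-to-oracle gap closed by the adaptive system."""
    return (adaptive - baseline) / (oracle - baseline)


# P@1 values from the abstract, in percent.
vs_fixed_fusion = gap_recovered(adaptive=94.2, baseline=90.0, oracle=96.6)
print(round(vs_fixed_fusion * 100))  # 64
```

Measured from the face-only system (93.4%) instead, the same formula gives 25%, so the choice of baseline matters when reading this number.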

52. 【2606.05917】MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

Link: https://arxiv.org/abs/2606.05917

Authors: Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Vision-Language Models, lengthy video contexts, answering remains challenging, remains challenging, challenging for Vision-Language

Comments: 21 pages, 8 figures

Click to view abstract

Abstract:Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at this https URL.

53. 【2606.05916】Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs

Link: https://arxiv.org/abs/2606.05916

Authors: Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu, Chong Wang, Jiangbo Qian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-vocabulary object detection, Open-vocabulary object, training data, seeks to identify, object detection seeks

Comments

Click to view abstract

Abstract:Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.

54. 【2606.05915】CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications

Link: https://arxiv.org/abs/2606.05915

Authors: Haipeng Li, Zhen Liu, Zhanglei Yang, Hai Jiang, Tianhao Zhou, Zhengzhe Liu, Ping Tan, Bing Zeng, Shuaicheng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computational photography, fundamental to computer, computer vision, vision and computational, Estimating

Comments

Click to view abstract

Abstract:Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: this https URL.
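
To make the idea of motion bases in dense-flow space concrete, the sketch below derives per-pixel flow from a homography and then forms a weighted sum of two such flow fields. Treating the final motion as a linear combination of bases is an assumption about what "hybrid bases" means here; the example homographies, weights, and function names are hypothetical, and the stochastic and depth-translational bases from the abstract are not modelled.

```python
def homography_flow(h, x, y):
    """Displacement of pixel (x, y) under a 3x3 homography given as nested lists."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w - x, v / w - y)


def combine_bases(bases, weights):
    """Weighted sum of flow bases; each basis is a list of (dx, dy) per pixel."""
    num_pixels = len(bases[0])
    return [
        (sum(wt * b[i][0] for b, wt in zip(bases, weights)),
         sum(wt * b[i][1] for b, wt in zip(bases, weights)))
        for i in range(num_pixels)
    ]


pixels = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
shift_x = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # translate x by 1
scale_2 = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]  # zoom by 2 about the origin

bases = [[homography_flow(h, x, y) for x, y in pixels] for h in (shift_x, scale_2)]
flow = combine_bases(bases, weights=[0.5, 0.25])
print(flow)  # [(0.5, 0.0), (1.0, 0.0), (0.5, 0.5)]
```

The combined field is no longer the flow of any single homography in general, which is one way a basis representation can go beyond a single-plane model.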

55. 【2606.05912】Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars

Link: https://arxiv.org/abs/2606.05912

Authors: Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modeling dynamic facial, representations remains challenging, remains challenging due, Gaussian representations remains, Modeling dynamic

Comments

Click to view abstract

Abstract:Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.

56. 【2606.05896】Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Link: https://arxiv.org/abs/2606.05896

Authors: Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Creating lifelike digital, lifelike digital humans, intelligence requires unifying, requires unifying cognitive, Creating lifelike

Comments

Click to view abstract

Abstract:Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.

57. 【2606.05883】Geometry-Aware Dataset Condensation for Diffusion Model Training

Link: https://arxiv.org/abs/2606.05883

Authors: Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dataset condensation aims, construct compact datasets, Dataset condensation, condensation aims, aims to construct

Comments: ICML 2026

Click to view abstract

Abstract:Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at this https URL.

58. 【2606.05873】LadderMan: Learning Humanoid Perceptive Ladder Climbing

Link: https://arxiv.org/abs/2606.05873

Authors: Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: complex whole-body coordination, hold great promise, Humanoid robots hold, robots hold great, Humanoid robots

Comments

Click to view abstract

Abstract:Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at this https URL.

59. 【2606.05872】Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

Link: https://arxiv.org/abs/2606.05872

Authors: Olasimbo Ayodeji Arigbabu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: commonly evaluated, entropy, task success, agent, agent behavior

Comments: 6 pages, 2 tables

Abstract:AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.
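As a rough illustration of the action-entropy idea, the sketch below computes Shannon entropy over the actions recorded in a single agent trace. It is not the paper's implementation: the function names and the example traces are hypothetical, and the abstract does not say how EEA normalizes or aggregates its metrics.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(items):
    """Entropy divided by its maximum, log2(number of distinct items)."""
    distinct = len(set(items))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(items) / math.log2(distinct)


# Hypothetical traces: one agent repeats a single action, one mixes three.
rigid_trace = ["search"] * 8
mixed_trace = ["search", "read", "search", "write", "read", "search", "write", "read"]

print("rigid:", shannon_entropy(rigid_trace))
print("mixed:", shannon_entropy(mixed_trace), normalized_entropy(mixed_trace))
```

A trace that repeats one action scores 0 bits, while a trace spread evenly over k distinct actions scores log2(k) bits, which gives a simple scale for "too rigid" versus "too exploratory".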

60. 【2606.05833】Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.05833

Authors: Haibo Wang, Lifu Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, lack intrinsic

Abstract:Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

61. 【2606.05829】Gender Artifacts from Art History to Text-to-Image Generation

Link: https://arxiv.org/abs/2606.05829

Authors: Piera Riccio, Miriam Doh, Benedikt Höltgen, Noa Garcia, Nanne van Noord

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: encode social hierarchies, including distinct constructions, specific socio-historical contexts, social hierarchies, including distinct

Abstract:Artistic styles are rooted in specific socio-historical contexts that encode social hierarchies, including distinct constructions of gender. Yet in AI research, style has long been treated as a surface-level visual property: a filter of color, brushstroke, and texture applied to otherwise content-neutral scenes. We introduce the first dataset to investigate the interplay between gender representation and style in both historical and generated images. StyleGender comprises 74k images spanning 19 artistic styles, organized into three parts: art-historical images with style and gender annotations, T2I-generated images under controlled style and gender prompts, and a semantically aligned set enabling direct art history-to-generation comparison. By proposing two Set Gender Artifact (SGA) metrics (PixelSGA and MaskSGA), capturing gender signals at the pixel level and in compositional structure, we show that (1) gender representation shapes visual features across artistic styles, (2) style keywords carry these patterns into T2I generation, and (3) generative models tend to amplify gender artifacts beyond what is observed in historical sources.

62. 【2606.05816】Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning

Link: https://arxiv.org/abs/2606.05816

Authors: Jihun Cho, Soo-Yeon Jeong, Sun-Young Ihm

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: contextual emotional understanding, visual object-related patterns, effectively capture sentiment, models cannot effectively, types of text

Abstract:T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and Stable Diffusion 3.5 Medium fine-tuned with LoRA on children's drawing images with emotion-based trigger words for image generation. Additionally, this paper presents experiments examining the effect of emotion trigger words on generated images and discusses the limitations of CLIP Score as an evaluation metric for emotion-aware image generation.

63. 【2606.05785】Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation

Link: https://arxiv.org/abs/2606.05785

Authors: Shawaiz Obaid, Nida Chandio, Neha Jamil, Muhammad Khuram Shahzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: modern smart cities, License Plate Detection, forms the backbone, smart cities, Plate Detection

Comments: 8 pages, 7 figures

Abstract:Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.

64. 【2606.05778】Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

Link: https://arxiv.org/abs/2606.05778

Authors: Qifei Jia, Xintong Yao, Minghao Li, Yajie Chai, Qiming Lu, Baoyue Shen, Yasen Zhang, Runyu Shi, Ying Huang, Yue Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional Image Aesthetic, Image Aesthetic Assessment, Opinion Scores, Aesthetic Assessment, Traditional Image

Abstract:Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.

65. 【2606.05774】LiAuto-GeoX: Efficient Grounded Driving Transformer

Link: https://arxiv.org/abs/2606.05774

Authors: Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu, Siyuan Wang, Le Hui, Ning Mao, Tao Wei, Pan Zhou, Kun Zhan, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated immense potential, open challenge, demonstrated immense, immense potential, remains an open

Abstract:Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. Together, these results demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.

66. 【2606.05769】Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Link: https://arxiv.org/abs/2606.05769

Authors: Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: infer unobserved future, partial video evidence, requires models, models to infer, infer unobserved

Comments: https://github.com/OpenGVLab/Future-L1

Abstract:Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.

67. 【2606.05760】ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection

Link: https://arxiv.org/abs/2606.05760

Authors: Ruchika Sharma, Rudresh Dwivedi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recurrent Neural Network, online content, increasingly challenging, challenging the credibility, credibility of online

Abstract:Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. This study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which utilizes SqueezeNet and RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, precision of 99.3%, and F-measure of 96.8%, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.

68. 【2606.05759】Physics-Guided Deep Unfolding for Blind Cross-Sensor Spectral Super-Resolution via Learning the Spectral Transformation Function

Link: https://arxiv.org/abs/2606.05759

Authors: Zhaolin Li, Jinsong Chen, Shanxin Guo, Tuo Zhang, Xinglong Zhang, Pan Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quantitative remote sensing, rich spectral information, hyperspectral sensors remain, sensors remain costly, reconstruct hyperspectral images

Abstract:Hyperspectral imaging provides rich spectral information for quantitative remote sensing, yet hyperspectral sensors remain costly and thus unavailable in many UAV deployments. Spectral super-resolution (SSR) seeks to reconstruct hyperspectral images (HSIs) from multispectral images (MSIs). Most existing SSR methods assume a fixed and known spectral response function (SRF) and are therefore limited to single-sensor settings. In practical cross-sensor scenarios, the spectral degradation from HSI to MSI is unknown and varies with sensor characteristics and scene content, which renders HSI reconstruction ill-posed. This paper proposes a physics-guided deep unfolding network, termed PGU-Net, to address blind cross-sensor SSR by jointly estimating the HSI and a learnable spectral transformation function (STF). PGU-Net unrolls an alternating optimization procedure into an end-to-end trainable architecture with stages, where each stage sequentially updates the HSI and the STF. Both modules combine learnable proximal networks with differentiable closed-form solvers, enabling physical interpretability while retaining strong representation capacity. Experiments on benchmark datasets (CAVE and NTIRE 2022) with multiple SRFs demonstrate accurate recovery of the STF (degradation operator) and improved reconstruction performance over state-of-the-art SSR methods. Furthermore, evaluations on a real UAV cross-sensor dataset (Headwall Nano HSI and DJI P4 Multispectral MSI) verify the effectiveness and robustness of PGU-Net under truly blind conditions, and suggest that the estimated STF may exhibit land-cover-related differences.

69. 【2606.05758】DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Link: https://arxiv.org/abs/2606.05758

Authors: Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: modern vision-language models, vision-language models, build on autoregressive, discrete tokens, modern vision-language

Abstract:Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.

70. 【2606.05753】Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Link: https://arxiv.org/abs/2606.05753

Authors: XiuYu Zhang, Junfeng Fang, Zhenkai Liang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Latent visual reasoning, inserts supervised latent, supervised latent tokens, tokens between perception, generation in vision-language

Abstract:Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.

71. 【2606.05737】Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.05737

Authors: Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: VLA action generation, VLA action, image-generation view, iterative denoising, inherit the image-generation

Comments: 20 pages, 10 figures

Abstract:Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
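The recipe described in this abstract (standard velocity prediction, with training times biased toward high-noise states) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the Beta(1, b) sampler, the convention that t = 0 is pure noise and t = 1 is the clean action, and every function name are assumptions, since the abstract does not state the exact time distribution.

```python
import random


def sample_time(rng, high_noise_bias=3.0):
    """Draw t in [0, 1]; Beta(1, b) with b > 1 puts more mass near t = 0.

    Assumed convention: t = 0 is pure noise, t = 1 is the clean action.
    Setting high_noise_bias to 1.0 recovers the uniform distribution.
    """
    return rng.betavariate(1.0, high_noise_bias)


def training_example(action, rng, high_noise_bias=3.0):
    """Build one velocity-prediction training tuple (x_t, t, target)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = sample_time(rng, high_noise_bias)
    # Linear interpolation between noise and the clean action.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Standard velocity target for this interpolation path.
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def one_step_decode(velocity_fn, noise):
    """Single Euler step from t = 0 to t = 1 using the predicted velocity."""
    velocity = velocity_fn(noise, 0.0)
    return [n + v for n, v in zip(noise, velocity)]


rng = random.Random(0)
action = [0.2, -0.5, 1.0]  # hypothetical low-dimensional action chunk
x_t, t, target = training_example(action, rng)
print(t, x_t, target)
```

In a real policy, `velocity_fn` would be the trained network conditioned on observations and language; here it is left as a plain callable so the sampling and decoding steps can be inspected in isolation.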

72. 【2606.05736】VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

Link: https://arxiv.org/abs/2606.05736

Authors: Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understand complex temporal, complex temporal events, aims to understand, understand complex, complex temporal

Comments: 25 pages, 7 figures

Abstract:Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.

73. 【2606.05730】TextWand: A Unified Framework for Scene Text Editing

Link: https://arxiv.org/abs/2606.05730

Authors: Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang, Jian Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scene text removal, unifies scene text, framework that unifies, Overlay-Reference Positional Encoding, scene text

Abstract:We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

74. 【2606.05718】ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Link: https://arxiv.org/abs/2606.05718

Authors: Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: trajectories sampled, teacher, OPD, reasoning, visually grounded reasoning

Comments: 25 pages, 11 figures. Preprint, under review

Abstract:On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.

75. 【2606.05708】Real-Time Threat Detection from Surveillance Cameras using Machine Learning

Link: https://arxiv.org/abs/2606.05708

Authors: Gajendra Mandal, J. P. Patra, Priyansh Mahant

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ensuring public safety, densely populated urban, automated video surveillance, populated urban environments, urban environments remains

Abstract:Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.

76. 【2606.05703】Parallel Jacobi Decoding for Fast Autoregressive Image Generation

Link: https://arxiv.org/abs/2606.05703

Authors: Boya Liao, Ying Li, Siyong Jian, Huan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable performance, generating high-fidelity images, demonstrated remarkable, remarkable performance, performance in generating

Comments: Accepted by CVPR 2026

Abstract:Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
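For readers unfamiliar with Jacobi-style decoding, the toy sketch below shows the basic one-dimensional fixed-point iteration that this line of work builds on. It does not implement PJD's two-dimensional draft expansion or its attention-mask adjustment, and `next_token` is a hypothetical deterministic stand-in for a model's greedy prediction.

```python
def next_token(prefix):
    """Toy deterministic 'model' (hypothetical): maps a prefix to one token id."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 17


def greedy_decode(prompt, n):
    """Sequential baseline: n dependent calls, one per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init_token=0):
    """Refine a length-n draft until it stops changing (a fixed point).

    Each sweep recomputes every position from the previous draft, so the n
    calls inside a sweep are independent and could run as one batched pass.
    """
    draft = [init_token] * n
    sweeps = 0
    while True:
        sweeps += 1
        updated = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if updated == draft:
            return draft, sweeps
        draft = updated


prompt = [1, 2, 3]
tokens, sweeps = jacobi_decode(prompt, 12)
print(tokens == greedy_decode(prompt, 12), sweeps)
```

Position i becomes correct once positions 0 to i-1 are correct, so at least one more token is fixed per sweep and the draft matches greedy decoding after at most n sweeps, plus one sweep to confirm. Speedups come from drafts that converge in far fewer sweeps than that bound.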

77. 【2606.05702】Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Link: https://arxiv.org/abs/2606.05702

Authors: Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex visual semantics, interpret complex visual, Recent advancements, reasoning remains under-explored, visual semantics

Abstract:Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color, rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at this https URL.

78. 【2606.05700】T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction

Link: https://arxiv.org/abs/2606.05700

Authors: Kerod Woldesenbet, Abem Woldesenbet

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: SAR amplitude stacks, self-supervised framework, SAR amplitude, temporal anomaly detection, gradient feature prediction

Comments: Won IEEE GRSS Data Fusion Contest 2026; to appear in IGARSS 2026 proceedings

Abstract:We present T-SAR-JEPA, a self-supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT-Base/16 encoder from SAR-JEPA is domain-adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo-ground-truth. On the DFC 2026 dataset (300 time-series, three AOIs), T-SAR-JEPA achieves ROC-AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p < 0.001, permutation test) confirms structured detections. Code: this https URL
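
The sinusoidal time encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard Transformer-style sin/cos formulation applied to scalar acquisition times; the function name, the `max_period` constant, and the acquisition offsets below are hypothetical and are not taken from the paper.

```python
import numpy as np

def sinusoidal_time_encoding(times, dim, max_period=10000.0):
    """Map scalar acquisition times to `dim`-dimensional sin/cos features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    times = np.asarray(times, dtype=np.float64)[:, None]       # shape (K, 1)
    # Geometrically spaced frequencies, as in the original Transformer encoding.
    freqs = max_period ** (-np.arange(dim // 2) / (dim // 2))   # shape (dim/2,)
    angles = times * freqs[None, :]                             # shape (K, dim/2)
    # First half of each vector holds sines, second half cosines.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Hypothetical acquisition offsets (in days) for K=7 acquisitions.
days = [0, 11, 22, 40, 51, 62, 80]
enc = sinusoidal_time_encoding(days, dim=16)
```

Because the encoding depends on the actual time value rather than the index of the acquisition, irregularly spaced acquisitions receive distinct, smoothly varying features.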

79. 【2606.05677】LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Link: https://arxiv.org/abs/2606.05677

Authors: Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, longer visual inputs

Abstract:Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

80. 【2606.05675】Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

Link: https://arxiv.org/abs/2606.05675

Authors: Hongye Xu, Bartosz Krawczyk

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: erasing prior knowledge, Continual learning, seeks models, prior knowledge, models that acquire

Comments: Published as a conference paper at ICLR 2026. 23 pages, 8 figures. Code: [this https URL](https://github.com/HXuSz11/BiCyc_ICLR2026)

Abstract:Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; therefore, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce BiCyc, a bidirectional projector alignment approach with a cycle-consistency objective. BiCyc jointly optimizes two maps, old-to-new and new-to-old, with stop-gradient gating so that transport and representation co-evolve. Analytically, we show that the cycle loss contracts the singular spectrum toward unity in whitened space, and that improved transport of class means and covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, BiCyc substantially reduces forgetting and improves accuracy in from-scratch settings, while remaining competitive in the pretrained fine-grained regime.
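
To make the idea of a cycle-consistency objective over two maps concrete, here is a minimal sketch. It assumes both maps are plain linear matrices acting on row-vector features and omits the stop-gradient gating entirely; the function and variable names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def cycle_consistency_loss(feats_old, feats_new, W_o2n, W_n2o):
    """Mean squared round-trip error for old->new->old and new->old->new."""
    old_cycle = feats_old @ W_o2n @ W_n2o   # old space -> new space -> old space
    new_cycle = feats_new @ W_n2o @ W_o2n   # new space -> old space -> new space
    loss_old = np.mean((old_cycle - feats_old) ** 2)
    loss_new = np.mean((new_cycle - feats_new) ** 2)
    return float(loss_old + loss_new)

rng = np.random.default_rng(0)
d = 4
feats_old = rng.normal(size=(8, d))
feats_new = rng.normal(size=(8, d))
W_o2n = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # hypothetical old-to-new map

# A perfectly cycle-consistent pair: the reverse map is the exact inverse.
loss_consistent = cycle_consistency_loss(feats_old, feats_new, W_o2n, np.linalg.inv(W_o2n))
# An inconsistent pair: the reverse map ignores the forward map.
loss_inconsistent = cycle_consistency_loss(feats_old, feats_new, W_o2n, np.eye(d))
```

The loss vanishes only when the two maps invert each other on the observed features, which is the property the abstract describes as removing cycle inconsistencies.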

81. 【2606.05665】V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

Link: https://arxiv.org/abs/2606.05665

Authors: Tao Liu, Leela Krishna, Gouti Pavan Kumar, Sreeja K, Vishav Garg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserve frame-level correspondence, follow editing instructions, generation is difficult, instructions and preserve, preserve frame-level

Comments: Accepted at ICML 2026 workshop

Abstract:Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging editing tasks and evaluates two commercial models, Grok Imagine and Gemini Veo3, and one open-source model, Open Sora 2. Results show complementary model strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality. On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments.
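
For readers unfamiliar with the Spearman correlation reported above, the following sketch shows how such a score can be computed: rank both score lists (ties share their average rank) and take the Pearson correlation of the ranks. The metric and human scores below are made-up illustrative values, not data from the paper.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical automatic-metric scores and human ratings for five videos.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_scores = [1, 2, 3, 5, 4]
rho = spearman(metric_scores, human_scores)
```

In this toy example the two rankings disagree on one adjacent pair, giving a correlation of 0.9.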

82. 【2606.05652】CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors

Link: https://arxiv.org/abs/2606.05652

Authors: Shengxi Li, Zhaokun Hu, Ce Zheng, Mai Xu, Jingyuan Xia, Si Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unsupervised conditional image, manually annotated labels, Unsupervised conditional, unstructured semantic representations, remains challenging due

Abstract:Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging due to unstructured semantic representations across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations, which, to the best of our knowledge, is the first successful attempt at both coarse- and fine-grained conditional generation without any labels. More specifically, we first propose the adversarial semantic reciprocal learning theory to ensure semantic consistency and completeness between images and latent spaces. Based on this consistency, we propose bit-codes to learn a structured coarse-grained latent space, and further prove that distinct global semantics are inherent in our bit-codes while preserving independent noise sampling for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models that enables layer-wise injection from coarse conditions to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that without any label priors or pre-trained feature extractors, our CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.

83. 【2606.05650】GS-NFS: Bandwidth-adaptive Streaming of Dynamic Gaussian Splats and Point Clouds

Link: https://arxiv.org/abs/2606.05650

Authors: Rajrup Ghosh, Haodong Wang, Haoran Hong, Eduardo Pavez, Amartya Chaudhuri, Weiwu Pang, Harsha V. Madhyastha, Antonio Ortega, Ramesh Govindan

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Networking and Internet Architecture (cs.NI)

Keywords: holds great promise, Gaussian Splatting, video streaming technology, holds great, scenes with high

Abstract:Dynamic 3D Gaussian Splatting (3DGS) holds great promise as a 3D video streaming technology since it can represent complex 3D scenes with high fidelity. In this approach, every frame in a 3D video represents the environment as a collection of Gaussians with position and other attributes such as scale, rotation, opacity, and color. Frames capture fine details and permit views from arbitrary perspectives, but are an order of magnitude or more larger than 2D video frames. A line of recent work has explored how to compress dynamic 3DGS frames, but these approaches are often slow, in part because their compression techniques are not amenable to efficient acceleration. GS-NFS accelerates dynamic 3DGS compression and decompression on a GPU, to the point where it can encode and decode at full frame rate. It achieves this by developing novel GPU-based parallelizations of existing algorithms for encoding both positions and attributes of Gaussians. As a result, it is 1-2 orders of magnitude faster than the state-of-the-art in encoding and decoding a frame, while offering competitive compression performance and rendering quality.

84. 【2606.05641】Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure

Link: https://arxiv.org/abs/2606.05641

Authors: Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Armstrong Aboah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurate pixel-level masks, connected crack geometry, domain shift, accurate pixel-level, geometry and confidence

Comments: 60 pages, 17 figures, 11 tables

Abstract:Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.

85. 【2606.05635】ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

Link: https://arxiv.org/abs/2606.05635

Authors: Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: aesthetically pleasing crop, composition typically produces, single aesthetically pleasing, Prior work, composing multiple shots

Abstract:Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set -- establishing, medium, and close-up -- from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.

86. 【2606.05624】KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

Link: https://arxiv.org/abs/2606.05624

Authors: Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: embodied-agent workflows rarely, workflows rarely stop, human motion models, sketched root path, synthesize plausible motions

Abstract:Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.

87. 【2606.05611】What's Under the Skin? Estimating Swine Body Condition

Link: https://arxiv.org/abs/2606.05611

Authors: Mk Bashar, Kuljit Bhatti, Gary Rohrer, Madonna Benjamin, Tami Brown-Brandl, Daniel Morris

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sow body condition, body condition, piglet survival, important indicator, indicator for growers

Abstract:Sow body condition is an important indicator for growers as it has a large impact on lactation performance and piglet survival. However, body condition measures used during production, such as visual scoring and calipers, correlate poorly with underlying tissue composition. Ultrasound scans can provide direct measurements of subcutaneous backfat thickness and loin muscle depth, but their operation is labor intensive and not scalable for production. We present PigFormer, an end-to-end two-stage system that takes raw depth frames from a ceiling-mounted RGB-D camera and predicts subcutaneous backfat thickness, loin muscle depth, and total tissue thickness at the last rib. Stage 1 is a geometric front-end that converts raw depth into a standardized height map via SAM3-to-MaskDINO segmentation distillation, ground-plane removal, and orientation normalization. Stage 2 is a Slice Attention Encoder that treats each height map as a sequence of cross-sectional slices and captures spatial relationships along the full dorsal surface. On a multi-site dataset of 319 sow and gilt instances from two facilities, PigFormer achieves 2.43 mm backfat MAE and 3.87 mm overall MAE. It outperforms strong single-stage ResNet-18 and ViT-small baselines. PigFormer offers a practical path toward continuous, automated, non-contact body condition monitoring in commercial swine production. Code is available at this https URL.

88. 【2606.05587】HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery

Link: https://arxiv.org/abs/2606.05587

Authors: Phillip Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: UAV imagery presents, presents unique challenges, imagery presents unique, Multi-object tracking, UAV imagery

Comments: 18 pages, 4 figures, 6 tables

Abstract:Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node's attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. Ablation studies confirm the independent contribution of each component.
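
The Sinkhorn head mentioned in the abstract builds on Sinkhorn normalization, which turns a matrix of association scores into a soft assignment by alternately normalizing rows and columns. The sketch below shows only that generic normalization in NumPy, for a square score matrix; it is not differentiable training code, and the scores, temperature, and function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log-sum-exp that keeps the reduced axis."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Turn a square score matrix into an approximately doubly stochastic one."""
    log_p = np.asarray(scores, dtype=np.float64) / temperature
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)   # each row sums to 1
        log_p = log_p - _logsumexp(log_p, axis=0)   # each column sums to 1
    return np.exp(log_p)

# Hypothetical similarity scores between 3 tracklets (rows) and 3 detections (columns).
scores = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.1, 0.4, 1.8]])
P = sinkhorn(scores)
matches = P.argmax(axis=1)   # most likely detection for each tracklet
```

Lower temperatures push the result toward a hard permutation but make the iteration converge more slowly, so the temperature and iteration count usually need to be tuned together.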

89. 【2606.05586】BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

Link: https://arxiv.org/abs/2606.05586

Authors: Wenlin Liu, Xikun Hu, Ping Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Convolutional Neural Networks, Convolutional Neural, Vision Transformers, sensing object detection, global context modeling

Abstract:In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31%, 73.41% and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.

90. 【2606.05581】Monte Carlo Steklov Operators for Large-Scale Geometry Processing in the Wild

Link: https://arxiv.org/abs/2606.05581

Authors: Arman Maesumi, Tanish Makadia, Aruna Anderson, Oras Phongpanangam, Justin Solomon, Daniel Ritchie

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Intrinsic methods fill, Intrinsic methods, fill the default, default toolbox, Monte Carlo method

Comments: 21 pages

Abstract:Intrinsic methods fill the default toolbox for geometry processing on meshes. Intrinsic operators, in particular the Laplacian, underlie methods that require invariance to isometry and have hence been employed in many algorithms for shape analysis, learning, and editing. However, intrinsic methods are predicated on assumptions that quickly become brittle when working with in-the-wild geometry, where (i) mesh quality is not guaranteed, and (ii) many meshes are modeled with multiple connected components. In such settings, volumetric constructions are better-defined, since restrictions on surface topology can be relaxed. This paper presents a Monte Carlo method for estimating the Dirichlet-to-Neumann (DtN) operator -- a boundary-to-boundary volumetric operator -- and its associated Steklov eigenmodes. We build on recent developments in Monte Carlo geometry processing by casting this boundary operator itself as the subject of estimation. The DtN operator, defined through a volumetric stochastic process, is then generalized to the exterior domain, where it couples disconnected components through the surrounding ambient space. We show that our method is orders of magnitude faster than existing boundary-element approaches for computing Steklov spectra while remaining robust to poor triangulations, high-resolution meshes, and multi-component geometry. To demonstrate this scalability, we compute interior and exterior Steklov eigenspectra for approximately 450,000 shapes from the uncurated Objaverse dataset. We incorporate these operators into Steklov-CLIP, a mesh-based neural network that uses volumetric spectral operators for large-scale contrastive 3D representation learning. The resulting network learns semantically meaningful global and dense shape representations, illustrating that geometrically-principled volumetric operators can be made practical at the scale of modern 3D datasets.

91. 【2606.05576】UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

Link: https://arxiv.org/abs/2606.05576

Authors: Gexin Huang, Yanting Yang, Myeongkyun Kang, Beidi Zhao, Jun Zhou, Chen Zhou, Gang Wang, Zu-hua Gao, Xiaoxiao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language models, answering and multimodal, Vision-language, evidence, multimodal reasoning benchmarks

Comments: 10 pages, 1 figure

Abstract:Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.

92. 【2606.05536】Dual Feature Decoupling for Fine-Grained OOD Detection

Link: https://arxiv.org/abs/2606.05536

Authors: Xiaokun Li, Yaping Huang, Qingji Guan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: applying machine learning, machine learning models, OOD detection, fine-grained OOD detection, real-world scenarios

Abstract:Out-of-distribution (OOD) detection is an indispensable technique when applying machine learning models to real-world scenarios. Most existing OOD detection methods have been developed under the idealized assumption of large inter-class distributional differences, while largely overlooking fine-grained tasks characterized by subtle variations, such as medical image classification and vehicle recognition. The high visual similarity among fine-grained subcategories, together with the interference of background factors, makes OOD detection extremely challenging. To tackle this problem, we propose a novel Dual Feature Decoupling Network (DFDNet), which addresses fine-grained OOD detection from the perspective of feature disentanglement. The proposed DFDNet comprises two key components: a spatial-frequency decoupling module and a reconstruction-guided decoupling module. The spatial-frequency decoupling module is designed to preserve content features that are discriminative for classification while suppressing task-irrelevant style information. The reconstruction-guided decoupling module, in turn, introduces a novel pixel-level adversarial reconstruction task to further remove low-level, non-discriminative information and enhance category-specific high-level semantic representations. Extensive experiments demonstrate that our method achieves competitive performance improvements on multiple datasets.

93. 【2606.05535】Noise-Aware Visual Representation Learning for Medical Visual Question Answering

Link: https://arxiv.org/abs/2606.05535

Authors: I Putu Adi Pratama, Bahadorreza Ofoghi, Atul Sajjanhar, Shang Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: clinically relevant queries, clinical decision support, answer clinically relevant, Medical visual question, interpret medical images

Comments: 15 pages, 2 figures. Conference submission

Abstract:Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.

94. 【2606.05533】What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

Link: https://arxiv.org/abs/2606.05533

Authors: Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: robot planning systems, planning systems rely, latent space, systems rely, latent spaces organized

Comments: Code, videos, and data available at: [this https URL](https://A4Dance-reasoning.github.io)

Abstract:Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data available at: this https URL.
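
The proximity-based inference and uncertainty-triggered discovery described above can be pictured with a small sketch. It assumes affordances are represented as prototype vectors, that proximity is cosine similarity, and that discovery is triggered when the best similarity falls below a fixed threshold; all of these, along with the affordance name "graspable" and the threshold value, are illustrative assumptions and not details from the paper.

```python
import numpy as np

def infer_affordance(embedding, prototypes, names, threshold=0.8):
    """Return (best affordance, similarity, whether discovery should be triggered)."""
    e = np.asarray(embedding, dtype=np.float64)
    p = np.asarray(prototypes, dtype=np.float64)
    e = e / np.linalg.norm(e)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = p @ e                                  # cosine similarity to each prototype
    best = int(np.argmax(sims))
    needs_discovery = bool(sims[best] < threshold)  # low proximity = high uncertainty
    return names[best], float(sims[best]), needs_discovery

# Hypothetical 2-D functional latent space with two known affordances.
names = ["movable", "graspable"]
prototypes = [[1.0, 0.0], [0.0, 1.0]]

confident = infer_affordance([0.9, 0.1], prototypes, names)
uncertain = infer_affordance([1.0, 1.0], prototypes, names)
```

An observation close to a known prototype is labeled directly, while one that sits between prototypes is flagged as a candidate for expanding the set of affordances.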

95. 【2606.05531】Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Link: https://arxiv.org/abs/2606.05531

Authors: Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: chart meaningful progress, field lacks benchmarks, human-like multimodal intelligence, true reasoning abilities, rapid progress

Comments: Accepted to ACL 2026 Findings

Click to view abstract

Abstract:Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: this https URL.

96. 【2606.05515】BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

Link: https://arxiv.org/abs/2606.05515

Authors: Muhammad Usama, Didier Stricker, Mohammad Sadil Khan, Muhammad Zeshan Afzal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: largely open problem, representation learning substrate, representation learning, open problem, largely open

Comments

Click to view abstract

Abstract:Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD, the boundary representation (BRep), which encodes exact parametric surfaces, curves, and their topology, has received little attention as a representation learning substrate. We introduce BRepCLIP, the first framework to align BRep geometry with language and image embeddings through contrastive pretraining. We model each CAD object as a sequence of face and edge tokens with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors that capture surface types (e.g., cylindrical, torus, NURBS) and curve primitives (e.g., line, arc, B-spline). A transformer encoder aggregates these tokens into a global BRep embedding, aligned with CLIP's text and image encoders via a joint contrastive objective. BRepCLIP generates more discriminative and semantically grounded embeddings than existing point-based alternatives, improving Top-1 retrieval over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and improving zero-shot classification on FabWave by 15% in Top-1 score. We further demonstrate its utility as a CAD-aware similarity metric for evaluating text- and image-conditioned CAD generation, establishing the importance of structure-aware pretraining for multimodal CAD understanding. The project page is available at this https URL.
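
To make the "joint contrastive objective" concrete, here is a minimal NumPy sketch of a symmetric InfoNCE loss between one pair of modalities. It is an illustration under assumptions, not the paper's implementation: the function name, the temperature value, and the use of a single (BRep, text) pair are all hypothetical, and the abstract does not say how the text and image terms are combined.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(brep_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (BRep, text) embedding pairs.

    Row i of brep_emb is assumed to match row i of text_emb.
    The temperature of 0.07 is an assumed value, not taken from the paper.
    """
    a = brep_emb / np.linalg.norm(brep_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (B, B) cosine similarities
    idx = np.arange(len(a))
    loss_a2b = -_log_softmax(logits, axis=1)[idx, idx].mean()  # BRep -> text
    loss_b2a = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> BRep
    return 0.5 * (loss_a2b + loss_b2a)
```

A joint objective over text and images could, for instance, sum this loss over the (BRep, text) and (BRep, image) pairs; that combination is likewise an assumption.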

97. 【2606.05506】Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

Link: https://arxiv.org/abs/2606.05506

Authors: Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sensor-guided adaptive contrastive, visual representation learning, contrastive learning framework, propose a sensor-guided, adaptive contrastive learning

Comments: 8 pages, Submitted to RAL

Click to view abstract

Abstract:We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:

98. 【2606.05491】Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

Link: https://arxiv.org/abs/2606.05491

Authors: Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: imagery enables precise, thermal imagery enables, enables precise, imagery enables, combining RGB

Comments: Accepted at ICRA 2026's Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding

Click to view abstract

Abstract:Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
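
The pose-set alignment step can be illustrated with a similarity Procrustes fit (the closed-form solution of Umeyama, 1991). This is a sketch under assumptions: it takes already-matched 3D points, such as corresponding camera centers or matched landmarks, as input, whereas the abstract does not specify how the cross-modal matcher produces correspondences. The function name is hypothetical.

```python
import numpy as np

def similarity_procrustes(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (correspondences assumed given).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))         # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)         # variance of the source points
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying the result to row-vector points is `s * points @ R.T + t`, which maps the source frame (for example, thermal poses) into the destination frame (for example, RGB poses).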

99. 【2606.05489】LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

Link: https://arxiv.org/abs/2606.05489

Authors: Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari

Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Keywords: multi-modal question answering, spanning visual search, Tree-structured Parzen Estimators, Gaussian Process Bayesian, Retrieval systems underpin

Comments: 13 pages, 5 figures, 8 tables

Click to view abstract

Abstract:Retrieval systems underpin modern AI applications -- spanning visual search, recommendation engines, and multi-modal question answering. Modern multi-stage retrieval systems require the joint optimization of highly coupled parameters, yet traditional hyperparameter optimization (HPO) methods -- including Tree-structured Parzen Estimators (TPE) and Gaussian Process Bayesian Optimization -- rely on an independence assumption that fundamentally prevents them from navigating these coupled configuration spaces. We address this limitation with a phase-aware large language model (LLM) agent that conditions each proposal on its full optimization history, navigating the coupled parameter space across phase-partitioned exploration, exploitation, and fine-tuning stages. Evaluated on the HICO-DET human-object interaction retrieval benchmark using Intel VDMS (Visual Data Management System), our agent outperforms Optuna TPE by +33.3% and VDTuner by +34.2% under SIEVE (Safeguarded Index Evaluation of Vector-search Efficiency, a quality-constrained throughput metric), delivering a 15.3x throughput gain over UniIR. Validation across three benchmarks confirms that the agent's advantage grows with the degree of parameter coupling: +33.3% on HICO-DET (high coupling), methods converge within 1% on GLDv2 (moderate coupling) and within 3.6% on SIFT1M (near-independent control). Cross-system validation on Milvus confirms the optimizer ranks first on all three datasets without modification, demonstrating transferability across vector database management system (VDBMS) platforms.

100. 【2606.05478】Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?

Link: https://arxiv.org/abs/2606.05478

Authors: Joong Ho Kim, Keith G. Mills

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Diffusion Models, Human Preference Metrics, revolutionized text-driven generation, photorealistic visual content, synthesis of high-quality

Comments: Code is available at [this https URL](https://github.com/LSU-ATHENA/HPM-Predict)

Click to view abstract

Abstract:Diffusion Models (DM) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID and PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPM) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process where random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models for local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. We then investigate to what extent we can leverage such predictions to improve the quality of generated images, and also study which HPMs are best suited for this task. Our investigation reveals not only that this is possible, but also that it is achievable with negligible hardware overhead.

101. 【2606.05471】Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning

Link: https://arxiv.org/abs/2606.05471

Authors: Deepika SN Vemuri, Sayanta Adhikari, Ankit Saha, Krishn Vishwas Kher, Vineeth N Balasubramanian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human reasoning, semantic, Learning, Abstract, Formal Concept Analysis

Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Learning semantics is essential for deep learning models to be interpretable and better aligned with human reasoning. Concept-based models approach this by representing classes through meaningful semantic abstractions, but typically treat all concepts as a flat, unstructured set learned at a single neural network layer. This overlooks a fundamental property of human semantic understanding: concepts are organized hierarchically, from general to specific. While deep networks do learn a hierarchy of visual features, this structure is rarely aligned with explicit semantic hierarchies. Drawing on Formal Concept Analysis, we demonstrate that formal concept lattices provide principled semantic scaffolds to guide neural network learning. These lattices naturally identify where in the network concepts should be learned based on their level of generality. This allows the model to develop staged, semantically grounded representations throughout its depth. Empirical results on real-world datasets show that our models produce more interpretable embeddings, support more effective interventions, and learn concept representations that are both meaningful and hierarchically structured.

102. 【2606.05460】ORACLE-CT: Anatomy-Aware Support Pooling for CT Classification

Link: https://arxiv.org/abs/2606.05460

Authors: Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: classification is challenging, confined to specific, specific organs, Abdominal CT disease, localized disease evidence

Comments

Click to view abstract

Abstract:Abdominal CT disease classification is challenging because each scan is a large 3D volume with many possible findings, while diagnostic evidence is often confined to specific organs or anatomical compartments. Most study-level classifiers aggregate encoder features using anatomy-agnostic pooling or attention, creating a mismatch between localized disease evidence and global evidence aggregation. We propose ORACLE-CT, an encoder-agnostic anatomy-aware aggregation framework that uses multi-organ segmentation to define label-specific anatomical supports and restrict attention pooling to relevant regions. The framework supports single-organ, multi-organ union, comparative, localized, and global support strategies. We evaluate ORACLE-CT with three encoder families: DINOv3, I3D-ResNet-121, and the radiology-native Pillar-0 encoder. Models are trained end-to-end on MERLIN and evaluated internally and under frozen external transfer to Duke-Abdomen and AMOS. Compared with global average pooling, support-masked pooling improved MERLIN macro-AUROC/AUPRC from 0.838/0.638 to 0.858/0.676 for DINOv3 and from 0.829/0.617 to 0.848/0.659 for I3D-ResNet-121. On harmonized 10-label external evaluation, DINOv3 improved on Duke-Abdomen from 0.802/0.628 to 0.835/0.683 and on AMOS from 0.742/0.313 to 0.762/0.350, with similar gains for I3D-ResNet-121. For Pillar-0, most gains came from learned attention, with smaller additional benefit from anatomical masking. ORACLE-CT improves discrimination and external robustness while preserving an auditable link between predictions and anatomical evidence.
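
The idea of restricting attention pooling to an anatomical support can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name is hypothetical, the attention logits are passed in rather than learned, and the mask is assumed to come from a segmentation model resampled to the encoder's token grid.

```python
import numpy as np

def masked_attention_pool(features, scores, support_mask):
    """Attention-pool token features using only tokens inside the support mask.

    features: (N, D) token features; scores: (N,) attention logits;
    support_mask: (N,) boolean, True for tokens inside the organ support.
    """
    if not support_mask.any():
        raise ValueError("support mask is empty")
    # Tokens outside the support get -inf logits, hence zero softmax weight.
    masked = np.where(support_mask, scores, -np.inf)
    masked = masked - masked.max()
    weights = np.exp(masked)
    weights /= weights.sum()
    return weights @ features, weights
```

Because the weights are exactly zero outside the mask, the pooled vector can be traced back to the tokens of the chosen organ, which is one way to read the "auditable link" described above.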

103. 【2606.05458】Horse Eye Blink Detection and Classification for Equine Affective State Assessment

Link: https://arxiv.org/abs/2606.05458

Authors: João Alves, Signe Møller-Skuldbøl, Pia Haubro Andersen, Rikke Gade

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: facial action units, affective state assessment, equine facial action, action units, facial action

Comments: CVPRW2026 CV4Animals

Click to view abstract

Abstract:Automated detection of equine facial action units (AUs) is a promising yet under-explored avenue for pain and affective state assessment in horses. Half- and full-blink movements are recognised indicators of pain and stress, but as micro-expressions, their subtle, fine-grained nature makes them easily missed by the naked eye and only discernible through frame-by-frame video inspection, making reliable automated detection from video a particularly demanding task. We develop and evaluate three methods for automated blink classification from horse videos: a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, tested on a publicly available dataset. We achieve a macro-F1 score of 0.898 on blink classification and 0.926 on binary blink detection. Our results highlight both the potential and the inherent challenges of fine-grained AU detection for equine welfare monitoring.

104. 【2606.05455】Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification

Link: https://arxiv.org/abs/2606.05455

Authors: Feixiang Zhou, Jianyang Xie, Zhuangzhi Gao, Qinkai Yu, Fu Wang, Yuheng Fan, Jing Li, Zheheng Jiang, Yitian Zhao, Yanda Meng, He Zhao, Gregory Y.H. Lip, Yalin Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including product understanding, missing-modality problem poses, recommendation systems, multimedia applications, including product

Comments

Click to view abstract

Abstract:The missing-modality problem poses a significant challenge in image-tabular multimodal learning across a wide range of multimedia applications, including product understanding, recommendation systems, and medical diagnosis. This challenge is particularly pronounced when the two modalities are highly heterogeneous, as images and tabular attributes differ substantially in their semantic granularity and data distributions. Existing methods learn modality-invariant representations through disentanglement and alignment over global token-averaged features, capturing only coarse cross-modal consistency and overlooking fine-grained semantic and distributional misalignment, which hampers the exploitation of complementary cues under missing modalities. To address this, we propose DFPL, a novel framework for fine-grained prototype learning. Specifically, Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse shared and modality-specific prototypes, and further performs prototype-level disentanglement to suppress redundant intra-modality correlations. Additionally, we propose a Prototype-guided Fine-grained Alignment (PFA) module that jointly enforces prototype-level distribution matching and prototype-to-class semantic alignment within a unified prototype space, thereby preserving both fine-grained distributional and semantic consistency across modalities. We further introduce a Class-aware Multi-scale Aggregation (CMA) module to adaptively aggregate shared semantics and modality-specific characteristics from global and prototype levels for robust predictions. Extensive experiments on three diverse image-tabular benchmarks demonstrate the superiority of our method compared to the previous approaches under various missing-modality settings. Code will be made publicly available.

105. 【2606.05437】Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

Link: https://arxiv.org/abs/2606.05437

Authors: Simegnew Yihunie Alaba, Yuichi Motai

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unscented Kalman Filter, Kalman Filter, Unscented Kalman, Visual-Inertial Odometry, Multiscale Convolutional Neural

Comments: 13 pages

Click to view abstract

Abstract:This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
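
The abstract does not give the form of the fusion weights or of the uncertainty-aware loss. As a hedged illustration only, the sketch below uses two common stand-ins: inverse-variance weighting of the two feature streams and a heteroscedastic Gaussian negative log-likelihood. Both function names and the log-variance parameterization are assumptions, not the authors' method.

```python
import numpy as np

def fuse_by_uncertainty(imu_feat, vis_feat, imu_logvar, vis_logvar):
    """Inverse-variance weighting: the less uncertain stream gets more weight."""
    w_imu = np.exp(-imu_logvar)   # precision = 1 / variance
    w_vis = np.exp(-vis_logvar)
    return (w_imu * imu_feat + w_vis * vis_feat) / (w_imu + w_vis)

def gaussian_nll(pred, target, logvar):
    """Heteroscedastic Gaussian negative log-likelihood (additive constant dropped).

    Large predicted variance down-weights the squared error but is penalized
    by the 0.5 * logvar term, so the model cannot inflate it for free.
    """
    return np.mean(0.5 * np.exp(-logvar) * (pred - target) ** 2 + 0.5 * logvar)
```

With equal log-variances the fusion reduces to a plain average; as one stream's log-variance grows, the output approaches the other stream.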

106. 【2606.05409】Would you still call this Dax? Novel Visual References in VLMs and Humans

Link: https://arxiv.org/abs/2606.05409

Authors: Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: remains largely underexplored, exposure remains largely, Visual References Dataset, Vision-language models, largely underexplored

Comments

Click to view abstract

Abstract:Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

107. 【2606.05399】UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching

Link: https://arxiv.org/abs/2606.05399

Authors: Qilin Huang, Quynh Anh Huynh, Long Le, Chen Wang, Chuhao Chen, Ryan Lucas, Eric Eaton, Lingjie Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing feed-forward networks, feed-forward networks excel, point-estimate paradigm fundamentally, paradigm fundamentally fails, real world inherent

Comments: Published at CVPR 2026 as a Highlight. Project page: [this https URL](https://unipixie.github.io/)

Click to view abstract

Abstract:Existing feed-forward networks excel at predicting a single set of physical properties from visual appearance, but this point-estimate paradigm fundamentally fails to capture the real world's inherent physical ambiguity. We address this by reframing physics prediction as a task of learning a controllable, continuous distribution of material properties. We introduce UNIPIXIE, a framework trained to predict a continuous and parameterized path of physically plausible material properties from a single visual input. By learning a direct mapping along an object's softest-to-stiffest spectrum on our PIXIEMULTIVERSE dataset, UNIPIXIE allows for controllable generation of diverse, physically valid material fields via a single intuitive parameter. Crucially, UNIPIXIE introduces a novel unified architecture to produce simulation-ready parameters for diverse physics solvers, including continuum-based Material Point Method (MPM), reduced-order deformation based on Linear Blend Skinning (LBS), and anchor-based Spring-Mass systems, addressing a key portability issue in prior work. Experiments show our approach not only generates a rich variety of plausible dynamics but also reduces Young's Modulus prediction error by over 50% against the strongest deterministic baseline, bridging the gap between static point estimates and the continuous nature of physical reality. Project page: this https URL

108. 【2606.05379】Deep Learning-assisted AMD Staging based on OCT and OCT Angiography

Link: https://arxiv.org/abs/2606.05379

Authors: Yukun Guo, Tristan T. Hormel, An-Lun Wu, Liqin Gao, Min Gao, Steven T. Bailey, Yali Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optical coherence tomography, AMD, OCTA, OCT, age-related macular degeneration

Comments

Click to view abstract

Abstract:This study develops and evaluates deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. The cohort comprised 271 participants aged ≥ 50 years with varying AMD severities. Central macular 6 x 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (quadratic weighted kappa, QWK ≥ 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 ± 0.03, mean ± standard deviation) and the best detection of early AMD (F1-score = 0.59 ± 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 ± 0.04 vs. 0.83 ± 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 ± 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
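
QWK in this abstract most plausibly denotes the quadratic weighted kappa, a standard agreement statistic for ordinal grades such as the four AMD stages. A minimal NumPy sketch follows, assuming stages are encoded as integers 0 to 3; the function name and encoding are illustrative, not taken from the paper.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for ordinal labels in {0, ..., n_classes - 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed confusion matrix (rows: reference grade, columns: predicted grade).
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)
    # Disagreement weights grow with the squared distance between grades.
    i, j = np.indices((n_classes, n_classes))
    weights = (i - j) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under independence of the two raters' marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Because the penalty is quadratic in the grade distance, confusing No AMD with Advanced AMD costs nine times as much as confusing adjacent stages.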

109. 【2606.05375】Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography

Link: https://arxiv.org/abs/2606.05375

Authors: Yukun Guo, Min Gao, Tristan T. Hormel, Steven T. Bailey, Thomas S. Hwang, Yali Jia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Optical coherence tomographic, coherence tomographic angiography, imaging retinal microvasculature, Optical coherence, single OCTA volume

Comments

Click to view abstract

Abstract:Optical coherence tomographic angiography (OCTA) is a powerful technique for imaging retinal microvasculature. However, acquiring reliable quantification of retinal blood flow and areas of retinal nonperfusion is challenging because of imaging artifacts. Existing methods primarily focus on noise suppression, projection artifact removal, or signal enhancement to improve the image quality of OCTA in cross-sectional or two-dimensional (2D) en face projections, while neglecting the intrinsic three-dimensional vascular architecture. In this study, we propose a deep learning-based algorithm for restoring capillary anatomical vasculature from a single OCTA volume. The network consists of an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules, connected via skip connections to preserve spatial resolution. Three adjacent B-frames are used as input to predict the restored middle B-frame. We evaluated the performance of the model using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against ground truth generated from averaging multiple scans. The results show that the proposed model significantly (both p < 0.001) improved image quality compared with the original single OCTA volume, with a PSNR of 26.16 ± 1.26 vs. 22.23 ± 0.78 and an SSIM of 0.91 ± 0.02 vs. 0.72 ± 0.03. The proposed model also significantly (p < 0.001) improved microvascular fidelity, measured by the Dice coefficient overlap between the model output and ground truth, in both 2D and 3D by at least 3.8% and 51.2%, respectively, across several different vascular slabs.
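
Two pieces of this pipeline are easy to make concrete: assembling three adjacent B-frames as the input for each middle frame, and scoring a restored frame with PSNR. The sketch below is illustrative only. The function names are hypothetical, and dropping the first and last B-frame (which lack two neighbors) is an assumption, since the abstract does not say how volume edges are handled.

```python
import numpy as np

def adjacent_triplets(volume):
    """Stack each B-frame with its two neighbors.

    volume: (B, H, W) array of B-frames. Returns (B - 2, 3, H, W); the
    middle channel of triplet k is the frame to restore, volume[k + 1].
    Edge frames are simply dropped here (an assumed convention).
    """
    return np.stack([volume[:-2], volume[1:-1], volume[2:]], axis=1)

def psnr(reference, restored, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    ref = np.asarray(reference, dtype=float)
    out = np.asarray(restored, dtype=float)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

As a reference point, a uniform error of 0.1 on a unit-range image gives a mean squared error of 0.01 and therefore a PSNR of 20 dB.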

110. 【2606.05368】Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Link: https://arxiv.org/abs/2606.05368

Authors: Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spatially explicit characterization, canopy-top height proxies, pipelines predict canopy-top, predict canopy-top height, forest vertical structure

Comments: 32 pages, 21 figures

Click to view abstract

Abstract:Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.

111. 【2606.05359】Recovering Physically Plausible Human-Object Interactions from Monocular Videos

Link: https://arxiv.org/abs/2606.05359

Authors: Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstruct physically plausible, physically plausible human-object, monocular videos, plausible human-object interactions, plausible human-object

Comments: CVPR 2026. Project Page: [this https URL](https://dingbang777.github.io/RePHO/)

Click to view abstract

Abstract:In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: this https URL

112. 【2606.05354】LightVesselNet: An Ultra-Lightweight Sub-100K Parameter Network for Retinal Blood Vessel Segmentation

Link: https://arxiv.org/abs/2606.05354

Authors: Shadman Sobhan, Farhana Jalil

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Retinal blood vessel, retinopathy and glaucoma, vessel segmentation plays, blood vessel segmentation, plays a vital

Comments: none

Abstract:Retinal blood vessel segmentation plays a vital role in the early detection of diabetic retinopathy and glaucoma. While recent deep learning models have achieved great segmentation accuracy, they typically require heavy computational resources, making real-world deployment on edge devices difficult. In this paper, we propose LightVesselNet, an efficient neural network designed for retinal vessel segmentation in a resource-constrained environment. Despite containing only 75K parameters, LightVesselNet performs competitively with much larger models. The network employs a compact encoder-decoder architecture enhanced with channel and spatial attention mechanisms, a multi-scale feature aggregation module at the bottleneck, and a subpixel upsampling strategy in the decoder. A dedicated edge residual connection preserves fine vessel detail throughout decoding. Extensive experiments on five publicly available datasets (DRIVE, STARE, CHASEDB1, FIVES, and HRF) yield sensitivity scores of 0.8189, 0.8499, 0.8640, 0.8634, and 0.8096, and Dice coefficients of 0.8070, 0.8072, 0.8181, 0.8649, and 0.7686, respectively. LightVesselNet shows improved efficiency (performance relative to parameter count or GFLOPs) compared to state-of-the-art models. Cross-dataset evaluation confirms the model's generalisation capability. Overall, LightVesselNet is a strong candidate for deployment in low-resource clinical settings and mobile screening tools.
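
For readers unfamiliar with the two metrics reported above: sensitivity is TP / (TP + FN) and the Dice coefficient is 2TP / (2TP + FP + FN), computed over vessel pixels. The sketch below applies these standard definitions to flattened binary masks. The function names and the toy masks are illustrative assumptions, not the paper's evaluation code.

```python
def confusion_counts(pred: list[int], truth: list[int]) -> tuple[int, int, int]:
    """Count true positives, false positives and false negatives."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t and not p:
            fn += 1
    return tp, fp, fn


def dice(pred: list[int], truth: list[int]) -> float:
    # Dice = 2TP / (2TP + FP + FN); defined as 1.0 when both masks are empty.
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def sensitivity(pred: list[int], truth: list[int]) -> float:
    # Sensitivity (recall) = TP / (TP + FN); 1.0 when there are no positives.
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 1.0


# Toy flattened masks (1 = vessel pixel, 0 = background).
truth_mask = [1, 1, 1, 0, 0, 0, 1, 0]
pred_mask = [1, 1, 0, 0, 1, 0, 1, 0]

dice_score = dice(pred_mask, truth_mask)
sens_score = sensitivity(pred_mask, truth_mask)
```

In the toy masks there are three true positives, one false positive, and one false negative, so both scores come out to 0.75.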

113. 【2606.05347】TopoPult-SSL: Gland-Mask-Free Cross-Device Meibomian Gland Segmentation via Self-Distilled Weak Clinical Priors

Link: https://arxiv.org/abs/2606.05347

Authors: Nicolò Savioli, Luca Del Tongo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pult grades, cheap clinical signals, imaging device creates, morphometric ratios, routinely recorded

Comments: 13 pages, 4 figures, 5 tables

Abstract:Every new clinical imaging device creates a domain shift where dense gland masks are expensive yet cheap clinical signals -- eyelid outlines, Pult grades, morphometric ratios -- are routinely recorded. We present TopoPult-SSL, a two-stage framework for cross-device meibomian gland segmentation. Stage 1 adapts a source-trained model without target gland masks in the training loss, using four weak-prior anchors driven by target eyelid masks and clinical metadata only. Stage 2, when target gland masks are available, distils complementary Stage-1 teachers into a single compact student via supervised self-distillation. We develop and validate the technique on the public MGD-1k to CAMG research benchmark (1,000 to 100 images, different device), where the distilled model achieves Dice 0.716 ± 0.006 (best 0.726), surpassing UA-MT (0.710) and the ensemble teacher (0.720) -- with a single pass. The gland-mask-free Stage-1 variant reaches Precision 0.694 vs. 0.30-0.34 for SAM/MedSAM (p < 0.001), enabling deployment without dense gland contouring. Code and reproducibility scripts are released.

114. 【2606.05328】The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

Link: https://arxiv.org/abs/2606.05328

Authors: Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Majid Mirmehdi

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: candidate world simulators, generate increasingly realistic, Modern video diffusion, models generate increasingly, temporally coherent videos

Comments: none

Abstract:Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model's intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.
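
The inversion step described above, integrating a velocity field backward from a clean latent toward noise, can be illustrated with a fixed-step Euler integrator. This is a toy sketch only: `toy_velocity` is a stand-in for the learned model, latents are scalars rather than tensors, and both the time convention (noise at t = 0, clean sample at t = 1) and the choice of Euler steps are assumptions not taken from the paper.

```python
from typing import Callable


def integrate(
    velocity: Callable[[float, float], float],
    x: float,
    t_start: float,
    t_end: float,
    steps: int,
) -> float:
    """Fixed-step Euler integration of dx/dt = velocity(x, t).

    A negative step size (t_end < t_start) integrates backward in time.
    """
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def toy_velocity(x: float, t: float) -> float:
    # Hypothetical stand-in for a learned velocity field.
    return -x


x_clean = 2.0
# Inversion: walk from the clean latent (t = 1) back to the noise end (t = 0).
x_noise = integrate(toy_velocity, x_clean, 1.0, 0.0, 1000)
# Re-running the sampler forward should land close to the original latent.
x_rec = integrate(toy_velocity, x_noise, 0.0, 1.0, 1000)
```

The round trip is only approximate because each Euler step carries discretization error, which matches the abstract's wording that the sampling process is inverted approximately. The intermediate values of `x` along the backward pass play the role of the recovered trajectory states.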

115. 【2606.05290】Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Link: https://arxiv.org/abs/2606.05290

Authors: Tobia Poppi, Silvia Cappelletti, Sara Sarto, Florian Schiffers, Garin Kessler, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: remain largely model-specific, existing approaches remain, approaches remain largely, Recent progress, central challenge

Comments: Project page: [this https URL](https://aimagelab.github.io/cross-model-safety-representations/)

Abstract:Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.

116. 【2606.05275】Personal AI Agent for Camera Roll VQA

Link: https://arxiv.org/abs/2606.05275

Authors: Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: personal camera roll, question answering setting, personal camera, camera roll, camera roll visual

Comments: Project page, code, and demo: [this https URL](https://thaoshibe.github.io/camroll)

Abstract:We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and methods for long-context understanding and AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.

117. 【2606.05261】NIV: Neural Axis Variations for Variable Font Generation

Link: https://arxiv.org/abs/2606.05261

Authors: Nadav Benedek, Ariel Shamir, Ohad Fried

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Variable fonts enable, optical size, semantic design axes, variable font, Variable

Comments: none

Abstract:Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at this https URL. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.
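
To make "per-point displacements" concrete, the sketch below shows how a set of per-axis deltas could be applied to a glyph outline at a chosen axis setting. It is a deliberately simplified, purely additive model: each axis contributes its delta scaled by a normalized axis value. That is an assumption for illustration, not NIV's Property Embedding mechanism (which the abstract says captures interactions between axes), and `apply_axis_deltas` and the sample data are hypothetical.

```python
Point = tuple[float, float]


def apply_axis_deltas(
    outline: list[Point],
    deltas: dict[str, list[Point]],
    axis_values: dict[str, float],
) -> list[Point]:
    """Shift every outline point by its per-axis delta, scaled by the axis value.

    outline:     default-instance control points of one glyph
    deltas:      axis tag -> one (dx, dy) displacement per outline point
    axis_values: axis tag -> normalized position on that axis (0 = default)
    """
    result = []
    for i, (x, y) in enumerate(outline):
        for axis, value in axis_values.items():
            dx, dy = deltas[axis][i]
            x += value * dx
            y += value * dy
        result.append((x, y))
    return result


# A rectangular stem whose right edge moves outward as weight increases.
stem = [(0.0, 0.0), (100.0, 0.0), (100.0, 700.0), (0.0, 700.0)]
weight_deltas = {"wght": [(0.0, 0.0), (40.0, 0.0), (40.0, 0.0), (0.0, 0.0)]}

half_bold = apply_axis_deltas(stem, weight_deltas, {"wght": 0.5})
```

At a normalized weight of 0.5 the right edge moves from x = 100 to x = 120, while an axis value of 0 leaves the outline unchanged. This is the sense in which predicted displacements allow continuous interpolation along an axis.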

118. 【2606.05259】VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Link: https://arxiv.org/abs/2606.05259

Authors: Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale training corpus, training corpus specifically, corpus specifically designed, strengthen knowledge, video reasoning

Comments: ICML 2026 Spotlight

Abstract:We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT→GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

119. 【2606.05254】Flash-WAM: Modality-Aware Distillation for World Action Models

Link: https://arxiv.org/abs/2606.05254

Authors: Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: jointly generate future, World-action models, generate future video, precludes real-time control, achieving strong performance

Comments: none

Abstract:World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% RoboTwin 2.0, 95.7% LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.

120. 【2606.05185】Drishti AI-Event Guardian: An Intelligent Real-Time Crowd Monitoring and Emergency Response System for Mass Gathering Events

Link: https://arxiv.org/abs/2606.05185

Authors: Ritabrata Roy Choudhury, Arkajyoti Karmakar, Rudra Pratap Mitra

Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: insufficient crowd monitoring, Mass gathering events, caused by insufficient, monitoring and inadequate, Mass gathering

Comments: 22 pages

Abstract:Mass gathering events are associated with critical safety incidents caused by insufficient crowd monitoring and inadequate emergency response coordination. Traditional surveillance systems lack intelligent analytics, resulting in delayed threat identification, poor resource deployment, and weak support for vulnerable individuals during dense public assemblies. This paper presents Drishti AI-Event Guardian, an intelligent crowd management framework using deep learning for public safety enhancement. The architecture combines multimodal data from CCTV networks and UAV platforms, processed by models on Google Vertex AI infrastructure. Core methods include real-time crowd density estimation using YOLOv8, spatiotemporal anomaly detection, and predictive crowd-flow modeling through gradient-boosted regression. Drishti also integrates four modules: (i) facial recognition for missing person identification with crowd-wide notification; (ii) medical emergency reporting with automated dispatch; (iii) a conversational AI chatbot for reports and complaints; and (iv) an intelligent guard reallocation engine that dynamically reassigns personnel in response to crowd density changes. The system is evaluated on two scenarios: the Kumbh Mela gathering and the RCB Victory Parade event, achieving crowd density estimation MAE of 3.2 persons/m², anomaly detection F1-score of 0.91, facial recognition precision of 0.93, and median alert latency of 111 ms. Predictive congestion modeling provides five-minute forecasts with MAPE of 8.3%, enabling preemptive intervention. The chatbot resolved 89% of incident filings without human operators, while guard reallocation reduced responder deployment latency by 34% versus manual reassignment. Results demonstrate a shift from passive surveillance toward active crowd intelligence and a scalable foundation for events from local gatherings to mega festivals.

121. 【2606.05172】Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing

Link: https://arxiv.org/abs/2606.05172

Authors: Yixuan Ding, Wei Huang, Ruijie Quan, Xiaojuan Qi, Yi Yang

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real user requests, natural language instructions, Diffusion-based image editing, contextual constraints embedded, Diffusion-based image

Comments: 23 pages, 10 figures, 7 tables

Abstract:Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.

122. 【2605.30819】Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Link: https://arxiv.org/abs/2605.30819

Authors: Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: methods generate rooms, synthesis methods generate, object-centric prompts, methods generate, scene synthesis methods

Comments: project page: [this https URL](https://function2scene.github.io/)

Abstract:Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.

123. 【2509.15061】Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Link: https://arxiv.org/abs/2509.15061

Authors: Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: passively follow instructions, mere executors, executors that passively, passively follow, embodied agents

Comments: 9 pages, 4 figures, 7 tables

Abstract:The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.

124. 【2606.05849】Inverse Design of Realizable Metasurface based Absorbers using Improved Conditioning and Diversity Enhanced Progressively Growing GANs

Link: https://arxiv.org/abs/2606.05849

Authors: Vineetha Joy, Mohammad Abdullah, Pramit Pal, Anshuman Kumar, Amit Sethi, Hema Singh

Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enable precise manipulation, Metasurfaces enable precise, beam steering, stealth technology, enable precise

Comments: none

Abstract:Metasurfaces enable precise manipulation of electromagnetic waves for applications such as beam steering, sensing, and stealth technology. However, inverse design of metasurfaces with targeted EM responses remains challenging due to the computational expense of iterative full wave simulation driven optimization and the limited conditioning fidelity and diversity of existing generative approaches. To address these challenges, this paper presents a generative inverse design framework for controllable and physically consistent metasurface synthesis under continuous spectral constraints. The proposed approach employs a progressively growing Wasserstein generative adversarial network with gradient penalty integrated with feature wise linear modulation based conditioning for stable propagation of continuous spectral and fabrication constraints. EM consistency is embedded directly into the generative learning process through a surrogate assisted spectral alignment loss, enabling physics constrained generation during training. Further, a determinantal point process based diversity regularization strategy is incorporated to generate geometrically diverse yet spectrally consistent realizations for the same target response. The effectiveness of the proposed framework is demonstrated through the generation of practically realizable metasurface absorbers exhibiting diverse reflection characteristics in the frequency range of 2 to 18 GHz. EM simulations validate that the generated designs meet the target specifications with high accuracy. The final proposed framework achieved an average mean squared error of 0.0052, diversity score of 0.8730, band alignment accuracy of 0.8533, and a valid EM design generation percentage of 89.57, clearly demonstrating its capability to generate highly accurate, diverse, electromagnetically consistent and fabrication realizable metasurface configurations.

125. 【2606.05255】Oklch+: A Three-Parameter Extension of Oklab for Improved Color Difference Prediction

Link: https://arxiv.org/abs/2606.05255

Authors: Naoyuki Uchida

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: prediction accuracy falls, accuracy falls short, cylindrical representation Oklch, difference prediction accuracy, perceptually motivated color

Comments: 3 figures, 8 tables. Submitted to Color Research & Application

Abstract:Oklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD -- 3,813 suprathreshold color difference pairs spanning six independent experimental datasets -- Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.
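
The construction described above (a power transform on L, a Naka-Rushton compression on C, then Euclidean distance) can be sketched as follows. This is a minimal illustration under stated assumptions: the Naka-Rushton form C^n / (C^n + s^n) is the standard one, but the parameter names `p`, `n`, `s`, their placeholder defaults, and the exact way the compressed chroma is mapped back to a/b coordinates are guesses for illustration, not the paper's fitted values or definitions.

```python
import math

Lab = tuple[float, float, float]


def naka_rushton(c: float, n: float, s: float) -> float:
    # Saturating response c^n / (c^n + s^n): 0 at c = 0, 0.5 at c = s,
    # and approaching 1 as c grows.
    if c <= 0.0:
        return 0.0
    cn = c ** n
    return cn / (cn + s ** n)


def transform(lab: Lab, p: float, n: float, s: float) -> Lab:
    """Map Oklab (L, a, b) to hypothetical transformed coordinates."""
    lightness, a, b = lab
    chroma = math.hypot(a, b)
    hue = math.atan2(b, a)
    c_compressed = naka_rushton(chroma, n, s)
    return (
        max(lightness, 0.0) ** p,      # power transform on the L-axis
        c_compressed * math.cos(hue),  # hue angle is kept, chroma compressed
        c_compressed * math.sin(hue),
    )


def color_difference(
    lab1: Lab, lab2: Lab, p: float = 1.0, n: float = 1.0, s: float = 0.5
) -> float:
    # Defaults are placeholders only, NOT the optimized Oklch+ parameters.
    return math.dist(transform(lab1, p, n, s), transform(lab2, p, n, s))


grey = (0.5, 0.0, 0.0)
red_ish = (0.63, 0.22, 0.13)
delta = color_difference(grey, red_ish)
```

Since the difference is a plain Euclidean distance in the transformed space, it is symmetric and zero for identical inputs, and straight-line interpolation between two transformed colors is well defined. That is the property the abstract relies on when it discusses interpolation.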