本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新607篇论文,其中:

  • 自然语言处理83
  • 信息检索15
  • 计算机视觉100

自然语言处理

1. 【2606.19341】Native Active Perception as Reasoning for Omni-Modal Understanding

链接https://arxiv.org/abs/2606.19341

作者:Zhenghao Xing,Ruiyang Xu,Yuxuan Wang,Jinzheng He,Ziyang Ma,Qize Yang,Yunfei Chu,Jin Xu,Junyang Lin,Chi-Wing Fu,Pheng-Ann Heng

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD)

关键词:processing frames uniformly, causing computational cost, understanding typically rely, processing frames, query difficulty

备注: Accepted at ICML 2026. Code and models: [this https URL](https://github.com/harryhsing/omniagent)

点击查看摘要

Abstract:Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2. 【2606.19336】Learning User Simulators with Turing Rewards

链接https://arxiv.org/abs/2606.19336

作者:Yingshan Susan Wang,Cedegao E. Zhang,Linlu Qiu,Zexue He,Pengyuan Li,Alex Pentland,Roger P. Levy,Yoon Kim

类目:Computation and Language (cs.CL)

关键词:agent assistants, personalization systems, social sciences, interactive settings, settings could advance

备注

点击查看摘要

Abstract:Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose {Turing-RL}: a Turing-Test-based reinforcement learning approach for training user simulator models. {Turing-RL} uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains--conversational chat and Reddit forum discussion--we find that {Turing-RL} consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

3. 【2606.19334】Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

链接https://arxiv.org/abs/2606.19334

作者:Denis Peskoff,Joe Barrow,Christopher Vu,Diag Davenport

类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:authoritative legal text, increasingly depends, county ordinance codes, Local Ordinance Corpus, Progress

备注: 14 pages, 6 figures

点击查看摘要

Abstract:Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law among several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: this https URL

4. 【2606.19327】Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

链接https://arxiv.org/abs/2606.19327

作者:Siyi Gu,Jialin Chen,Sophia Zhou,Arman Cohan,Rex Ying

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:commonly driven, driven by supervised, reinforcement learning, supervised distillation, reasoning language models

备注

点击查看摘要

Abstract:Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose \textbf{Rubric-Conditioned Self-Distillation}, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

5. 【2606.19308】Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

链接https://arxiv.org/abs/2606.19308

作者:Leyang Shen,Yang Zhang,Xiaoyan Zhao,Chun Kai Ling,Tat-Seng Chua

类目:Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:Large language model, demonstrated great potential, Large language, based multi-agent systems, language model

备注: 18 pages, 8 figures

点击查看摘要

Abstract:Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.

6. 【2606.19266】rade-offs in Medical LLM Adaptation: An Empirical Study in French QA

链接https://arxiv.org/abs/2606.19266

作者:Ikram Belmadani,Oumaima El Khettari,Carlos Ramisch,Frederic Bechet,Richard Dufour,Benoit Favre

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, strategies remains unclear, large language, remains unclear, development of large

备注

点击查看摘要

Abstract:The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

7. 【2606.19264】Structured Inference with Large Language Gibbs

链接https://arxiv.org/abs/2606.19264

作者:Sanghyeok Choi,Henry Gouk,Esmeralda S. Whitammer

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:large language models, Large Language Gibbs, probabilistically coherent manner, coherent manner poses, difficult inference problem

备注: Code: [this https URL](https://github.com/hyeok9855/large-language-gibbs)

点击查看摘要

Abstract:The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

8. 【2606.19257】DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

链接https://arxiv.org/abs/2606.19257

作者:Zirui Wu,Lin Zheng,Jiacheng Ye,Shansan Gong,Xueliang Zhao,Yansong Feng,Wei Bi,Lingpeng Kong

类目:Computation and Language (cs.CL)

关键词:parallel block-wise denoising, block sizes, reasoning remains unresolved, models accelerate decoding, block-wise denoising

备注

点击查看摘要

Abstract:Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at this https URL.

9. 【2606.19236】STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

链接https://arxiv.org/abs/2606.19236

作者:Haipeng Luo,Qingfeng Sun,Songli Wu,Can Xu,Wenfeng Deng,Han Hu,Yansong Tang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Verifiable Rewards algorithms, Reinforcement Learning, Learning with Verifiable, Verifiable Rewards, dominant post-training paradigm

备注: LLM, Reinforcement Learning

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by it, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training this http URL is available at this https URL.

10. 【2606.19218】RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

链接https://arxiv.org/abs/2606.19218

作者:Pushwitha Krishnappa,Amit Das,Vinija Jain,Aman Chadha,Tathagata Mukherjee

类目:Computation and Language (cs.CL)

关键词:evaluating LLM-generated text, genuine content alignment, discriminative power, LLM-generated text, surface coincidence

备注

点击查看摘要

Abstract:Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7--10B) against every reply each metric paired with a random-derangement noise floor we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at this https URL

11. 【2606.19183】Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

链接https://arxiv.org/abs/2606.19183

作者:Soheyl Bateni,Maryam Abdolali

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, interpreting free-text documentation, information order, Language-assisted Machine-learning Pipeline

备注

点击查看摘要

Abstract:Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

12. 【2606.19170】Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

链接https://arxiv.org/abs/2606.19170

作者:Shiho Matta,Yin Jou Huang,Fei Cheng,Takashi Kodama,Hirokazu Kiyomaru,Yugo Murawaki

类目:Computation and Language (cs.CL)

关键词:large language model, language model designed, designed for controlled, large language, introduce Dango

备注: 8 pages main text, 20 pages total including references and appendices

点击查看摘要

Abstract:We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

13. 【2606.19144】Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

链接https://arxiv.org/abs/2606.19144

作者:Jingyi Zhou,Senlin Luo,Haofan Chen

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:http URL, social cognitive energy, made significant progress, social cognitive, Current conversational

备注

点击查看摘要

Abstract:Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI this http URL address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy this http URL construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p 0.001), and interaction trajectories exhibit progressive energy reduction over this http URL findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

14. 【2606.19139】Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

链接https://arxiv.org/abs/2606.19139

作者:Ramza Basharat,Muhammad Usman Ali

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Automatic Handwritten Text, Urdu Handwritten Text, Handwritten Text Recognition, Handwritten Text, Automatic Handwritten

备注

点击查看摘要

Abstract:Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

15. 【2606.19121】Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

链接https://arxiv.org/abs/2606.19121

作者:Hui Zhang,Shuren Song

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:expanding context windows, accumulating defensive rules, addressing conceptual drift, prevailing engineering intuition, designing symbolic identifier

备注: 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

点击查看摘要

Abstract:The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

16. 【2606.19111】Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

链接https://arxiv.org/abs/2606.19111

作者:Haewoon Kwak

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:Team science holds, multi-agent LLM teams, team science predicts, Team science, autonomous teams

备注: 33 pages

点击查看摘要

Abstract:Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

17. 【2606.19051】Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

链接https://arxiv.org/abs/2606.19051

作者:Qiuyu Fang,Jiayi Hao,Chengzhi Zhang

类目:Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:Research methods, essential carriers, contribution in academic, research intelligence analysis, knowledge contribution

备注: ASIST 2026

点击查看摘要

Abstract:Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy by partitioning the full-text content according to its physical postion. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

18. 【2606.19005】Sumi: Open Uniform Diffusion Language Model from Scratch

链接https://arxiv.org/abs/2606.19005

作者:Mengyu Ye,Keito Kudo,Wataru Ikeda,Ryosuke Matsuda,Keisuke Sakaguchi,Jun Suzuki

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Diffusion, uniform diffusion, promising alternative, uniform diffusion language, models

备注

点击查看摘要

Abstract:Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

19. 【2606.19002】Enhancing Multilingual Reasoning via Steerable Model Merging

链接https://arxiv.org/abs/2606.19002

作者:Zhuoran Li,Rui Xu,Jian Yang,Junnan Liu,Zhijun Chen,Qianren Mao,Hongcheng Guo,Jiaheng Liu,Likang Xiao,Ming Li,Xiaojie Wang

类目:Computation and Language (cs.CL)

关键词:effective technique, technique for composing, composing the capabilities, multilingual reasoning, Model

备注: 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

点击查看摘要

Abstract:Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

20. 【2606.18989】G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

链接https://arxiv.org/abs/2606.18989

作者:Fengying Ye,Yanming Sun,Runzhe Zhan,Zheqi Zhang,Lidia S. Chao,Derek F. Wong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:weak surface-form grounding, making literal mappings, literal mappings unreliable, surface-form grounding, mappings unreliable

备注: Accepted to ACL 2026

点击查看摘要

Abstract:Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

21. 【2606.18986】Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

链接https://arxiv.org/abs/2606.18986

作者:Yafeng Wu,Huu Hiep Nguyen,Thin Nguyen,Hung Le

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:natural-language question answering, question answering, time-series question answering, Byte Pair Encoding, Recent advances

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

22. 【2606.18954】GraphPO: Graph-based Policy Optimization for Reasoning Models

链接https://arxiv.org/abs/2606.18954

作者:Yuliang Zhan,Xinyu Tang,Jian Li,Dandan Zheng,Weilong Chai,Jingdong Chen,Jun Zhou,Ge Wu,Wenyue Tang,Hao Sun

类目:Computation and Language (cs.CL)

关键词:Reinforcement Learning, Learning with Verifiable, Verifiable Rewards, large reasoning models, RLVR typically samples

备注

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using from final answers. This paradigm has two limitations. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

23. 【2606.18947】Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

链接https://arxiv.org/abs/2606.18947

作者:Emmanuel Aboah Boateng,Kyle MacDonald,Amardeep Kumar,Siddharth Kodwani,Sudeep Das

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

关键词:LLM agents increasingly, Production LLM agents, bundles retrieval policy, agents increasingly depend, LLM agents

备注: 15 pages, Figure 8

点击查看摘要

Abstract:Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

24. 【2606.18946】SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

链接https://arxiv.org/abs/2606.18946

作者:Jingkun Luo,Yifan Sun,Da-Tian Peng,Guanxiong Pei

类目:Computation and Language (cs.CL)

关键词:AI-generated text detection, existing methods classify, discarding inter-sentence dependencies, existing benchmarks omit, Sentence-level AI-generated text

备注: 16 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: this https URL

25. 【2606.18941】Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

链接https://arxiv.org/abs/2606.18941

作者:Pierre Dantas,Lucas Cordeiro,Waldir Junior

类目:Programming Languages (cs.PL); Computation and Language (cs.CL)

关键词:61131-3 Ladder Diagram, PLCopen XML defines, Ladder Diagram programs, IEC 61131-3 Ladder, Ladder Diagram

备注: 18 pages

点击查看摘要

Abstract:PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using rung elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:https://doi.org/10.5281/zenodo.20699856).

26. 【2606.18922】As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

链接https://arxiv.org/abs/2606.18922

作者:Jasmine Owers,Edwin Simpson,Martha Lewis

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:current language models, language models, Figurative language, written and spoken, language

备注: 16 pages, 16 figures; for associated code and data see [this https URL](https://github.com/jrdowers/Negation-and-Fig-Lang;) To be published in Transactions of the Association for Computational Linguistics

点击查看摘要

Abstract:Figurative language and negation are two areas that challenge current language models, however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

27. 【2606.18910】REVES: REvision and VErification--Augmented Training for Test-Time Scaling

链接https://arxiv.org/abs/2606.18910

作者:Yuanxin Liu,Ruida Zhou,Xinyan Zhao,Amr Sharaf,Hongzhou Lin,Arijit Biswas,Mohammad Ghavamzadeh,Zhaoran Wang,Mingyi Hong

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:enhancing Large Language, Large Language Model, Large Language, Test-time scaling, enhancing Large

备注

点击查看摘要

Abstract:Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that model can learn from correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps (``near-miss'' answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n\_queens and mini\_sudoku, where correctness is defined entirely by problem constraints. Code is available at this https URL.

28. 【2606.18902】SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

链接https://arxiv.org/abs/2606.18902

作者:Ziyi Zhu,Luka Smyth,Saki Shinoda,Jinghong Chen

类目:Computation and Language (cs.CL)

关键词:Context engineering, parameter updates, engineering has emerged, primary lever, lever for improving

备注

点击查看摘要

Abstract:Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.

29. 【2606.18893】Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

链接https://arxiv.org/abs/2606.18893

作者:Zhuangzhuang Pan,Ning Dong,Yingna Su,Yan Xia

类目:Computation and Language (cs.CL)

关键词:Multimodal emotion-cause pair, Multimodal emotion-cause, emotion-cause pair extraction, requires reliable pair, reliable pair confidence

备注: 11 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.

30. 【2606.18889】Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

链接https://arxiv.org/abs/2606.18889

作者:Adrian Cosma,Nicoleta-Nina Basoc,Andrei Niculae,Cosmin Dumitrache,Emilian Radoi

类目:Computation and Language (cs.CL)

关键词:Text-based telemedicine increasingly, telemedicine increasingly relies, primarily reflects perceived, Text-based telemedicine, reflects perceived communication

备注: 4 Tables, 8 Figures

点击查看摘要

Abstract:Text-based telemedicine increasingly relies on lightweight patient feedback, however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.

31. 【2606.18875】Efficient Financial Language Understanding via Distillation with Synthetic Data

链接https://arxiv.org/abs/2606.18875

作者:Wen-Fong(Xavier)Huang,Edwin Simpson

类目:Computation and Language (cs.CL)

关键词:expert annotation cost, Large instruction-following models, costly to deploy, annotation cost, powerful but costly

备注

点击查看摘要

Abstract:Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

32. 【2606.18856】Approximate Structured Diffusion for Sequence Labelling

链接https://arxiv.org/abs/2606.18856

作者:Nicolas Floquet,Joseph Le Roux,Nadi Tomeh

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Natural Language Processing, Conditional Random Field, task of Natural, Language Processing, Sequence labelling

备注

点击查看摘要

Abstract:Sequence labelling, a core task of Natural Language Processing (NLP), consists in assigning each token of an input sentence a label. From a Machine Learning point of view, sequence labelling is often cast as a Linear-Chain Conditional Random Field (CRF) parametrised by a neural network. While this approach gives good empirical results, CRFs assume a finite decision span (eg label bigrams) which can limit their expressivity and hurt performance when long-range dependencies are required. We show we can leverage diffusion to train a CRF conditioned on an entire label sequence, with the caveat that the condition is on a noisy version of labels. We show experimentally that this method, in conjunction with approximate CRF inference, improves label accuracy with a 16.5% error reduction for POS-tagging.

Subjects:

Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2606.18856 [cs.CL]

(or
arXiv:2606.18856v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2606.18856

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Joseph Le Roux [view email] [v1]
Wed, 17 Jun 2026 09:36:34 UTC (66 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Approximate Structured Diffusion for Sequence Labelling, by Nicolas Floquet and 2 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context:
cs.CL

prev

|
next

new
|
recent
| 2026-06

Change to browse by:

cs
cs.LG

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked="checked"class=“labs-tab-input”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

33. 【2606.18852】Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

链接https://arxiv.org/abs/2606.18852

作者:Wicaksono Leksono Muhamad,Yunita Sari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Classifying implicit hate, implicit hate speech, hate speech remains, Classifying implicit, remains a challenge

备注

点击查看摘要

Abstract:Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.

34. 【2606.18850】ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

链接https://arxiv.org/abs/2606.18850

作者:Bohou Zhang,Xiaoyu Tao,Mingyue Cheng,Huijie Liu,Qi Liu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Abstractive summarization plays, enabling efficient understanding, Abstractive summarization, plays a crucial, crucial role

备注

点击查看摘要

Abstract:Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at this https URL.

35. 【2606.18831】Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

链接https://arxiv.org/abs/2606.18831

作者:Xiaoyue Xu,Sikui Zhang,Xiaorong Wang,Xu Han,Chaojun Xiao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:lengthy trajectories, large language models, essential capability, capability for large, large language

备注: 15 pages, 6 figures, 12 tables

点击查看摘要

Abstract:Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

36. 【2606.18829】GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

链接https://arxiv.org/abs/2606.18829

作者:Zhe Ren,Yibo Yang,Yimeng Chen,Zijun Zhao,Benshuo Fu,Zhihao Shu,Bingjie Zhang,Yangyang Xu,Dandan Guo,Shuicheng Yan

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:assume single-user settings, LLM agents largely, largely assume single-user, agents largely assume, leaving shared assistants

备注: 24 pages, 8 figures. Code and dataset are available at [this https URL](https://github.com/rzhub/GateMem) and [this https URL](https://huggingface.co/datasets/Ray368/GateMem)

点击查看摘要

Abstract:Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

37. 【2606.18797】Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

链接https://arxiv.org/abs/2606.18797

作者:Qingyu Lu,Ruochen Li,Liang Ding,Yufei Xia,Youxiang Zhu,Dacheng Tao

类目:Computation and Language (cs.CL)

关键词:affect patient care, mischaracterized radiographic observations, directly affect patient, radiology reports requires, reports requires strict

备注: Under Review

点击查看摘要

Abstract:Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using ReEvalMed benchmark as testbed and evaluate metric-level clinical significance from detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where D-R balance is critical. We will release the dataset and metric.

38. 【2606.18788】HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

链接https://arxiv.org/abs/2606.18788

作者:Jaward Sesay,Yue Yu,Börje F. Karlsson

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Teaching machines, requires synthesizing stroke, single person handwriting, pressure and script, vary in shape

备注

点击查看摘要

Abstract:Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

39. 【2606.18782】RedactionBench

链接https://arxiv.org/abs/2606.18782

作者:Sean Brynjólfsson,Shashvat Jayakrishnan,Esha Sali,Diptanshu Purwar,Madhav Aggarwal

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, personally identifiable information, increasingly applied, applied to sensitive

备注

点击查看摘要

Abstract:Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

40. 【2606.18781】Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

链接https://arxiv.org/abs/2606.18781

作者:Shanshan Lyu,Yiwei Wang,Yujun Cai,Jiafeng Guo,Shenghua Liu

类目:Computation and Language (cs.CL)

关键词:Dense retrieval ranks, Evidence Dilution Index, Dense retrieval, ranks one query, Dense

备注: Code is available at [this https URL](https://github.com/PunchlineAAAA/DICE)

点击查看摘要

Abstract:Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey 4k rises from 30.0 to 90.0 and Needle 4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.

41. 【2606.18780】SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

链接https://arxiv.org/abs/2606.18780

作者:Quanjiang Guo,Chong Mu,Jiazhou Pan,Ming Jia,Ling Tian,Hui Gao,Zhao Kang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:Named Entity Recognition, Multimodal Information Extraction, Multimodal Named Entity, Relation Extraction, Information Extraction

备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Multimodal Information Extraction (MIE)-covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE)-is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

42. 【2606.18767】Output Vector Editing for Memorization Mitigation in Large Language Models

链接https://arxiv.org/abs/2606.18767

作者:Ahmad Dawar Hakimi,Kaiwei Lei,Isabelle Augenstein,Hinrich Schütze

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, language models memorize, creating privacy, training data

备注

点击查看摘要

Abstract:Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60--64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.

43. 【2606.18728】LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

链接https://arxiv.org/abs/2606.18728

作者:Songhan Zuo,Shengbin Yue,Tao Chiang,Guanying Li,Yun Song,Xuanjing Huang,Zhongyu Wei

类目:Computation and Language (cs.CL)

关键词:Chinese civil litigation, Chinese civil, paired Chinese civil, Chinese civil judgments, lawyer drafts

备注

点击查看摘要

Abstract:Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

44. 【2606.18717】Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

链接https://arxiv.org/abs/2606.18717

作者:Tolga Şakar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:fragmenting semantically loaded, drive modern language, semantically loaded suffixes, language models split, modern language models

备注

点击查看摘要

Abstract:Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents \textbf{Morpheus}, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character ($1.425$), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 $0.61$ vs.\ ${\sim}0.32$), and uses ${\sim}19\%$ less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP $0.85$) and same-root verification (ROC-AUC $1.00$), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: this https URL model: this https URL interactive demo: this https URL.

45. 【2606.18709】LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

链接https://arxiv.org/abs/2606.18709

作者:Han Chen,Ming Li,Chenguang Wang,Yijun Liang,Dawei Zhou,Hong jiao,Tianyi Zhou

类目:Computation and Language (cs.CL)

关键词:item meaningfully distinguishes, meaningfully distinguishes students, Classical Test Theory, fundamental psychometric property, higher proficiency

备注

点击查看摘要

Abstract:Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.

46. 【2606.18699】W-LegalBench: Measuring Taiwanese Legal Understanding

链接https://arxiv.org/abs/2606.18699

作者:Fei-Yueh Chen,Chun Huang Lin,Chan Wei Hsu,Kuan Hsuan Yeh,Zih-Ching Chen,Kuan-Ming Chen,Patrick Chung-Chia Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large language models, shown impressive capabilities, Large language, jurisdiction-specific legal reasoning, reasoning remains underexplored

备注: 10 pages, 2 figures, To appear in ICAIL 2026

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench that utilizes Taiwanese legal system's rich official corpus open to the public to fill the gap in evaluating LLMs on Taiwanese law, among common-law benchmarks that focus on English sources and civil-law benchmarks focusing on sources of Simplified Chinese. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1~2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

47. 【2606.18694】Attention as Frustrated Synchronization

链接https://arxiv.org/abs/2606.18694

作者:Joshua Nunley

类目:Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO)

关键词:synchronizes perfectly computes, attention architecture built, Frustrated Synchronization Network, departures from agreement, oscillators that synchronizes

备注: 25 pages, 4 figures. Preliminary report at the 1-10M parameter scale

点击查看摘要

Abstract:A network of oscillators that synchronizes perfectly computes nothing further, so an attention architecture built from synchronization must locate its computation in structured departures from agreement. We introduce the Frustrated Synchronization Network (FSN), whose token states are phases on a torus and whose entire value pathway is one learned complex coupling kernel over harmonics and a one-step delay. Each component of the kernel is a frustration in the sense of the synchronization literature. The complex phases are static Kuramoto-Sakaguchi frustration angles, the signed harmonics are repulsive Daido components, and the delay term, which couples each token to the successors of the tokens it attends to, is algebraically identical to Kuramoto-Sakaguchi coupling whose frustration angle is the data's own transition, so next-token prediction is implemented as synchronization frustrated by the data. At matched one-million-parameter and training budgets on character-level text and code, the FSN's validation loss is below a tuned RoPE-SwiGLU transformer's at every epoch measured, and the comparison survives training the baseline to convergence: every thirty-epoch enwik8 seed finishes below the transformer's converged fifty-epoch loss of 1.611, and the FSN's completed fifty-epoch runs converge to 1.5953 +/- 0.0014. A variant with every feed-forward block replaced by mean-field coupling to learned collective modes, leaving no multilayer perceptron in the stack, tracks the transformer. On natural text the unfrustrated base layer falls behind the converged transformer at every copy depth, worst on long-range copy events; the kernel reverses the deficit at every depth of four and beyond. Headline comparisons are at the one-million-parameter scale; a scale ladder is complete through four million parameters with the advantage persisting, and remaining arms are marked as in progress.

48. 【2606.18686】ForecastBench-Sim: A Simulated-World Forecasting Benchmark

链接https://arxiv.org/abs/2606.18686

作者:Jaeho Lee,Nick Merrill,Ezra Karger

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:outcomes resolve slowly, resolve slowly, tail events, general-purpose AI systems, systems usually inherit

备注: 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

点击查看摘要

Abstract:Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

49. 【2606.18668】EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

链接https://arxiv.org/abs/2606.18668

作者:Shuang Xie,Yunan Lu,Han Li,Lingyun Wang

类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL)

关键词:centralized multi-agent systems, delegates user requests, coordinator delegates user, centralized multi-agent, multi-agent systems

备注

点击查看摘要

Abstract:In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents' ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.

50. 【2606.18663】RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

链接https://arxiv.org/abs/2606.18663

作者:Kaiyan Zhao,Zhongtao Miao,Akiko Aizawa,Yoshimasa Tsuruoka

类目:Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, Language Model pretraining, critical for Large, Language Model

备注: Work in progress

点击查看摘要

Abstract:Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve data mixture. By training regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

51. 【2606.18656】he Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

链接https://arxiv.org/abs/2606.18656

作者:Naihao Deng,Yiming Feng,Chimaobi Okite,Kaijian Zou,Lu Wang,Rada Mihalcea,Yulong Chen

类目:Computation and Language (cs.CL)

关键词:paper studies stereotypes, stereotypes and biases, studies stereotypes, potentially disturbing, illustration purposes

备注

点击查看摘要

Abstract:Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.

52. 【2606.18636】PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

链接https://arxiv.org/abs/2606.18636

作者:Yingyu Shan,Zeming Liu,Silin Li,Boao Qian,Jiashu Yao,Yuhang Guo,Haifeng Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, language interaction capabilities, natural language interaction, Language Models, Large Language

备注: Accepted by ACL 2026 Findings

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have empowered home assistants with natural language interaction capabilities. However, current assistants overlook the progressive omission that occurs in human dialogue as shared context accumulates, leading to more elliptical expressions for efficient communication. Thus, current assistants still struggle to interpret such elliptical expressions accurately, which limits their effectiveness in real-world applications. In practical smart home scenarios, assistants face two major challenges caused by elliptical commands: (1) referential ambiguity caused by different environmental expectations among multiple users; and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address these challenges, we introduce PEC-Home, the first simulated home dataset specifically designed for interpreting progressively elliptical commands in smart homes. Extensive experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations based solely on elliptical commands. Even when equipped with tools for storing and retrieving user dialogue history, execution accuracy remains below that achieved with complete commands.}.

53. 【2606.18624】PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

链接https://arxiv.org/abs/2606.18624

作者:Jihyung Park,Minchao Huang,Leqi Liu,Elias Stengel-Eskin

类目:Computation and Language (cs.CL)

关键词:Natural language understanding, Natural language, requiring pragmatic reasoning, explicitly stated, understanding often depends

备注: First two authors contributed equally. Code and models: [this https URL](https://github.com/jihyung803/PragReST)

点击查看摘要

Abstract:Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

54. 【2606.18620】BCL: Bayesian In-Context Learning Framework for Information Extraction

链接https://arxiv.org/abs/2606.18620

作者:Haoliang Liu,Chengkun Cai,Xu Zhao,Han Zhu,Shizhou Huang,Xinglin Zhang,Tao Chen,Jenq-Neng Hwang,Zhang Huaping,Lei Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, increasingly adopt in-context, tasks increasingly adopt, Existing information extraction, information extraction

备注: ACL 2026 Findings

点击查看摘要

Abstract:Existing information extraction (IE) tasks increasingly adopt in-context learning (ICL) with large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. Building on this, we propose BCL (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps initialization, observation, weight update, and resampling, BCL generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial and consistent improvements over existing approaches.

55. 【2606.18613】Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

链接https://arxiv.org/abs/2606.18613

作者:Tianming Du,Peijie Yu,Sihan Shang,Danli Shi,My Linh Nguyen,Shengbo Gao,Guangyuan Li,Yinghong Yu,Yan Jiang,Qianlong Zhao,Behzad Bozorgtabar,Shaoxiong Ji,Jiazhen Pan,Daniel Rueckert,Jiancheng Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:plausible near-term role, EHR system interaction, plausible near-term, near-term role, role of medical

备注: 34 pages with 8 figures

点击查看摘要

Abstract:The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

56. 【2606.18606】Steerable Cultural Preference Optimization of Reward Models

链接https://arxiv.org/abs/2606.18606

作者:Minsik Oh,Advit Deepak,Sophie Wu,Douwe Kiela,Ekaterina Shutova

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language model, technology to serve, essential for large, large language, LLM alignment

备注: Accepted to Pluralistic Alignment @ ICML 2026

点击查看摘要

Abstract:It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at this https URL

57. 【2606.18597】Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

链接https://arxiv.org/abs/2606.18597

作者:Fan Xu,Yangjie Dan,Keyu Yan,Yong Ma,Mingwen Wang

类目:Computation and Language (cs.CL)

关键词:Chinese dialects discrimination, Chinese dialects, challenging natural language, natural language processing, language processing task

备注: Published in ACM TALLIP

点击查看摘要

Abstract:Chinese dialects discrimination is a challenging natural language processing task due to scarce annotation resource. In this article, we develop a novel Chinese dialects discrimination framework with transfer learning and data augmentation (CDDTLDA) in order to overcome the shortage of resources. To be more specific, we first use a relatively larger Chinese dialects corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise disturbance) to augment the target-side low-resource Chinese dialects, and fine-tune another target ASR model based on the previous source-side ASR model. Meanwhile, the potential common semantic features between source-side and target-side ASR models can be captured by using self-attention mechanism. Finally, we extract the hidden semantic representation in the target ASR model to conduct Chinese dialects discrimination. Our extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialects corpora.

58. 【2606.18587】Dual Dimensionality for Local and Global Attention

链接https://arxiv.org/abs/2606.18587

作者:Zhiyuan Wang,Xuan Luo,Sirui Zeng,Xifeng Yan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Decoder-only Transformers compute, Decoder-only Transformers, Transformers compute attention, Transformers compute, Decoder-only

备注

点击查看摘要

Abstract:Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of its distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.

59. 【2606.18584】Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

链接https://arxiv.org/abs/2606.18584

作者:Fan Xu,Jian Luo,MingWen Wang,GuoDong Zhou

类目:Computation and Language (cs.CL)

关键词:language processing task, natural language processing, challenging natural language, processing task, Chinese dialect

备注: Published in ACM TALLIP

点击查看摘要

Abstract:Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of speech-driven features towards language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features towards CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embedding and the MFCC-based features. Evaluation of two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to the state-of-the-art methods.

60. 【2606.18571】Fair Cognitive Impairment Detection Through Unlearning

链接https://arxiv.org/abs/2606.18571

作者:William Nguyen,Jiali Cheng,Hadi Amiri

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Mild Cognitive Impairment, Mild Cognitive, Cognitive Impairment, medical condition characterized, decline in memory

备注: Interspeech 2026

点击查看摘要

Abstract:Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.

61. 【2606.18543】CEO-Bench: Can Agents Play the Long Game?

链接https://arxiv.org/abs/2606.18543

作者:Haozhe Chen,Karthik Narasimhan,Zhuang Liu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:Language model agents, executors at isolated, proficient executors, software engineering, Language model

备注

点击查看摘要

Abstract:Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

62. 【2606.18530】Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

链接https://arxiv.org/abs/2606.18530

作者:Aaditya Pai

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:evading standard detectors, embed malicious instructions, syntactic injection markers, attacks embed malicious, attack success

备注: 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026

点击查看摘要

Abstract:Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84\% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33\% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.

63. 【2606.18508】MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

链接https://arxiv.org/abs/2606.18508

作者:Amirhossein Abaskohi,Raymond Li,Gaetano Cimino,Peter West,Giuseppe Carenini,Issam H. Laradji

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:systems depend critically, Retrieval-augmented generation, systems depend, chunked and searched, depend critically

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on this https URL.

64. 【2606.18502】owards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

链接https://arxiv.org/abs/2606.18502

作者:Paresh Dashore,Shreyas Kulkarni,Uttam Gurram,Nadia Bathaee,Kartik Balasubramaniam,Genta Indra Winata,Sambit Sahu,Shi-Xiong Zhang

类目:Computation and Language (cs.CL)

关键词:Large language model, Large language, enabling broad enterprise, broad enterprise applications, based multi-agent systems

备注: Preprint

点击查看摘要

Abstract:Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.

65. 【2606.18487】SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

链接https://arxiv.org/abs/2606.18487

作者:Siddharth Aphale,Kelly Liu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:rollout distribution, GRPO, standard heuristic, heuristic of selecting, compresses the rollout

备注: 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

点击查看摘要

Abstract:The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3 seed mean, $n{=}20$); pre RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two stage diagnostic, combining pre RL entropy triage with an early GRPO entropy monitor, flags high risk checkpoints and can stop failing runs early. Simple KL to reference regularisation and label smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.

66. 【2606.18473】PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

链接https://arxiv.org/abs/2606.18473

作者:Bo Su,Ankit Shah,Thai Le

类目:Computation and Language (cs.CL)

关键词:large language models, Machine unlearning, aims to remove, large language, preserving the rest

备注: 12 pages, 6 figures

点击查看摘要

Abstract:Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

67. 【2606.18471】Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

链接https://arxiv.org/abs/2606.18471

作者:Hongbo Du,Zixin Lu,Jiaming Qu

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, language models, summarization and revision, clinical text tasks

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as ``possible pneumonia'' communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.

68. 【2606.18466】Montreal Forced Aligner and the state of speech-to-text alignment in 2026

链接https://arxiv.org/abs/2606.18466

作者:Michael McAuliffe,Kaylynn Gunter,Michael Wagner,Morgan Sonderegger

类目:Computation and Language (cs.CL)

关键词:Montreal Forced Aligner, Montreal Forced, neural forced aligners, research and industry, widely used tool

备注

点击查看摘要

Abstract:The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.

69. 【2606.18453】LLM Parameters for Math Across Languages: Shared or Separate?

链接https://arxiv.org/abs/2606.18453

作者:Behzad Shomali,Luisa Victor,Tim Selbach,Ali Hamza Bashir,David Berghaus,Joachim Koehler,Mehdi Ali,Markus Frey

类目:Computation and Language (cs.CL)

关键词:Large language models, mathematical reasoning performance, substantial cross-lingual variation, mathematical reasoning, Large language

备注: 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: [this https URL](https://github.com/luisavictor/math-across-languages) Translated Datasets: [this https URL](https://huggingface.co/math-across-languages) Webpage: https://math-across-languages.github.io

点击查看摘要

Abstract:Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

70. 【2606.18448】VISUALSKILL: Multimodal Skills for Computer-Use Agents

链接https://arxiv.org/abs/2606.18448

作者:Ziyan Jiang,Li An,Yujian Liu,Jiabao Ji,Qiucheng Wu,Jacob Andreas,Yang Zhang,Shiyu Chang

类目:Computation and Language (cs.CL)

关键词:approach human-level performance, Computer-use agents, approach human-level, unseen software, human-level performance

备注

点击查看摘要

Abstract:Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at this https URL.

71. 【2606.18406】CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

链接https://arxiv.org/abs/2606.18406

作者:Jiaqi Chen,Yongqin Zeng,Shaoshen Chen,Yijian Zhang,Hai-Tao Zheng,Chunxia Ma,XiuTeng Zhou

类目:Computation and Language (cs.CL)

关键词:require continuous long-term, maintain coherent interactions, Personalized dialogue agents, Personalized dialogue, continuous long-term memory

备注: 15 pages, 5 figures

点击查看摘要

Abstract:Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.

72. 【2606.18394】JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

链接https://arxiv.org/abs/2606.18394

作者:Lanxiang Hu,Zhaoxiang Feng,Yulun Wu,Haoran Yuan,Yujie Zhao,Yu-Yang Qian,Bojun Wang,Daxin Jiang,Yibo Zhu,Tajana Rosing,Hao Zhang

类目:Computation and Language (cs.CL)

关键词:autoregressive Large Language, Large Language Models, Large Language, accelerates autoregressive Large, overhead stays low

备注

点击查看摘要

Abstract:Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at this https URL.

73. 【2606.18389】Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

链接https://arxiv.org/abs/2606.18389

作者:Jan Cegin,Daniil Gurgurov,Yusser Al Ghussin,Simon Ostermann

类目:Computation and Language (cs.CL)

关键词:Large language models, synthetic data generation, Large language, effective tool, including for low-resource

备注: 25 pages

点击查看摘要

Abstract:Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.

74. 【2606.18388】LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

链接https://arxiv.org/abs/2606.18388

作者:Haoyang Fang,Wei Zhu,Boran Han,Alex Zhang,Zhenyu Pan,Shuo Yang,Shuai Zhang,Jiading Gai,Peng Tang,Cuixiong Hu,Xuan Zhu,Huzefa Rangwala,George Karypis,Bernie Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:recurring empirical pattern, capacity parameters accumulate, parameters accumulate monotonically, parameters predominantly oscillate, empirical pattern

备注

点击查看摘要

Abstract:RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

75. 【2606.18383】From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

链接https://arxiv.org/abs/2606.18383

作者:Dibyanayan Bandyopadhyay,Asif Ekbal

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:native hidden activation, pretrained SAE reconstruction, post-hoc generalization framework, extract interpretable features, central question remains

备注

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

76. 【2606.18381】SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

链接https://arxiv.org/abs/2606.18381

作者:Amirhossein Abaskohi,Issam H. Laradji,Peter West,Giuseppe Carenini

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Retrieval-augmented generation, existing methods address, single-level context expansion, systems must balance, contextual coherence

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available on this https URL.

77. 【2606.18372】Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

链接https://arxiv.org/abs/2606.18372

作者:Haocheng Zhang,Zhuqian Zhou,Kirk Vanacore,Bakhtawar Ahtisham,René F. Kizilcec

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:capture authentic learning, capture personally identifiable, personally identifiable information, capture authentic, capture personally

备注

点击查看摘要

Abstract:Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

78. 【2606.18284】Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

链接https://arxiv.org/abs/2606.18284

作者:Lorenz Wolf,Connor Watts,Roger Creus Castanyer,Geoffrey Bradway,Maxwill Lin,Augustine N. Mavor-Parker,Matthew Daborn-Sargent

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:reinforcement learning, targeted solve rate, limiting resource, agents via reinforcement, solve rate

备注: 30 pages, 9 figures, 12 tables

点击查看摘要

Abstract:The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.

79. 【2606.18273】Continuous Audio Thinking for Large Audio Language Models

链接https://arxiv.org/abs/2606.18273

作者:Gyojin Han,Dong-Jae Lee,Changho Choi,Jongsuk Kim,Junmo Kim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:shown impressive capabilities, Large audio language, Large audio, audio language models, audio understanding tasks

备注: Preprint

点击查看摘要

Abstract:Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo~3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

80. 【2606.18264】Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

链接https://arxiv.org/abs/2606.18264

作者:Fan Huang

类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:online platforms remains, Faithful modeling, hateful content propagation, online platforms, open problem

备注

点击查看摘要

Abstract:Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4--99.7\% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5--12.9\% reduction at 5.7\% benign collateral.

81. 【2606.19157】IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

链接https://arxiv.org/abs/2606.19157

作者:Sakshi Joshi,Dhruv Subhash Rathi,Sanskar Singh,Eldho Ittan George,R J Hari,Kaushal Bhogale,Mitesh M. Khapra

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

关键词:enable speech recognition, speech recognition conditioned, recognition conditioned, conditioned on textual, AudioLLMs enable speech

备注: Accepted at Interspeech 2026

点击查看摘要

Abstract:AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

82. 【2606.18979】Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

链接https://arxiv.org/abs/2606.18979

作者:Franziska Braun,Christopher Witzl,Andreas Erzigkeit,Hartmut Lehfeld,Thomas Hillemacher,Tobias Bocklet,Korbinian Riedhammer

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词:Early detection, multiple cognitive domains, cognitive impairment relies, assessing multiple cognitive, impairment relies

备注: Accepted at INTERSPEECH 2026

点击查看摘要

Abstract:Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

83. 【2606.18520】Compact Geometric Representations of Hierarchies

链接https://arxiv.org/abs/2606.18520

作者:Prashant Gokhale,Piotr Indyk,Yuhao Liu,Sandeep Silwal,Tony Chang Wang,Haike Xu

类目:Machine Learning (stat.ML); Computational Geometry (cs.CG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Computing geometric representations, training dual encoders, Computing geometric, shared embedding space, modern machine learning

备注: Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

点击查看摘要

Abstract:Computing geometric representations of data is a cornerstone of modern machine learning, typically achieved by training dual encoders which map queries and documents into a shared embedding space. Recent work of You et al. [NeurIPS '25] has extended this approach to hierarchical retrieval, where relevance is determined by the ancestor-descendant relationships in a Directed Acyclic Graph (DAG). While previous work has shown that valid embeddings exist when the number of descendants is small, these bounds degrade significantly for deep hierarchies, requiring dimensions as large as the total number of nodes. In this paper, we investigate compact reachability embeddings for more general graph classes and provide theoretical guarantees for representing hierarchies using embeddings whose dimension depends on structural graph parameters. We prove that for any directed tree, there exists a reachability embedding in constant dimension 3, independent of the tree's size or depth. We generalize this result to graphs characterized by treewidth $t$, constructing embeddings of dimension $O(t \log n)$, where $n$ is the number of nodes. Complementing these upper bounds, we provide matching or near-matching lower bounds, showing that dimension $\Omega(n)$ is necessary for general DAGs and $\Omega(t/\log(n/t))$ is required for graphs of treewidth $t$. We also obtain upper and lower bounds parameterized by the number of cross-edges in the DAG. We additionally show that our embeddings can be constructed on real world datasets, and that they give much smaller dimensions in high recall regimes compared to prior embeddings with theoretical guarantees.

Comments:
Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

Subjects:

Machine Learning (stat.ML); Computational Geometry (cs.CG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Cite as:
arXiv:2606.18520 [stat.ML]

(or
arXiv:2606.18520v1 [stat.ML] for this version)

https://doi.org/10.48550/arXiv.2606.18520

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

信息检索

1. 【2606.19051】Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

链接https://arxiv.org/abs/2606.19051

作者:Qiuyu Fang,Jiayi Hao,Chengzhi Zhang

类目:Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:Research methods, essential carriers, contribution in academic, research intelligence analysis, knowledge contribution

备注: ASIST 2026

点击查看摘要

Abstract:Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy by partitioning the full-text content according to its physical postion. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

2. 【2606.19037】Querit-Reranker: Training Compact Multilingual Rerankers via Efficient Label-Free Distribution Adaptation

链接https://arxiv.org/abs/2606.19037

作者:Yunfei Zhong,Jun Yang,Wei Huang,Yinqiong Cai,Haosheng Qian,Yixing Fan,Ruqing Zhang,Lixin Su,Daiting Shi,Jiafeng Guo

类目:Information Retrieval (cs.IR)

关键词:target ranking tasks, generalize across languages, ranking tasks, tasks while remaining, remaining efficient

备注

点击查看摘要

Abstract:Deployable multilingual rerankers must generalize across languages, domains, and target ranking tasks while remaining efficient enough for second-stage reranking. However, adapting them to new target distributions typically requires extensive task-specific relevance annotations, which are costly to obtain. We present Querit-Reranker, a family of multilingual cross-encoder rerankers trained with a data-centric pipeline for label-efficient adaptation. We instantiate it as Querit-Reranker-A0.4B, initialized from an in-house MoE backbone with 0.4B activated parameters, and Querit-Reranker-4B, initialized from Qwen3-Embedding-4B. Our pipeline first learns general relevance modeling from large-scale ranking-oriented data, then adapts to target distributions through synthetic-query mining with teacher scores as continuous soft labels. To consolidate complementary task-adapted strengths, we further merge checkpoints via spherical linear interpolation, obtaining a single deployable model without runtime ensembling overhead. Using Qwen3-Embedding-0.6B as the shared first-stage retriever, Querit-Reranker-A0.4B improves average nDCG@10 from 54.11 to 59.28 on BEIR and from 59.87 to 67.70 on MIRACL. On MTEB Multilingual v2 Reranking, it also substantially outperforms larger embedding-based baselines, while Querit-Reranker-4B further achieves state-of-the-art performance among publicly available models. We release both models on Hugging Face.

3. 【2606.18947】Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

链接https://arxiv.org/abs/2606.18947

作者:Emmanuel Aboah Boateng,Kyle MacDonald,Amardeep Kumar,Siddharth Kodwani,Sudeep Das

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

关键词:LLM agents increasingly, Production LLM agents, bundles retrieval policy, agents increasingly depend, LLM agents

备注: 15 pages, Figure 8

点击查看摘要

Abstract:Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

4. 【2606.18933】Zero-Shot Active Feature Acquisition via LLM-Elicitation

链接https://arxiv.org/abs/2606.18933

作者:Binyamin Perets,Natalie Mendelson,Shiran Vainberg,Yehuda Chowers,Shai Shen-Orr,Shie Mannor

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR); Methodology (stat.ME)

关键词:sequentially selects, ranking decision, Active feature acquisition, observe to reach, feature acquisition

备注

点击查看摘要

Abstract:Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amount of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.

5. 【2606.18897】SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

链接https://arxiv.org/abs/2606.18897

作者:Jiangnan Xia,Xuansheng Wu,Yu Yang,Xin Wang,Ninghao Liu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:gained significant attention, Intent-based recommender systems, systems have gained, gained significant, improving accuracy

备注

点击查看摘要

Abstract:Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high information density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. It contains personal intents matching a user's current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, consistently outperforming state-of-the-art baselines while providing human-understandable explanations.

6. 【2606.18885】LARE: Low-Attention Region Encoding for Text-Image Retrieval

链接https://arxiv.org/abs/2606.18885

作者:Abdulmalik Alquwayfili,Faisal Almeshal,Jumanah Almajnouni,Leena Alotaibi,Faisal Alhajari,Mohammed Alkhrashi,Alreem Almuhrij,Abdullah Aldwyish,Raied Aljadaany,Huda Alamri,Muhammad Kamran J. Khan

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:conventional visual encoders, Low-Attention Region Encoding, neglecting low-attention regions, salience bias, bias of conventional

备注: Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: [this https URL](https://github.com/AbdulmalikDS/LARE) ; Dataset: [this https URL](https://huggingface.co/datasets/AbdulmalekDS/Dense-Set)

点击查看摘要

Abstract:Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

7. 【2606.18850】ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

链接https://arxiv.org/abs/2606.18850

作者:Bohou Zhang,Xiaoyu Tao,Mingyue Cheng,Huijie Liu,Qi Liu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Abstractive summarization plays, enabling efficient understanding, Abstractive summarization, plays a crucial, crucial role

备注

点击查看摘要

Abstract:Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at this https URL.

8. 【2606.18814】LensKit-Auto

链接https://arxiv.org/abs/2606.18814

作者:Max Breit,Anass Amezian El Idrissi,Rishikesh Giriraj Kulkarni,Luca Quade

类目:Information Retrieval (cs.IR)

关键词:Automated Recommender System, social media, area of application, video streaming, digital marketplaces

备注

点击查看摘要

Abstract:Recommender systems have a wide area of application, e.g. in fields like video streaming, social media, or digital marketplaces. But, for a recommender-system, finding the right algorithm with the right hyperparameters is a reoccurring challenge. There is no one-fits-all solution, since the performance of one algorithm can vary immensely on different data sets. Due to the challenges of finding the right algorithm and the broad use of recommender-systems, it is of interest to create an Automated Recommender System (AutoRecSys) that takes on the task of finding the right algorithm-hyperparameter-combination for a given data set. In this work, we present the enhancement of LensKit-Auto, a framework introduced by Vente et al., that solves exactly this task of finding a fitting algorithm-hyperparameter-combination. LensKit-Auto's biggest strength lies in its ease of use, where it operates as a black-box, into which the user can feed their data set and receive the information of which algorithm and hyperparameters work best on this data set. In this work, we bring LensKit-Auto up to date, so that it works with the new version of its underlying framework, LensKit. We also implement further functionalities, such as the Tree Parzen Estimator as an additional optimization method, the ability to reuse the found algorithm, updated documentation, and the ability to visualize the optimization process. We also adapt an existing meta-learning framework to generate a suitable meta-dataset for LensKit-Auto, which could enable the integration of meta-learning into LensKit-Auto in the future. The presented changes bring LensKit-Auto up to date and enhance its usability, so that even non-experts in the field can find the right algorithm for their use case.

9. 【2606.18811】Rescaling MLM-Head for Neural Sparse Retrieval

链接https://arxiv.org/abs/2606.18811

作者:Youngjoon Jang,Seongtae Hong,Jonah Turner,Heuiseok Lim

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:BERT-style masked language, Learned sparse retrieval, Learned sparse, SPLADE training recipes, masked language models

备注

点击查看摘要

Abstract:Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse under standard SPLADE training recipes. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

10. 【2606.18801】SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

链接https://arxiv.org/abs/2606.18801

作者:Youngjoon Jang,Seongtae Hong,Hyeonseok Moon,Heuiseok Lim

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:massive multilingual corpora, global information access, Multilingual Information Retrieval, MLIR, rapid expansion

备注

点击查看摘要

Abstract:With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.

11. 【2606.18699】W-LegalBench: Measuring Taiwanese Legal Understanding

链接https://arxiv.org/abs/2606.18699

作者:Fei-Yueh Chen,Chun Huang Lin,Chan Wei Hsu,Kuan Hsuan Yeh,Zih-Ching Chen,Kuan-Ming Chen,Patrick Chung-Chia Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large language models, shown impressive capabilities, Large language, jurisdiction-specific legal reasoning, reasoning remains underexplored

备注: 10 pages, 2 figures, To appear in ICAIL 2026

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench that utilizes Taiwanese legal system's rich official corpus open to the public to fill the gap in evaluating LLMs on Taiwanese law, among common-law benchmarks that focus on English sources and civil-law benchmarks focusing on sources of Simplified Chinese. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1~2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

12. 【2606.18508】MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

链接https://arxiv.org/abs/2606.18508

作者:Amirhossein Abaskohi,Raymond Li,Gaetano Cimino,Peter West,Giuseppe Carenini,Issam H. Laradji

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:systems depend critically, Retrieval-augmented generation, systems depend, chunked and searched, depend critically

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on this https URL.

13. 【2606.18381】SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

链接https://arxiv.org/abs/2606.18381

作者:Amirhossein Abaskohi,Issam H. Laradji,Peter West,Giuseppe Carenini

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Retrieval-augmented generation, existing methods address, single-level context expansion, systems must balance, contextual coherence

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available on this https URL.

14. 【2606.18379】RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

链接https://arxiv.org/abs/2606.18379

作者:Renzhi Wu,Zikun Cui,Junjie Yang,Tai Guo,Hong Li,Xian Chen,Li Yu,Ke Pan,Sri Reddy,Mahesh Srinivasan,Nipun Mathur,Haomin Yu,Hong Yan

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:tightly coupled problems, existing work addresses, billion-node scale requires, scale requires jointly, requires jointly solving

备注

点击查看摘要

Abstract:Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems -- graph construction, representation learning, and real-time serving -- yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage's requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN -- this pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure -- this requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8 x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1 x higher than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.

15. 【2606.18520】Compact Geometric Representations of Hierarchies

链接https://arxiv.org/abs/2606.18520

作者:Prashant Gokhale,Piotr Indyk,Yuhao Liu,Sandeep Silwal,Tony Chang Wang,Haike Xu

类目:Machine Learning (stat.ML); Computational Geometry (cs.CG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Computing geometric representations, training dual encoders, Computing geometric, shared embedding space, modern machine learning

备注: Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

点击查看摘要

Abstract:Computing geometric representations of data is a cornerstone of modern machine learning, typically achieved by training dual encoders which map queries and documents into a shared embedding space. Recent work of You et al. [NeurIPS '25] has extended this approach to hierarchical retrieval, where relevance is determined by the ancestor-descendant relationships in a Directed Acyclic Graph (DAG). While previous work has shown that valid embeddings exist when the number of descendants is small, these bounds degrade significantly for deep hierarchies, requiring dimensions as large as the total number of nodes. In this paper, we investigate compact reachability embeddings for more general graph classes and provide theoretical guarantees for representing hierarchies using embeddings whose dimension depends on structural graph parameters. We prove that for any directed tree, there exists a reachability embedding in constant dimension 3, independent of the tree's size or depth. We generalize this result to graphs characterized by treewidth $t$, constructing embeddings of dimension $O(t \log n)$, where $n$ is the number of nodes. Complementing these upper bounds, we provide matching or near-matching lower bounds, showing that dimension $\Omega(n)$ is necessary for general DAGs and $\Omega(t/\log(n/t))$ is required for graphs of treewidth $t$. We also obtain upper and lower bounds parameterized by the number of cross-edges in the DAG. We additionally show that our embeddings can be constructed on real world datasets, and that they give much smaller dimensions in high recall regimes compared to prior embeddings with theoretical guarantees.

Comments:
Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

Subjects:

Machine Learning (stat.ML); Computational Geometry (cs.CG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Cite as:
arXiv:2606.18520 [stat.ML]

(or
arXiv:2606.18520v1 [stat.ML] for this version)

https://doi.org/10.48550/arXiv.2606.18520

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

计算机视觉

1. 【2606.19341】Native Active Perception as Reasoning for Omni-Modal Understanding

链接https://arxiv.org/abs/2606.19341

作者:Zhenghao Xing,Ruiyang Xu,Yuxuan Wang,Jinzheng He,Ziyang Ma,Qize Yang,Yunfei Chu,Jin Xu,Junyang Lin,Chi-Wing Fu,Pheng-Ann Heng

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD)

关键词:processing frames uniformly, causing computational cost, understanding typically rely, processing frames, query difficulty

备注: Accepted at ICML 2026. Code and models: [this https URL](https://github.com/harryhsing/omniagent)

点击查看摘要

Abstract:Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2. 【2606.19338】Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

链接https://arxiv.org/abs/2606.19338

作者:Shengyuan Ding,Xilin Wei,Xinyu Fang,Haodong Duan,Dahua Lin,Jiaqi Wang,Yuhang Zang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:closed-loop policies increasingly, Deploying multimodal foundation, policies increasingly requires, increasingly requires conditioning, longer visible

备注

点击查看摘要

Abstract:Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

3. 【2606.19333】Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

链接https://arxiv.org/abs/2606.19333

作者:Bhawna Paliwal,Haritheja Etukuru,William Liang,Pieter Abbeel,Nur Muhammad Mahi Shafiullah,Jitendra Malik

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:scalably generate data, scalably generate, human-like platforms, human videos, human

备注: Project website: [this https URL](https://do-as-i-do.com/)

点击查看摘要

Abstract:How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

4. 【2606.19325】Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

链接https://arxiv.org/abs/2606.19325

作者:Michael Finkelson,Daniel Segal,Eitan Richardson,Shahar Armon,Nani Goldring,Poriya Panet,Nir Zabari,Benjamin Brazowski,Or Patashnik,Yoav HaCohen

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:learnable speaker embeddings, multi-stream transcriptions, dialogue systems bind, systems bind speakers, systems bind

备注: Project page at [this https URL](https://finmickey.github.io/scena/)

点击查看摘要

Abstract:Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the \textit{Reference Shortcut}. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.

5. 【2606.19316】NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

链接https://arxiv.org/abs/2606.19316

作者:Chong Bao,Yuan Li,Bangbang Yang,Yujun Shen,Hujun Bao,Zhaopeng Cui,Yinda Zhang,Guofeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated significant advantages, Recently neural implicit, Recently neural, scene reconstruction, evolved rapidly

备注: TPAMI 2025; Project Page: [this https URL](https://zju3dv.github.io/neumeshplusplus/)

点击查看摘要

Abstract:Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: this https URL

6. 【2606.19300】Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

链接https://arxiv.org/abs/2606.19300

作者:Xin Ci Wong,Duygu Sarikaya,Kieran Zucker,Marc De Kamps,Nishant Ravikumar

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:multiparametric MRI, Glioma segmentation, treatment planning, component of treatment, Dice

备注: Accepted for MIUA2016

点击查看摘要

Abstract:Glioma segmentation in multiparametric MRI is a critical component of treatment planning. A segmentation model that fails silently on treatment-critical sub-regions represents a patient safety risk that overlap-based metrics such as Dice scores cannot expose. We ask whether voxel-level uncertainty estimation via Monte Carlo (MC) Dropout can reliably identify segmentation errors in clinically critical sub-regions, and whether calibration failure modes are detectable from standard reporting metrics alone. In an empirical two-model case study on 126 BraTS21 patients, we evaluate a high-performance pretrained SegResNet and a locally trained UNet with residual units (UNet-Res). MC dropout preserved segmentation accuracy ($|\Delta \text{Dice}|$ $0.01$) while achieving strong uncertainty-error alignment (AUROC for entropy (H) $\approx$0.97), indicating uncertainty correctly ranks erroneous voxels above correct ones. Entropy-based patient stratification identified a high-uncertainty subgroup with substantially lower segmentation performance (median whole-tumour Dice $0.835$ vs. $0.925$), supporting uncertainty as a practical triage signal. However, global alignment can mask important region-specific differences. Despite similar AUROC, UNet-Res exhibited near-zero enhancing tumour entropy ($0.054$) and Expected Calibration Error (ECE) of $0.915$, with a Dice of only $0.714$, indicating severely miscalibrated confidence on the most clinically critical sub-region, a failure mode invisible to standard Dice and AUROC reporting. These findings demonstrate that strong uncertainty-error alignment is necessary but insufficient for clinical safety: sub-region-specific calibration assessment must accompany AUROC evaluation when selecting models for clinical deployment.

7. 【2606.19277】A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

链接https://arxiv.org/abs/2606.19277

作者:Timothy Agboada,Shikha Chandel,Yadav Raj Ghimire,Leila Hashemi-Beni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Question Answering, multi scale object, scale object distribution, unique challenges due, Visual Question

备注: 4 pages, 2 figures, accepted and to be presented at 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington D.C

点击查看摘要

Abstract:Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi scale object distribution, and semantic complexity of aerial imagery. While general domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine tuning. This study presents a comparative analysis of RS Adapter, a Parameter Efficient Fine Tuning (PEFT) strategy, applied across three distinct Vision Language Model (VLM) architectures: the Dual Encoder CLIP, the Encoder Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of trainable parameters. Experimental results on the high resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource efficient VQA in disaster assessment and urban monitoring.

8. 【2606.19259】A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

链接https://arxiv.org/abs/2606.19259

作者:Yijin Wang,Shuyi Wang,Wenhan Zhang,Yuqi Ouyang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Text-rich images, decision-relevant information, detecting text-rich images, AI-generated text-rich images, images

备注

点击查看摘要

Abstract:Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

9. 【2606.19258】CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

链接https://arxiv.org/abs/2606.19258

作者:Haohua Que,Zhipeng Bao,Qianyi Wu,Handong Yao

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Cloud-hosted large multimodal, large multimodal models, provide strong open-vocabulary, naively transmitting full-resolution, Cloud-hosted large

备注

点击查看摘要

Abstract:Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$--$87\%$ ROI pixel-coverage reduction with $5$--$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

10. 【2606.19253】OneCanvas: 3D Scene Understanding via Panoramic Reprojection

链接https://arxiv.org/abs/2606.19253

作者:Bartłomiej Baranowski,Dave Zhenyu Chen,Matthias Nießner

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:model-specific geometry encoders, Existing approaches, Vision-Language Models, large training budgets, scene understanding

备注: Project page: [this https URL](https://baranowskibrt.github.io/onecanvas/)

点击查看摘要

Abstract:Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.

11. 【2606.19249】ransformer Geometry Observatory TGO-I: Spectral Geometry Observatory

链接https://arxiv.org/abs/2606.19249

作者:Kaustubh Kapil,Kishor P. Upla

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:computer vision applications, numerous computer vision, Transformer Geometry Observatory, Vision Transformers, representational geometry remains

备注

点击查看摘要

Abstract:Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.

12. 【2606.19240】Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

链接https://arxiv.org/abs/2606.19240

作者:Thomas M. Kwok,Nicholas Koenig,Yue Hu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)

关键词:Arm Kinematic Correction, conventional marker-based systems, low-cost and non-invasive, non-invasive alternative, alternative to conventional

备注

点击查看摘要

Abstract:Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.

13. 【2606.19215】GUMP-Net: An interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation

链接https://arxiv.org/abs/2606.19215

作者:Liheng Wang,Yinghui Zhang,Licheng Zhang,Hailin Xu,Qiyong Cao,Chong Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fundamental research problems, diagnosis and treatment, important and fundamental, fundamental research, research problems

备注: 26 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Pelvic segmentation is one of the most important and fundamental research problems in precise and intelligent diagnosis and treatment, as well as surgical planning and navigation for pelvic fractures. By combining an improved geodesic active contour model with deep neural networks, we propose GUMP-Net, an interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation, in which three network modules are designed to constitute the overall segmentation framework together: the object detection module for automatic level set initialization, the edge detector module for learning an anatomy-aware edge detector function and the iteration module for deep level set evolution. Leveraging the advantages of level set representation and deep learning, GUMP-Net shows more accurate, robust and consistent segmentation performance, especially in small training data situation, compared to the state-of-the-art methods. Extensive experiments on pelvic datasets demonstrate the rationality and effectiveness of the proposed algorithm. Further experiments extended to ankle dataset indicate broader applications to other anatomies. The proposed algorithm not only provides an efficient segmentation method for complex fracture reduction, but also gives an interpretable geometric perspective for understanding deep learning segmentation.

14. 【2606.19204】ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

链接https://arxiv.org/abs/2606.19204

作者:Nengbo Zhang,Chang sheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Pinus sylvestris var, Accurate identification, identification of Pinus, Google Earth Engine, Pinus sylvestris

备注: journal in tree classification

点击查看摘要

Abstract:Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.

15. 【2606.19195】Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

链接https://arxiv.org/abs/2606.19195

作者:Kangsheng Duan,Ziyang Xu,Wenyu Liu,Xiaohu Ruan,Xiaoxin Chen,Xinggang Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:hinder practical deployment, prohibitive computational costs, computational costs severely, costs severely hinder, severely hinder practical

备注

点击查看摘要

Abstract:While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$\lambda$ Mix Interaction ($L\lambda MI$) block. Comprising Local-$\lambda$ and Interactive-$\lambda$ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2\% of the parameters (0.22B vs. 11.9B) while delivering a $15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at this https URL.

16. 【2606.19184】When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

链接https://arxiv.org/abs/2606.19184

作者:Dat Nguyen,Cosmin Radoi,Romain Hermary,Marcella Astrid,Nesryne Mejri,Enjie Ghorbel,Djamila Aouada

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:non-consensual explicit content, harms including financial, including financial fraud, highly realistic deepfakes, real-world harms including

备注

点击查看摘要

Abstract:Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC) that averages per-domain AUCs with a measure of prediction polarization for taking into account the robustness to domain shift. The polarization extent is quantified by the Wasserstein Distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shifts more realistically, but it is also interpretable as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.

17. 【2606.19162】he Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

链接https://arxiv.org/abs/2606.19162

作者:Nicolas Beltran-Velez,Felix Friedrich,Zhang Xiaofeng,Reyhane Askari-Hemmat,Xiaochuang Han,Adriana Romero-Soriano,Michal Drozdzal

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:coherent object structure, aligning with subjective, coherent object, object structure, structure that matching-based

备注: 84 pages, including appendices

点击查看摘要

Abstract:Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.

Comments:
84 pages, including appendices

Subjects:

Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2606.19162 [cs.LG]

(or
arXiv:2606.19162v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2606.19162

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2606.19156】Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

链接https://arxiv.org/abs/2606.19156

作者:Jeongmin Bae,Seoha Kim,Marc Pollefeys,Mahdi Rad,Youngjung Uh,Taein Kwon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:next-generation computing platforms, essential for next-generation, next-generation computing, computing platforms, hand reconstruction

备注: Project page: [this https URL](https://jeongminb.github.io/hand-4dgs/)

点击查看摘要

Abstract:Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

19. 【2606.19151】he Market in the Model: Latent Diffusion as Neural Economy

链接https://arxiv.org/abs/2606.19151

作者:Eryk Salvaggio

类目:Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)

关键词:generative image models, Valuable critique, visual culture, humanities has emphasized, emphasized the role

备注

点击查看摘要

Abstract:Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

20. 【2606.19139】Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

链接https://arxiv.org/abs/2606.19139

作者:Ramza Basharat,Muhammad Usman Ali

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Automatic Handwritten Text, Urdu Handwritten Text, Handwritten Text Recognition, Handwritten Text, Automatic Handwritten

备注

点击查看摘要

Abstract:Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

21. 【2606.19120】Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

链接https://arxiv.org/abs/2606.19120

作者:Sihan Wang,Xiyao Liu,Lianqing Liu,Zhi Han

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:provide dense token-level, dense token-level targets, token-level targets conditioned, On-policy self-distillation, reference target

备注: 29 pages, 5 figures, 8 tables

点击查看摘要

Abstract:On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

22. 【2606.19103】ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

链接https://arxiv.org/abs/2606.19103

作者:Mukund Khanna,Raj Singh Yadav,Kunal Singh

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:natural language instructions, Recent advances, instruction-based image editing, perform complex visual, complex visual edits

备注: CVPR HiGen 2026

点击查看摘要

Abstract:Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements are critical, current open and closed source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and Perceptual metrics, and MLLM-based evaluations as well, indicating stronger product consistency, text rendering, and overall visual quality; with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline is available at this https URL

Comments:
CVPR HiGen 2026

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2606.19103 [cs.CV]

(or
arXiv:2606.19103v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.19103

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
23. 【2606.19100】AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

链接https://arxiv.org/abs/2606.19100

作者:Diogo Glória-Silva,João Cardeira,Manuel Letras da Luz,Afonso Simplício,Gonçalo Vinagre,Diogo Tavares,Rafael Ferreira,Inês Calvo,Inês Vieira,David Semedo,João Magalhães

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remains systematically underserved, Large Vision, Brazilian Portuguese, European Portuguese multimodal, European Portuguese

备注

点击查看摘要

Abstract:Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute with a purposefully designed three-stage training process - vision-language alignment, general visual instruction tuning, and preference optimization - together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT this http URL will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

24. 【2606.19097】DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

链接https://arxiv.org/abs/2606.19097

作者:Yanjie Tu,Qingsen Yan,Axi Niu,Tao Hu,Haokui Zhang,Jiantao Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:handling diverse degradation, image restoration aims, image restoration, unified restoration framework, aims to develop

备注: All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior

点击查看摘要

Abstract:All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.

25. 【2606.19096】PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

链接https://arxiv.org/abs/2606.19096

作者:João Cardeira,Diogo Glória-Silva,Manuel Letras da Luz,Rafael Ferreira,Diogo Tavares,David Semedo,João Magalhães

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:European Portuguese, high-resource languages, largely absent, skew toward high-resource, European

备注

点击查看摘要

Abstract:European Portuguese (pt-PT) is largely absent from OCR benchmarks, which skew toward high-resource languages. The few benchmarks that cover pt-PT focus on historical artifacts and literature. This work addresses modern OCR applications, introducing PorTEXTO, the first benchmark for contemporary and culturally relevant pt-PT visual text extraction. To ascertain quality, we employ an annotation pipeline combining transcriptions from a frontier LVLM with exhaustive review by native speakers. We observe a sharp performance drop from synthetic to real world samples in most models, and find that, currently, specialized multilingual data is a better driver for pt-PT performance than model size or resolution budget, motivating the release of open pt-PT OCR resources.

26. 【2606.19073】aming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

链接https://arxiv.org/abs/2606.19073

作者:Jiayi Gao,Qingchao Chen,Yuxin Peng,Yang Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:critical challenge unaddressed, static attributes, Current image editing, human-object pair preservation, global metrics incapable

备注

点击查看摘要

Abstract:Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI), a critical challenge unaddressed by existing benchmarks that conflate HOI with static attributes, relying on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that reliably evaluates instance-level interaction by letting VLM QA after thinking with images containing grounded Human-Object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at this https URL.

27. 【2606.19067】Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

链接https://arxiv.org/abs/2606.19067

作者:Roberto Corlito,Fabian Schmidt,Nils Seibert,Markus Enzweiler,Abhinav Valada,Arne Roennau

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:resilient Simultaneous Localization, diverse environments fundamentally, environments fundamentally relies, Autonomous navigation, resilient Simultaneous

备注

点击查看摘要

Abstract:Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

28. 【2606.19062】DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

链接https://arxiv.org/abs/2606.19062

作者:Kaleem Ullah,Altaf Hussain,Muhammad Munsif,Sung Wook Baik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:today media-driven world, made retrieving semantically, retrieving semantically relevant, semantically relevant videos, queries increasingly critical

备注

点击查看摘要

Abstract:In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.

29. 【2606.19053】Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

链接https://arxiv.org/abs/2606.19053

作者:Hong-Tao Yu,Chen-Wei Xie,Yuxin Peng,Serge Belongie,Xiu-Shen Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, demonstrated remarkable multimodal, remarkable multimodal perception, Recent advancements, advancements in Large

备注

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks-fundamental to computer vision-remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at this https URL.

30. 【2606.19046】Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

链接https://arxiv.org/abs/2606.19046

作者:Shan Fan,Feng Zhang,Jianjun Wang,Xi-Le Zhao,Tingwen Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:tensor nuclear norm, tensor tubal rank, tensor Frobenius norm, paper addresses low-rank, low-rank tensor completion

备注

点击查看摘要

Abstract:This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct a LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.

31. 【2606.19019】FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

链接https://arxiv.org/abs/2606.19019

作者:Yuchen Rao,Xuqian Ren,Yinyu Nie,Sayan Deb Sarkar,Biao Zhang,Vincent Lepetit,Friedrich Fraundorfer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:casual image captures, image captures remains, Recovering complete, representations of objects, significant challenge

备注: Project page: [this https URL](https://yuchenrao.github.io/projects/flowObject/flowObject.html)

点击查看摘要

Abstract:Recovering complete 3D representations of objects from few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from ''synthetic bias'' where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between ''synthetic-looking'' generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.

32. 【2606.18992】Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

链接https://arxiv.org/abs/2606.18992

作者:Amsisan Tran,Baogh Le,Tuan Kiet Pham,Sui Yang Guang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Composed image retrieval, Composed image, modification to search, CIR, text questions

备注

点击查看摘要

Abstract:Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.

33. 【2606.18974】Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

链接https://arxiv.org/abs/2606.18974

作者:Pengyu Li,Zhitao Gao,Lingling Zhang,Muye Huang,Yuanming Li,Fangzhi Xu,Jun Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unified multimodal models, Unified multimodal, generated visual thoughts, interleave generated visual, improve spatial tasks

备注

点击查看摘要

Abstract:Unified multimodal models (UMMs) interleave generated ''visual thoughts'' (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation(Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.

34. 【2606.18970】A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

链接https://arxiv.org/abs/2606.18970

作者:Syed Mujtaba Haider,Silvia Figini

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:frequently reporting accuracy, reporting accuracy gains, frequently reporting, accuracy gains, quantum generative models

备注: This work has been submitted to the IEEE for possible publication. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion:synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse thanits classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

35. 【2606.18960】Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

链接https://arxiv.org/abs/2606.18960

作者:Zirui Zheng,Jiaqian Yu,Xiongfeng Peng,jun shi,Mingyi Li,Chao Zhang,Weiming Li,Dong Wang,Huchuan Lu,Xu Jia

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:generating action-consistent video, costly real-world experimentation, action-consistent video rollouts, robot learning, offering a scalable

备注

点击查看摘要

Abstract:Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.

36. 【2606.18955】Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

链接https://arxiv.org/abs/2606.18955

作者:Runze Xu,Yiluo Zhang,Jian Wang,Yu Wang,Jincheng Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:high-fidelity action annotations, diverse robotic datasets, typically requires massive, models typically requires, Training generalist

备注: Accepted to IROS 2026

点击查看摘要

Abstract:Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

37. 【2606.18952】SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

链接https://arxiv.org/abs/2606.18952

作者:Hongzhou Dong,Zili Zhang,Ziting Wen,Yiheng Qiang,Runrong Deng,Wenle Dong,Ziwen Jiang,Xinyang Li,Rui Lu,Shuoyao Sun,Wenyu Wang,Ziyi Xia,Haitao Zheng,Guodong Shi,Xiaoqiang Ren

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unique measurement noise, offering unique potential, fundamentally challenging due, jointly complicate geometric, single photon perception

备注

点击查看摘要

Abstract:Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved this http URL, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBenc comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior,standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

38. 【2606.18943】Physics-IQ Verified

链接https://arxiv.org/abs/2606.18943

作者:Tim Rädsch,Yuki M Asano,Hilde Kuehne,Stefan Bauer,Priyank Jaini,Robert Geirhos,Carsten T. Lüth

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including world modeling, multitude of downstream, downstream tasks, including world, world modeling

备注

点击查看摘要

Abstract:Video generative models ( VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6\% of all samples and improves over 34.8\% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at this https URL

39. 【2606.18906】BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

链接https://arxiv.org/abs/2606.18906

作者:Chaewon Park,Soyoon Lee,Naeun Lee,Minjung Shin,Seogkyu Jeon,Kibeom Hong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Real image editing, enables precise manipulation, Real image, Source Dominance Leakage, Dominance Leakage

备注: Preprint

点击查看摘要

Abstract:Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose \textbf{BindEdit}, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

40. 【2606.18894】Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction

链接https://arxiv.org/abs/2606.18894

作者:Jonas Naumann,Jonas P. Appels,Julius Biermann,Christopher Gorsky,Timo de Wolff,Christoph Brauer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:carbon-fiber reinforced polymer, high-resolution carbon-fiber reinforced, reinforced polymer micrographs, present an automated, carbon-fiber reinforced

备注

点击查看摘要

Abstract:We present an automated approach to distinguish between ply instances in semantic segmentation masks of high-resolution carbon-fiber reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices, enables us to use a shortest-path algorithm yielding the ply-separating paths. Thereby, we bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach on high-resolution micrographs featuring a broad range of characteristics like artificially added gaps in single or multiple plies, different stacking sequences and ply traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths, allows for a comprehensive, quantitative ply analysis with respect to its microstructural properties like the local fiber volume fraction as well as locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing-induced inhomogeneities, draw conclusions on manufacturing parameters and link mechanical properties to underlying microstructural imperfections.

41. 【2606.18886】DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

链接https://arxiv.org/abs/2606.18886

作者:Haoyu Hu,Xiyao Ma,Shiqi Liu,Linsen Zhang,Xiaoliang Xie,Xiaohu Zhou,Zeng-Guang Hou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated remarkable semantic, remarkable semantic discrimination, demonstrated remarkable, remarkable semantic, semantic discrimination

备注: Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication

点击查看摘要

Abstract:Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurpose the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

42. 【2606.18885】LARE: Low-Attention Region Encoding for Text-Image Retrieval

链接https://arxiv.org/abs/2606.18885

作者:Abdulmalik Alquwayfili,Faisal Almeshal,Jumanah Almajnouni,Leena Alotaibi,Faisal Alhajari,Mohammed Alkhrashi,Alreem Almuhrij,Abdullah Aldwyish,Raied Aljadaany,Huda Alamri,Muhammad Kamran J. Khan

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:conventional visual encoders, Low-Attention Region Encoding, neglecting low-attention regions, salience bias, bias of conventional

备注: Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: [this https URL](https://github.com/AbdulmalikDS/LARE) ; Dataset: [this https URL](https://huggingface.co/datasets/AbdulmalekDS/Dense-Set)

点击查看摘要

Abstract:Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

43. 【2606.18884】Performance Gap Analysis between Latin and Arabic Scripts HTR

链接https://arxiv.org/abs/2606.18884

作者:Sana Al-azzawi,Elisa Barney,Marcus Liwicki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:systems perform worse, Recent studies, worse on Arabic-script, systems perform, Arabic-script datasets

备注: this paper accepted at TIPS workshop ICPR 2026

点击查看摘要

Abstract:Recent studies have shown that handwritten text recognition (HTR) systems perform worse on Arabic-script datasets than on Latin-script data. However, the reasons for this gap are still not well understood due to the lack of controlled comparisons. In this work, we present a comprehensive study of Arabic and Latin scripts HTR using a unified CRNN model for line-level HTR across nine datasets (including KHATT (Arabic), Muharaf (Arabic), NUST-UHWR (Urdu), PHTD (Persian), IAM (English), READ-2016 (German), and others) and di ferent training sizes (K in {100, 500, 1000, 2000, ..., Kfull}). Our results show the performance gap remains: it is large in low-resource settings, decreases with more data, but remains even at full scale, with a consistent difference of 5-7 CER points. We show that annotation quality matters, as many datasets contain labeling errors. Cleaning reduces error rates and narrows the gap, but does not eliminate it. In addition, we find that a fixed number of training samples provides less effective coverage in Arabic due to higher visual variability, requiring more data to learn similar representations. We compare recognition across datasets in terms of the number of text lines and the number of characters, showing an equivalence trade-off. We compare character frequency distributions across scripts and show that Arabic is significantly more heavy-tailed than Latin. Our error analysis reveals that around 30 percent of substitution errors in Arabic datasets (e.g., KHATT) are caused by confusion between visually similar characters, compared to about 15 percent in Latin-script datasets such as IAM.

44. 【2606.18876】st-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

链接https://arxiv.org/abs/2606.18876

作者:Veit Hucke,Thomas Pinetz,Gregor Reiter,Ursula Schmidt-Erfurth,Hrvoje Bogunović

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Optical coherence tomography, hinders automated analysis, low-cost devices hinders, devices hinders automated, Optical coherence

备注: Accepted in MICCAI

点击查看摘要

Abstract:Optical coherence tomography (OCT) is essential in ophthalmology, but inconsistent image quality especially in low-cost devices hinders automated analysis. To address this, we introduce a flow-matching-based test-time adaptation method that generates high-quality surrogate images from noisy inputs. Typically, domain gaps between test and training data cause pixel distribution mismatches during the denoising process. We overcome this by matching the test image's histogram to synthetic reference trajectories, successfully aligning the input with expected distributions. Additionally, we remove the network's time conditioning to account for slight deviations in real-world noise distributions. Our approach achieves state-of-the-art performance in segmenting critical biomarkers for two stages of Age-related Macular Degeneration (AMD). Code is available: this https URL.

45. 【2606.18872】Bridging Single Distortion Artifacts and Mmultifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

链接https://arxiv.org/abs/2606.18872

作者:Yuheng Tang,Alexander Ng,Wen Yan,Natasha Thorley,Pawel Rajwa,Yipei Wang,Aqua Asif,Clare Allen,Louise Dickinson,Francesco Giganti,Shonit Punwani,Daniel Alexander,Veeru Kasivisvanathan,Yipeng Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-quality diffusion-weighted imaging, multi-parametric MRI relies, MRI relies heavily, diffusion-weighted imaging, rectal air

备注

点击查看摘要

Abstract:Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming and suffers from a class imbalance where low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, there are $6\%$ images with PI-QUAL scores lower than 4, $87\%$ of DWI issues are due to distortion. Many of the other clinical quality issues are under-represented. To address this common dual-scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.

46. 【2606.18869】Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

链接https://arxiv.org/abs/2606.18869

作者:YuCheng Tang,Wen Yan,Alexander Ng,Natasha Thorley,Pawel Rajwa,Yipei Wang,Aqua Asif,Clare Allen,Louise Dickinson,Francesco Giganti,David Atkinson,Shonit Punwani,Daniel Alexander,Shaheer Ullah Saeed,Veeru Kasivisvanathan,Yipeng Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Single-shot echo-planar prostate, prostate diffusion-weighted imaging, echo-planar prostate diffusion-weighted, Single-shot echo-planar, derive reliable diagnoses

备注

点击查看摘要

Abstract:Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

47. 【2606.18861】URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

链接https://arxiv.org/abs/2606.18861

作者:Xinze Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Reconstructing simulation-ready digital, simulation-ready digital twins, violate basic dynamic, basic dynamic invariants, sensor observations remains

备注

点击查看摘要

Abstract:Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.

48. 【2606.18860】Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

链接https://arxiv.org/abs/2606.18860

作者:Hana Jebril,Thomas Pinetz,Günter Klambauer,Hrvoje Bogunović

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:transform clinical workflows, enabling high-fidelity longitudinal, high-fidelity longitudinal monitoring, Reliable pixel-level uncertainty, Reliable pixel-level

备注: Accepted at MICCAI 2026

点击查看摘要

Abstract:Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at this https URL

49. 【2606.18846】From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

链接https://arxiv.org/abs/2606.18846

作者:Like Zhang,Runliang Niu,Shiqi Wang,Xiyu Hu,Qianli Xing,Pan Wang,Qingzu He,Qi Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sophisticated grounded structured, grounded structured visual, Vision-language models, structured visual reasoning, rapidly advancing

备注: 14 pages, 7 figures

点击查看摘要

Abstract:Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1% point absolute gain. Our code is available at: this https URL.

50. 【2606.18841】Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

链接https://arxiv.org/abs/2606.18841

作者:Zhoupeng Guo,Yunqi Zhu,Zhihe Fan,Xinjie Yao,Ruipu Zhao,Boan Tao,Yiming Sun,Zhen Wang,Pengfei Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world dynamic environments, robust visual understanding, dynamic environments, crucial for robust, robust visual

备注

点击查看摘要

Abstract:Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73\% coevolutionary gain and a 7.86\% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at this https URL.

51. 【2606.18839】Semantic Robustness Certification for Vision-Language Models

链接https://arxiv.org/abs/2606.18839

作者:Peiyu Yang,Paul Montague,Feng Liu,Andrew C. Cullen,Amardeep Kaur,Christopher Leckie,Sarah M. Erfani

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision-language models, downstream tasks, Vision-language, semantic, Robustness

备注: Accepted to ICML

点击查看摘要

Abstract:Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

52. 【2606.18825】DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

链接https://arxiv.org/abs/2606.18825

作者:Luoyao Kang,Yuelin Zhang,Jiwei Shan,Haifan Gong,Qingpeng Ding,Shing Shin Cheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:volumes remains challenging, remains challenging due, speckle noise, surgical navigation, volumes remains

备注

点击查看摘要

Abstract:Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and the action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and poses information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

53. 【2606.18824】Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

链接https://arxiv.org/abs/2606.18824

作者:Yuxuan Xie,Nicolas Pugeault,Chongfeng Wei,Hubert P. H. Shum,Edmond S. L. Ho

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:future trajectory distributions, ego-centric camera, camera is challenging, depends on complex, future trajectory

备注: Accepted at The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

点击查看摘要

Abstract:Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. By modelling correlation and intent from the historical and future trajectories of the pedestrian, it will usually result in a multimodal (i.e. multiple modes) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions into semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head and hand gesture, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module to model the future trajectory distributions on two modes, crossing and non-crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic, which can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.

54. 【2606.18793】Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

链接https://arxiv.org/abs/2606.18793

作者:Dongbin Jiao,Yibo Lyu,Qiulu Wei,Fuxiang Lu,Shengcai Liu,Shi Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Data scarcity, distortion significantly limit, high-security authentication, Data, structural distortion significantly

备注

点击查看摘要

Abstract:Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($\Delta$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

55. 【2606.18788】HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

链接https://arxiv.org/abs/2606.18788

作者:Jaward Sesay,Yue Yu,Börje F. Karlsson

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Teaching machines, requires synthesizing stroke, single person handwriting, pressure and script, vary in shape

备注

点击查看摘要

Abstract:Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

56. 【2606.18787】Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

链接https://arxiv.org/abs/2606.18787

作者:Eito Ogawa,Hiroshi Watanabe

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unsigned Distance Field, Local-patch Unsigned Distance, important for consumer-grade, indoor scanning, Surface reconstruction

备注

点击查看摘要

Abstract:Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.

57. 【2606.18783】SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

链接https://arxiv.org/abs/2606.18783

作者:Yunus Sevim,Behçet Uğur Töreyin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe background clutter, weak spatial responses, Reweighted Explicit-visibility Enhanced, Explicit-visibility Enhanced Modulation, characterize detection quality

备注: Accepted at CVPR 2026 Workshops (PBVS). Published version: [this https URL](https://openaccess.thecvf.com/content/CVPR2026W/PBVS/html/Sevim_SCR-Guided_Difficulty-Aware_Optimization_for_Infrared_Small_Target_Detection_CVPRW_2026_paper.html)

点击查看摘要

Abstract:Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github. com/yall-in-one/Reemm.

58. 【2606.18780】SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

链接https://arxiv.org/abs/2606.18780

作者:Quanjiang Guo,Chong Mu,Jiazhou Pan,Ming Jia,Ling Tian,Hui Gao,Zhao Kang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:Named Entity Recognition, Multimodal Information Extraction, Multimodal Named Entity, Relation Extraction, Information Extraction

备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Multimodal Information Extraction (MIE)-covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE)-is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

59. 【2606.18765】SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

链接https://arxiv.org/abs/2606.18765

作者:Jiayu Tian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:MLP residual branch, flow-matching Diffusion Transformers, Transformers that adds, adds timestep-conditioned spectral, MLP residual

备注

点击查看摘要

Abstract:We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.

60. 【2606.18753】SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

链接https://arxiv.org/abs/2606.18753

作者:John Kalkhof,Boris Gutman(IIT),Emile d'Angremont(Amsterdam UMC),Daniel C. Alexander(UCL),Marco Lorenzi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:spatio-temporal brain atlas, scalable spatio-temporal brain, introduce SMART, Neural Cellular Automata, SMART

备注

点击查看摘要

Abstract:We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

61. 【2606.18749】oward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

链接https://arxiv.org/abs/2606.18749

作者:Tai Le-Gia

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:changing patient populations, heterogeneous acquisition protocols, handle heterogeneous acquisition, annotated training data, Zero-shot anomaly detection

备注

点击查看摘要

Abstract:Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.

62. 【2606.18732】Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

链接https://arxiv.org/abs/2606.18732

作者:Guillermo Rojas,Gonzalo Soto,Daniel Yunge

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Dynamic Vision Sensor, spiking neural networks, convolutional neural networks, integrate spiking neural, Dynamic Vision

备注: 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

点击查看摘要

Abstract:This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.

63. 【2606.18723】Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

链接https://arxiv.org/abs/2606.18723

作者:Yunshu Chen,Litao Yang,Giuseppe Di Giovanni,Jordan Tan,Deval Mehta,Andrew Lin,Derek Chew,Masasi Fujino,Julie Butters,Stephen Nicholls,Zongyuan Ge,Kyung Hoon Cho

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:external elastic membrane, quantitative coronary plaque, Intravascular ultrasound, plaque burden assessment, coronary plaque burden

备注: MICCAI2026 Accepted

点击查看摘要

Abstract:Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures(95HD (mm), ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.

64. 【2606.18721】Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

链接https://arxiv.org/abs/2606.18721

作者:Hong-Jun Choi,Jongho Lee,Jaeyoung Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Table Structure Recognition, predicting HTML sequences, Table Structure, Structure Recognition, achieves impressive results

备注

点击查看摘要

Abstract:Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance = 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at this https URL

65. 【2606.18707】PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

链接https://arxiv.org/abs/2606.18707

作者:Asad Channa,Abdullah Khan,Asghar Ali Chandio,Aamir Akbar,Shahzad Memon,Aqib Hussain,Ameer Hamza

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:finding melanomas earlier, Automated segmentation, deep learning models, deep learning methods, deep learning

备注

点击查看摘要

Abstract:Automated segmentation of skin lesions using deep learning models for dermoscopic images can be very helpful in finding melanomas earlier than they would normally be detected. However, most deep learning methods available do not perform well. The aim of this paper is to present a parameter-efficient fine-tuning method called PEFT-MedSAM for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. The PEFT-MedSAM method uses only the lightweight mask decoder for training the model while keeping the pre-trained image encoder and prompt encoder frozen. The experiments performed on the ISIC 2018 benchmark dataset shows that PEFT-MedSAM obtains a dice coefficient of .9411 and an intersection over union value of .8918 when compared to both a fully trained U-Net baseline (.8715 dice coefficient) and zero-shot MedSAM inference (.8997 dice coefficient). The external validation of the model using PH2 dataset shows .9467 dice coefficient with +/- .0310 standard deviation. Supportive evidence for these claims include a p-value less than .0001 for Wilcoxon signed rank tests comparing the two datasets and bootstrap-estimated 95% confidence intervals of [.9364,.9447] that represent the estimated range of possible values for the average dice coefficient obtained by repeating the test. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing game based evaluation methodology to evaluate the CNN baseline model on the validation set. The results showed that we had an accuracy rate of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.

66. 【2606.18702】UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

链接https://arxiv.org/abs/2606.18702

作者:Lin Zhang,Sicheng Mo,Zefan Cai,Jinhong Lin,Zihao Lin,Jiuxiang Gu,Krishna Kumar Singh,Yuheng Li,Yin Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieving strong performance, generation, video diffusion models, video, achieving strong

备注

点击查看摘要

Abstract:Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: this https URL

67. 【2606.18687】Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

链接https://arxiv.org/abs/2606.18687

作者:Sagun Singh Shrestha,Samuel Harding,Abdelwahed Khamis,Saimunur Rahman,Peyman Moghadam

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:diverse hardware platforms, bridge diverse hardware, recognition increasingly relies, hardware platforms, increasingly relies

备注: IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

点击查看摘要

Abstract:Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals. That is, projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD); a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations of the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.

68. 【2606.18682】Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

链接https://arxiv.org/abs/2606.18682

作者:Asad Channa,Asghar Ali Chandio,Akhtar Hussain Jalbani,Mehwish Leghari,Shahzad Memon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:MRI images continues, accurately classifying brain, MRI images, classifying brain tumors, deep learning

备注

点击查看摘要

Abstract:Despite recent advancements in deep learning, accurately classifying brain tumors from MRI images continues to pose challenges. In this research, we present a comprehensive evaluation of five different convolutional neural networks (CNN) architectures, including a customized baseline model and four pre-trained models - for use in classifying multi-class brain tumors using a clinically-sourced dataset of approximately 10,000 MRI images. We have utilized five different architectures; VGG16, VGG19, DenseNet121, and EfficientNetB0, which were all tested and trained within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall as a means to measure the clinically-relevant performance of each architecture. We found that EfficientNetB0 had the best overall classification accuracy at 95%, when compared to the other architectures tested; specifically VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%) and the customized CNN (78.00%). An especially important finding of our research was the considerable improvement in detecting meningiomas; specifically, while simple CNNs could detect meningiomas with a recall rate of approximately 20%, EfficientNetB0 was able to detect meningiomas with a recall rate of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, an interesting finding was that the deeper VGG19 performed worse than the shallower VGG16. This indicates that in many cases the architectural efficiency of a CNN model may be more important than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the optimal trade-off between classification accuracy, number of parameters used in the model and clinically meaningful performance.

69. 【2606.18681】Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

链接https://arxiv.org/abs/2606.18681

作者:Jaeyeon Lee,Shunjie Wen,Dong-Wan Choi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Vision Language, Language Models, incur substantial computational, substantial computational overhead

备注: ECCV 2026 Under Review

点击查看摘要

Abstract:Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance score can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.

70. 【2606.18676】InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

链接https://arxiv.org/abs/2606.18676

作者:Qinqin Zhou,Fuhai Chen,Jipeng Wu,Zhiwei Chen,Zhikai Hu,Weiwei Cai

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:promises efficient discovery, Training-free neural architecture, search promises efficient, costly training, Training-free neural

备注

点击查看摘要

Abstract:Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we hypothesize is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.

71. 【2606.18675】BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

链接https://arxiv.org/abs/2606.18675

作者:Md Taimur Ahad,Bo Song,Yan Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Resonance Imaging MRI, Magnetic Resonance Imaging, Convolutional Neural Networks, Gated Recurrent Units, Networks CNNs Vision

备注

点击查看摘要

Abstract:The noise of Magnetic Resonance Imaging MRI poses challenges for Deep Learning DL when tumor boundaries are obscured tumor location and appearance are complex Therefore we develop BrainFusionNet that combines Convolutional Neural Networks CNNs Vision Transformers ViT and Gated Recurrent Units GRUs to extract spatial contextual and sequential features from MRI images for improved brain tumor classification Furthermore explainable AI such as SHAP LIME and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNets decisionmaking process The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets Kfold validation suggests 98 accuracy on both datasets The model was compared with the six stateoftheart SOTA CNNs and transfer learning Among the SOTA CNNs DenseNet121 and VGG16 achieved the highest accuracy of 96 The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images even in smallscale tumor regions and small tumor sizes The model has a balanced sequential CNN architecture to capture lowlevel and deeperlayer features a customized ViT that captures local features stabilizes gradient flow and reduces the risk of vanishing gradients during MRI image training The CNN and ViT outputs are fed into a GRU for final classification Furthermore we analyze pixel intensities to determine whether MRI image quality affects image classification Our findings are very novel in image interpretation as we found that the distribution of pixel intensities in MRI images affects DL performance

72. 【2606.18661】LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

链接https://arxiv.org/abs/2606.18661

作者:Chengfu Liu,Dongyang Hou,Junwu Xiang,Cheng Yang,Xuezhi Cui,Zeyuan Wang,Liangtian Liu,Zelang Miao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:current paradigms struggle, simultaneously extract visual, extract visual features, Intelligent landslide hazard, complex geological scenarios

备注

点击查看摘要

Abstract:Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

73. 【2606.18658】On-Manifold Variational Learning with Heat-Kernel Priors

链接https://arxiv.org/abs/2606.18658

作者:Jiarui Xing,Tal Zeevi,Nian Wu,Jian Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Learning unsupervised representations, true pathological heterogeneity, medical imaging cohorts, reveal clinically meaningful, capture true pathological

备注

点击查看摘要

Abstract:Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. \rev{The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting.} On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.

74. 【2606.18644】Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

链接https://arxiv.org/abs/2606.18644

作者:Chen Zhao,Xiantao Hu,Song Wu,Qian Wang,Chen Wu,Rui Xie,Jian Yang,Ying Tai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:garnered significant interest, computer vision due, Spiking neural networks, neural networks, biological inspiration

备注: Accepted by Pattern Recognition

点击查看摘要

Abstract:Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In the paper, we explore the benefits of discrete wavelet transformation and propose a spiking pyramid wavelet-based model (SPWM) for high-efficient and low-energy target. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications of resource-limited devices.

75. 【2606.18623】Intrinsic 4D Gaussian Segmentation from Scene Cues

链接https://arxiv.org/abs/2606.18623

作者:Hasan Yazar,Mohamed Rayan Barhdadi,Erchin Serpedin,Mehmet Tuncel,Hasan Kurban

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Gaussian Splatting reconstructs, Splatting reconstructs deforming, Gaussian Splatting, reconstructs deforming scenes, Splatting reconstructs

备注: 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

点击查看摘要

Abstract:Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.

76. 【2606.18610】SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

链接https://arxiv.org/abs/2606.18610

作者:Wei-Cheng Tseng,Gashon Hussein,Yuzhu Dong,Allen Z. Ren,Lucy X. Shi,XuDong Wang,Sergey Levine,Zhaoshuo Li,Jinwei Gu,Florian Shkurti,Ming-Yu Liu,Quan Vuong

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Evaluating generalist robot, generalist robot manipulation, Evaluating generalist, robot manipulation policies, difficult to scale

备注

点击查看摘要

Abstract:Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.

77. 【2606.18609】Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

链接https://arxiv.org/abs/2606.18609

作者:Nan Zhou,Ke Zou,Meng Liu,Linchao He,Jiaqi Zhu,Yi Zhang,Hu Chen,Huazhu Fu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenged by trust-undermining, Vision-Language models, CoEV, generated text, hallucinations

备注: MICCAI 2026 Accept. Submission Version

点击查看摘要

Abstract:Vision-Language models (VLMs) reliability in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Co}unter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement into a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in this http URL hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% absolute points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

78. 【2606.18591】Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

链接https://arxiv.org/abs/2606.18591

作者:Denis Savytski,Aiden Lei,Heding Liu,Warren Yang,Sihan Liang,Alexander Liu,Zhe Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:creation increasingly accessible, made content creation, content creation increasingly, lack narrative coherence, AI-generated videos lack

备注: Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

点击查看摘要

Abstract:Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from the audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete short 10-minute film with a complicated plot.

79. 【2606.18588】Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

链接https://arxiv.org/abs/2606.18588

作者:Wenqi Jia,Zhewen Hu,Ying Huang,Yu Gong,Stavros Kalafatis,Yuke Wang,Wei Niu,Chengming Zhang,Ang Li,Sheng Di,Yuede Ji,Bo Fang,Miao Yin

类目:Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)

关键词:requires optimizing hundreds, scenes requires optimizing, Gaussian Splatting, enables high-fidelity, high-fidelity and real-time

备注: 17 pages, 25 figures

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) enables high-fidelity and real-time 3D scene reconstruction, but scaling training to large-scale scenes requires optimizing hundreds of millions of Gaussians across multiple GPUs. Existing distributed approaches either partition scenes into isolated regions, causing global inconsistency, or rely on global Gaussian-level exchanges, which lead to substantial growth in inter-GPU communication and quickly dominate iteration time. We propose Splaxel, a communication-efficient distributed 3DGS training framework based on pixel-level local rendering and global composition. Instead of synchronizing Gaussians, each GPU renders its local subset and exchanges only partial pixel values, maintaining mathematical consistency while keeping communication cost stable as the scene size increases. Splaxel further reduces pixel-level redundancy through geometric and transmittance visibility prediction and improves GPU utilization via conflict-free camera-view consolidation. Evaluated on large-scale datasets with up to 120M Gaussians, Splaxel achieves up to 7.6$\times$ speedup over the state-of-the-art distributed 3DGS framework while preserving high reconstruction quality.

Comments:
17 pages, 25 figures

Subjects:

Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2606.18588 [cs.DC]

(or
arXiv:2606.18588v1 [cs.DC] for this version)

https://doi.org/10.48550/arXiv.2606.18588

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
80. 【2606.18586】APT: Atomic Physical Transitions for Causal Video-Language Understanding

链接https://arxiv.org/abs/2606.18586

作者:Shang Wu,Haoran Lu,Songling Liu,Chenwei Xu,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Zhaoran Wang,Han Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:APT, Physical, Atomic Physical Transitions, APT chains, event

备注

点击查看摘要

Abstract:Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

81. 【2606.18583】Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

链接https://arxiv.org/abs/2606.18583

作者:Yandi Yang,Xianghong Zou,Jianping Li,Haofeng Xie,Saurav Uprety,Hongzhou Yang,Naser El-Sheimy

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:LiDAR place recognition, place recognition, place recognition determines, LiDAR place, aerial-ground LiDAR place

备注

点击查看摘要

Abstract:LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the priors that neighboring point cloud patches share similar semantics with anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8\% improvement in average Recall@1 and a 3.2\% improvement in average Recall@1\% on the CS-Urban-Scenes, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9\% on CS-Campus3D and 10.2\% on CS-Urban-Scenes without additional training.

82. 【2606.18582】chnical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

链接https://arxiv.org/abs/2606.18582

作者:Jaeil Park,Hyobin Choi,Sangjin Lee,Hyungtae Lim,Sung-Hoon Yoon

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:Field Robotics evaluates, Fine-Grained Semantic Segmentation, dense semantic segmentation, Semantic Segmentation Challenge, Robotics evaluates dense

备注: 5 pages, 4 figures

点击查看摘要

Abstract:The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: this http URL.

83. 【2606.18566】Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

链接https://arxiv.org/abs/2606.18566

作者:Hao-Yuan Ma,Li Zhang,Yushi Qiu,Jie Gao,Yan Zhang,Bangjun Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

关键词:low-light crowd counting, Crowd counting, low-light crowd, computer vision, Crowd

备注

点击查看摘要

Abstract:Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA\_Dark and SHB\_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

84. 【2606.18565】Experimental Analysis of Neural Network-Based Image Classification on the CIFAR-10 Dataset

链接https://arxiv.org/abs/2606.18565

作者:Necati Kagan Erkek,Emre Balci,Berkin Halay

类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:neural image classification, convolutional network formulations, benchmark is presented, network formulations, investigation of neural

备注: 7 pages

点击查看摘要

Abstract:An experimental investigation of neural image classification on the CIFAR-10 benchmark is presented through fully connected and convolutional network formulations. The analysis emphasizes the complete learning pipeline: image vectorization, normalization, one-hot class encoding, supervised loss minimization, learning-rate selection, mini-batch training, convolutional feature extraction, max-pooling, and validation-based generalization assessment. A convolutional architecture with six convolutional layers and three max-pooling stages is evaluated for ten training epochs using a batch size of 128 and an Adam optimizer with a learning rate of 0.001. The validation accuracy reaches approximately 74.77%, while the validation loss begins to increase after the middle of training despite continued reduction in training loss. The resulting behavior illustrates the practical difference between representation learning and memorization, and it provides a compact experimental baseline for future studies on regularization, data augmentation, deeper architectures, and reproducible image-classification education.

85. 【2606.18558】MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

链接https://arxiv.org/abs/2606.18558

作者:Jianing Zhang,Chenhao Zheng,Yajun Yang,Max Argus,Rustin Soraki,Winson Han,Taira Anderson,Chun-Liang Li,Shuo Liu,Jiafei Duan,Zhongzheng Ren,Jieyu Zhang,Ranjay Krishna

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:agents must anticipate, plan actions, reason about physical, physical interactions, move in order

备注

点击查看摘要

Abstract:Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

86. 【2606.18555】Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

链接https://arxiv.org/abs/2606.18555

作者:Trong-Vu Hoang,Quang-Binh Nguyen,Dinh-Khoi Vo,Hoai-Danh Vo,Minh-Triet Tran,Trung-Nghia Le

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:presents challenges due, recognition presents challenges, indoor image recognition, diverse object arrangements, image recognition presents

备注: MAPR 2024

点击查看摘要

Abstract:In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lacks of training indoor images, we introduce a novel approach leveraging Stable Diffusion (SD) for the generation of synthetic images, which serve as a powerful data augmentation tool. The utilization of SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a counter measure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE presentation enables training robust classifiers only using lightweight deep models. Experiments show that our approach can perfectly recognize SD generated images with the accuracy of 100% using MobilenetV3.

87. 【2606.18554】Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

链接https://arxiv.org/abs/2606.18554

作者:Duc-Manh Phan,Quoc-Duy Tran,Duy-Khang Do,Anh-Tuan Vo,Hai-Dang Nguyen,Trong Le Do,Mai-Khiem Tran,Vinh-Tiep Nguyen,Tam V. Nguyen,Isao Echizen,Minh-Triet Tran,Trung-Nghia Le

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:distinguish authentic content, resemble real photographs, closely resemble real, highly photorealistic synthetic, making it increasingly

备注: SOICT 2025

点击查看摘要

Abstract:The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50\% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

88. 【2606.18553】Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

链接https://arxiv.org/abs/2606.18553

作者:Minh-Loi Nguyen,Xuan-Vu Le,Long-Bao Nguyen,Hoang-Bach Ngo,Trung-Nghia Le

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Traditional image captioning, image captioning methods, Traditional image, methods often struggle, details not directly

备注: SOICT 2025

点击查看摘要

Abstract:Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content--visual, visual--visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at this https URL.

89. 【2606.18528】A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

链接https://arxiv.org/abs/2606.18528

作者:Kecia G. de Moura,Robert Sabourin,Rafael M. O. Cruz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Offline handwritten signature, Offline handwritten, handwritten signature verification, signature verification aims, static images

备注: Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2026

点击查看摘要

Abstract:Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experiments results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at this https URL.

90. 【2606.18510】Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

链接https://arxiv.org/abs/2606.18510

作者:Ngela Landon Ntung,Floride Tuyisenge,Jema David Ndibwile

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:Presentation Attack Detection, Face Presentation Attack, Attack Detection, Presentation Attack, existing approaches exhibit

备注: 8 Pages, 4 Figures, 5 Tables

点击查看摘要

Abstract:Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.

91. 【2606.18496】Neural Phase Correlation

链接https://arxiv.org/abs/2606.18496

作者:Cole Reynolds

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Correspondence is fundamentally, fundamentally relational, common scene, seeks the unknown, Correspondence

备注

点击查看摘要

Abstract:Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2606.18496 [cs.CV]

(or
arXiv:2606.18496v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.18496

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
92. 【2606.18484】Vines-DB: An RGB image dataset for multi-species ornamental vine segmentation

链接https://arxiv.org/abs/2606.18484

作者:Saroj Burlakoti,Utsav Bhandari,Aaron Etienne,Shital Poudyal(Utah State University)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Utah Agricultural Experiment, Agricultural Experiment Station, Experiment Station Greenville, Station Greenville Research, Greenville Research Farm

备注: 7 pages, 1 figure. Source data repository: OSF (DOI: [https://doi.org/10.17605/OSF.IO/YJHCK](https://doi.org/10.17605/OSF.IO/YJHCK) )

点击查看摘要

Abstract:The Vines-DB dataset contains 1,218 original high-resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station's Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July-October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2m x 2.4m trellises and photographed from a distance of 1m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana 'Madame Galen', Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon-based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines-DB supports the development and evaluation of deep learning models for multi-class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset's utility for segmentation benchmarking under realistic field conditions.

93. 【2606.18478】Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

链接https://arxiv.org/abs/2606.18478

作者:Siyi Chen,Shaowei Liu,Yixuan Jia,Zian Wang,Huan Ling,Qing Qu,Jun Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent progress, efficient few-step students, distilling multi-step video, multi-step video diffusion, Distribution Matching Distillation

备注

点击查看摘要

Abstract:Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback--Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line of code change. At its core is the teacher score discrepancy to guide the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100--300 steps of finetuning, DFD effectively restores diversity and fidelity on both Wan2.1-1.3B and Cosmos-Predict2.5-2B model, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforms the teacher model.

94. 【2606.18472】Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

链接https://arxiv.org/abs/2606.18472

作者:Sneha Paul,Zachary Patterson,Nizar Bouguila

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Domain adaptation remains, adaptation remains, remains a central, central challenge, visual and textual

备注: Accepted at Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.

95. 【2606.18441】Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

链接https://arxiv.org/abs/2606.18441

作者:Chengwen Liu,Zhe Huang,Jisheng Dang,Hong Peng,Qi Tian,Tat-Seng Chua

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, multimodal large language, language models, large language, Reinforcement learning

备注

点击查看摘要

Abstract:Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at this https URL.

96. 【2606.18439】RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

链接https://arxiv.org/abs/2606.18439

作者:Jinhao You(1),Shuo Lyu(1),Zhuohang Lyu(1),Tanxuan Li(1),Zibo Zhao(1),Jiaxiang Hu(2),Kai Tang(3),Yichen Guo(3) ((1) University of Pennsylvania, (2) University of California, Irvine, (3) Nanyang Technological University)

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Visual Geometry Grounded, Geometry Grounded Transformer, Grounded Transformer, Visual Geometry, forward pass

备注: 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

点击查看摘要

Abstract:Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

97. 【2606.18429】CAOA -- Completion-Assisted Object-CAD Alignment

链接https://arxiv.org/abs/2606.18429

作者:Hiranya Garbha Kumar,Minhas Kamal,Balakrishnan Prabhakaran

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Accurately aligning CAD, aligning CAD models, Accurately aligning, indoor RGB-D scans, RGB-D scans

备注: GitHub: [this https URL](https://github.com/MinhasKamal/CAOA)

点击查看摘要

Abstract:Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose-position, rotation, and scale along three axes-but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap-validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.

98. 【2606.18318】Budget-Aware Adaptive Adversarial Patches for Black-Box Object Detection

链接https://arxiv.org/abs/2606.18318

作者:Pedram MohajerAnsari,Amir Salarpour,David Fernandez,Mert D. Pesé

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:Adversarial patches pose, emph, Adversarial patches, pose a practical, practical threat

备注: Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026)

点击查看摘要

Abstract:Adversarial patches pose a practical threat to modern object detectors. Prior work shows vulnerability, but three gaps limit actionable insight: (i) few \emph{score-based black-box} attacks \emph{jointly} optimize patch \emph{location, texture, and size} under tight query budgets; (ii) success is rarely tied to the patch's \emph{visual footprint}; and (iii) evaluations often conflate EOT robustness with plain-view suppression. We present \method{}, a query-efficient, budget-adaptive black-box attack that couples a lightweight \emph{Contextual Thompson-Sampling} placer with NES-style pixel updates, growing the patch only when progress stalls. Reporting is anchored by a \emph{strict plain-image} suppression test; EOT is audited but never used as a substitute for success, and optional appearance/printability weights expose strength--visibility trade-offs. Across YOLOv5, Faster R-CNN, and YOLOS, \method{} achieves strong suppression on CNN-based detectors and substantial suppression on the transformer-based detector, using compact patches and exposing clear query--footprint trade-offs relative to fixed-size and heuristic baselines. A print--capture pilot further shows transfer across unseen physical objects and viewpoints.

99. 【2606.18826】EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

链接https://arxiv.org/abs/2606.18826

作者:Yoshiyuki Shirasaki,Ryoichi Horisaki

类目:Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:neural radiance fields, implicit neural representations, high-fidelity neural radiance, construct high-fidelity neural, neural representations

备注

点击查看摘要

Abstract:We propose a method for extending the depth-of-field (DoF) to construct high-fidelity neural radiance fields (NeRF) -- an emerging technique for rendering photorealistic novel views from a dataset of images captured at different viewpoints, based on implicit neural representations. The trade-off between DoF and light quantity is inherent not only in conventional cameras but also in NeRF, since the datasets used by NeRF are captured by these cameras. To address this issue, we introduce a coded aperture placed at the camera pupil, preserving spatial frequency components under defocused conditions. We develop a camera model incorporating coded apertures into NeRF, allowing direct input of coded images and enabling the generation of novel views with an extended DoF. We validate the proposed method, termed extended DoF-NeRF (EDoF-NeRF), through simulations and experiments, demonstrating its superior performance compared to conventional aperture cameras.

100. 【2606.18523】DART: A design-aware microfluidic chip paradigm for real-time live-cell image analysis

链接https://arxiv.org/abs/2606.18523

作者:Johannes Seiffarth,Matthias Pesch,Lukas Scholtes,Dietrich Kohlheyer,Hanno Scharr,Katharina Nöh

类目:Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)

关键词:High-throughput microfluidic live-cell, rich single-cell data, live-cell imaging generates, imaging generates rich, generates rich single-cell

备注

点击查看摘要

Abstract:High-throughput microfluidic live-cell imaging generates rich single-cell data. Yet semi-automated procedures for locating regions of interest (RoIs), each containing one cell population, and removing surrounding microfluidic structures from recorded images, scale with the number of RoIs. This prevents real-time image analysis and delays time-to-insight by hours to days. We introduce the Design-Aware and Real-Time capable (DART) paradigm for microfluidic cultivation chips, which aligns the CAD blueprint with the physical chip and thereby enables throughput-independent localization of all RoIs and fully automated image processing across diverse RoI geometries and chip layouts. DART establishes this alignment through embedded fiducial markers and deep-learning-based marker detection. We validate DART using the Swiss Army Knife chip, which combines eight structurally distinct RoI designs across 1164 RoI locations. DART localizes all RoIs in five minutes, removes microfluidic structures from raw microscopy images in 40 ms, and performs fully automated image analysis, including cell segmentation, in under 1.1 s per image. Together, these capabilities establish DART as an end-to-end hardware-software paradigm with real-time-capable analysis that paves the way toward closed-loop and outcome-driven smart microscopy.