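This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

1376 papers were updated today, including:

  • Natural language processing: 186
  • Information retrieval: 35
  • Computer vision: 291

Natural Language Processing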

1. 【2606.17056】The Value Axis: Language Models Encode Whether They're on the Right Track

Link: https://arxiv.org/abs/2606.17056

Authors: Nick Jiang, Isaac Kauvar, Jack Lindsey

Categories: Computation and Language (cs.CL)

Keywords: models internally track, current trajectory, internally track, ongoing strategy, strategy will achieve

Comments: Code repository: [this https URL](https://github.com/nickjiang2378/value-axis)

Abstract:We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g. use a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.

2. 【2606.17053】Context-Aware RL for Agentic and Multimodal LLMs

Link: https://arxiv.org/abs/2606.17053

Authors: Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, Large language, answering requires identifying, requires identifying, identifying a small

Comments: 29 pages, 9 figures

Abstract: Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.

3. 【2606.17041】Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Link: https://arxiv.org/abs/2606.17041

Authors: Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: ECO-guided study selection, ECO-guided study, study selection, statistical aggregation, demanding form

Comments: 13 pages, 7 figures, preprint for arXiv, dataset and code available at [this https URL](https://github.com/BFTree/MetaSyn)

Abstract:Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
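The contrast between a retrieval ceiling and end-to-end recovery can be made concrete with a small sketch. This is only an illustration of stage-attributed recall under assumed definitions; the function name `stage_metrics`, its arguments, and the three metric names are hypothetical and are not taken from the MetaSyn paper or its code.

```python
from typing import Dict, Iterable


def stage_metrics(
    gold: Iterable[str],
    retrieved: Iterable[str],
    included: Iterable[str],
) -> Dict[str, float]:
    """Attribute recall to the retrieval and screening stages.

    gold:      IDs of studies the expert meta-analysis included (assumed ground truth)
    retrieved: IDs returned by the retrieval stage (e.g. top-K)
    included:  IDs the screening stage finally kept
    """
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold must contain at least one study ID")
    retrieved_set = set(retrieved)
    # A study can only survive screening if retrieval surfaced it first.
    included_set = set(included) & retrieved_set

    reachable = gold_set & retrieved_set   # gold studies that retrieval found
    recovered = gold_set & included_set    # gold studies that also passed screening

    retrieval_recall = len(reachable) / len(gold_set)
    # Screening recall is conditional on what retrieval made reachable.
    screening_recall = len(recovered) / len(reachable) if reachable else 0.0
    end_to_end_recall = len(recovered) / len(gold_set)
    return {
        "retrieval_recall": retrieval_recall,
        "screening_recall": screening_recall,
        "end_to_end_recall": end_to_end_recall,
    }


if __name__ == "__main__":
    print(stage_metrics(["a", "b", "c", "d"], ["a", "b", "c", "x", "y"], ["a", "x"]))
```

Under these definitions the end-to-end recall is the product of the two stage recalls, so a low final score with a high retrieval recall points at screening rather than retrieval.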

4. 【2606.17034】KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing

Link: https://arxiv.org/abs/2606.17034

Authors: Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Post-hoc context erasing, Post-hoc context, global consequence, local edit, long-context LLM applications

Comments: Oral at the ICML 2026 Workshop on the Impact of Memorization on Trustworthy Foundation Models

Abstract:Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, making its computational cost depend on suffix length rather than erased-span length. We introduce KVEraser, a learned KV-cache editing method for efficient localized context erasing. Given a processed context and a span to remove, KVEraser replaces only the KV states of the erased interval with learned steering states while reusing the remaining cache unchanged. To learn a transferable erasing mechanism, we build a two-stage training pipeline: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, while task-specific fine-tuning adapts this capability to downstream scenarios. Experiments show that KVEraser nearly matches full recomputation in post-erasure performance on in-domain tasks across 1K--32K context lengths, while its latency increases by only 24% compared with a 17.6x increase for full recomputation. KVEraser also generalizes to unseen long-document QA tasks with harmful factual distractors, achieving the best performance among approximate baselines with a 3--4x speedup over full recomputation.

5. 【2606.17029】DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

Link: https://arxiv.org/abs/2606.17029

Authors: Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He

Categories: Computation and Language (cs.CL)

Keywords: agents synthesize long-form, searching and reasoning, reasoning over retrieved, synthesize long-form reports, Deep research agents

Abstract:Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query--rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query--rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query--rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.

6. 【2606.17016】TokenPilot: Cache-Efficient Context Management for LLM Agents

Link: https://arxiv.org/abs/2606.17016

Authors: Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: LLM agents, context accumulation drives, long-horizon sessions, agents are deployed, deployed in long-horizon

Comments: LightMem Series: Work in Progress

Abstract:As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at this https URL.

7. 【2606.16999】Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

Link: https://arxiv.org/abs/2606.16999

Authors: Mehmet Iscan

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Frozen small code, small code models, Frozen small, run locally, locally without fine-tuning

Comments: 33 pages, 4 figures, 8 tables

Abstract: Frozen small code models (≤1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model's samples without retraining; in principled form it is Popperian: attack each candidate with a severe test, keep what survives. We measure whether such operators help. Under one deterministic execution oracle and a leakage-free, matched-compute protocol, 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) are evaluated against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative is mechanistic: a coverage wall (systematic hard-task failures deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate ≤ alpha at zero observed harm unless n ≥ 45. Two operators help on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs the standard extractor discards (robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.

8. 【2606.16934】Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

Link: https://arxiv.org/abs/2606.16934

Authors: Patomporn Payoungkhamdee, Napat Laosaengpha, Jenta Wonglertsakul, Pittawat Taveekitworachai, Pume Tuchinda, Panjapong Poobanchuen, Ekapol Chuangsuwanich, Can Udomcharoenchaikit, Samuel Cahyawijaya, Peerat Limkonchotiwat, Sarana Nutanong

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Code Interpreter, Reasoning, effective code reasoning, paradigm for enhancing, executable computation

Abstract:Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in a certain model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and limitations of leveraging key properties to improve CI-based reasoning.

9. 【2606.16910】IMPACTeen: Intentions, Manipulation, Persuasion, Annotations, and Consequences in Teen Communication Dataset

Link: https://arxiv.org/abs/2606.16910

Authors: Aleksander Szczęsny, Wiktoria Mieleszczenko-Kowszewicz, Maciej Markiewicz, Beata Bajcar, Tomasz Adamczyk, Jolanta Babiak, Grzegorz Chodak, Przemysław Kazienko

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: scenarios spanning interpersonal, influence scenarios spanning, textual social influence, social influence scenarios, spanning interpersonal

Abstract:IMPACTeen is a dataset of textual social influence scenarios spanning interpersonal, media-based, and digital settings in an adolescent context. It contains 1,021 texts, 5,100 individual annotation records, and gold labels for social influence techniques, with each text annotated from five distinct perspectives: teenagers, parents, psychologists, communication experts, and teachers. The resource was constructed through constrained LLM generation, followed by a two-step human editing and validation phase aimed at ensuring youth-context realism. A multi-dimensional annotation covered influence presence, techniques, intentions, consequences, resistance, reactions, and annotation confidence. The dataset supports research on social influence detection, annotator disagreement, cross-lingual modeling, and the training and evaluation of language models. The dataset was created in Polish and is accompanied by a corresponding English version.

10. 【2606.16908】LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

Link: https://arxiv.org/abs/2606.16908

Authors: Amr Mohamed, Guokan Shang, Michalis Vazirgiannis

Categories: Computation and Language (cs.CL)

Keywords: large language models, Diffusion large language, refining masked sequences, enabling parallel token, iteratively refining masked

Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. LESS implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-K inter-step Jensen-Shannon divergence. We evaluate LESS on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. LESS improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.
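The three-part eligibility test described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the thresholds (`conf_threshold`, `persistence`, `js_threshold`, `k`), and the choice to renormalize over the current step's top-K tokens before computing the divergence are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log) between two aligned distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def renormalize(dist: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Restrict a distribution to the given token indices and renormalize."""
    vals = [dist[i] for i in indices]
    total = sum(vals)
    if total == 0:
        return [1.0 / len(vals)] * len(vals)
    return [v / total for v in vals]


def argmax(dist: Sequence[float]) -> int:
    return max(range(len(dist)), key=dist.__getitem__)


def is_eligible(
    history: Sequence[Sequence[float]],
    conf_threshold: float = 0.9,   # assumed value
    persistence: int = 2,          # assumed number of recent steps
    js_threshold: float = 0.01,    # assumed value
    k: int = 3,                    # assumed top-K size
) -> bool:
    """Decide whether one masked position may be unmasked.

    history holds the predicted distribution for this position at each
    reverse step so far, oldest first.
    """
    if len(history) < max(persistence, 2):
        return False
    current = history[-1]
    top1 = argmax(current)
    # 1) Confidence: the top-1 probability must be high.
    if current[top1] < conf_threshold:
        return False
    # 2) Persistence: the same top-1 token over the recent steps.
    if any(argmax(d) != top1 for d in history[-persistence:]):
        return False
    # 3) Distributional stability: small JS divergence between the last
    #    two steps, measured on the current step's top-K tokens.
    top_k = sorted(range(len(current)), key=current.__getitem__, reverse=True)[:k]
    prev_k = renormalize(history[-2], top_k)
    curr_k = renormalize(current, top_k)
    return js_divergence(prev_k, curr_k) <= js_threshold


if __name__ == "__main__":
    stable = [[0.92, 0.05, 0.03], [0.93, 0.04, 0.03]]
    flipped = [[0.40, 0.50, 0.10], [0.95, 0.03, 0.02]]
    print(is_eligible(stable), is_eligible(flipped))
```

A position that fails any one of the three checks simply stays masked for another reverse step, which is how a rule of this kind avoids committing unstable tokens early.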

11. 【2606.16905】Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

Link: https://arxiv.org/abs/2606.16905

Authors: Mingyang Li, Yurou Liu, Jieping Ye, Bing Su, Ji-Rong Wen, Zheng Wang

Categories: Computation and Language (cs.CL)

Keywords: single autoregressive framework, autoregressive framework based, scientific generative language, Generative Objects, unifies heterogeneous tasks

Abstract:In this report, we present LOGOS (Language Of Generative Objects in Science), a scientific generative language model that unifies heterogeneous tasks across the natural sciences within a single autoregressive framework based on a shared scientific grammar. It encodes diverse scientific objects and their spatial interactions as token sequences over a common vocabulary. By representing spatial contact and constraint patterns as discrete tokens, the model captures complex structural interactions in a purely sequential manner, without relying on explicit coordinates or geometric neural networks. This unified representation enables a wide range of downstream tasks to be formulated consistently as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks, LOGOS consistently matches or outperforms domain-specific baselines, providing preliminary evidence for the feasibility of "one model fits all" in the natural sciences. We train LOGOS models at different scales (1B, 3B, and 8B parameters) and find a consistent positive correlation between model size and performance. This suggests that the future of AI for Science (AI4S) may not lie in building an independent technical stack that is separated from large language models (LLMs). Instead, it may depend on deeply aligning scientific foundation models with LLMs through shared architectures, shared training paradigms, and shared inference infrastructure, so that LLMs can truly become a new entry point for AI4S. We release the model weights and associated resources to facilitate further research.

12. 【2606.16897】Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures

Link: https://arxiv.org/abs/2606.16897

Authors: Xueping Gao

Categories: Computation and Language (cs.CL)

Keywords: LLM architectures encode, architectures encode high-level, LLM architectures, encode high-level concepts, architectures encode

Abstract: Do different LLM architectures encode high-level concepts in structurally compatible ways? We systematically characterize a geometric-functional universality dissociation: across multiple concept domains and architectural families, moderate geometric convergence coexists with near-perfect functional transfer. Using contrastive-difference CKA (CKA_Delta), a training-free diagnostic that computes kernel alignment on per-sample contrastive differences, we isolate concept-specific convergence from generic similarity -- achieving significant discrimination where standard CKA cannot. The dissociation replicates across all six concept domains we test (five with p = 0.017 geometric discrimination and safety as a converging-functional trend, p = 0.08), including two non-instruction concepts (code-vs-NL, reasoning-vs-recall) validated without system prompts; a single 70B--70B pair provides an observational note that universality may strengthen with scale, requiring replication with additional ≥70B models. We position CKA_Delta as a practical regime classifier and architectural outlier detector (Gemma: d = 1.08, AUC = 0.79) rather than an absolute transfer-accuracy predictor, providing a training-free diagnostic for cross-architecture concept monitoring.
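The idea of computing kernel alignment on per-sample contrastive differences can be sketched as follows. This is a rough illustration only: it assumes a linear kernel with column centering, and the function names and the positive/negative activation inputs are hypothetical. The paper's actual kernel, layer selection, and preprocessing are not specified in the abstract.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def center_columns(rows: Sequence[Sequence[float]]) -> Matrix:
    """Subtract each feature's mean across samples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def gram(rows: Sequence[Sequence[float]]) -> Matrix:
    """Sample-by-sample linear Gram matrix."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x: Sequence[Sequence[float]], y: Sequence[Sequence[float]]) -> float:
    """Linear CKA between two sets of features for the same n samples."""
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    denom = math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))
    return frobenius_inner(k, l) / denom if denom > 0 else 0.0


def contrastive_differences(
    pos: Sequence[Sequence[float]], neg: Sequence[Sequence[float]]
) -> Matrix:
    """Per-sample difference between concept-present and concept-absent activations."""
    return [[p - q for p, q in zip(rp, rn)] for rp, rn in zip(pos, neg)]


def cka_delta(pos_a, neg_a, pos_b, neg_b) -> float:
    """CKA computed on contrastive differences from model A and model B."""
    return linear_cka(
        contrastive_differences(pos_a, neg_a),
        contrastive_differences(pos_b, neg_b),
    )


if __name__ == "__main__":
    pos_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
    neg_a = [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [0.0, 1.0]]
    # Model B is a rotated and rescaled copy of model A, so the score should be ~1.
    rotate = lambda v: [-2.0 * v[1], 2.0 * v[0]]
    pos_b = [rotate(v) for v in pos_a]
    neg_b = [rotate(v) for v in neg_a]
    print(cka_delta(pos_a, neg_a, pos_b, neg_b))
```

Because the two models may have different hidden sizes, the comparison goes through sample-by-sample Gram matrices, which only requires that both models see the same contrastive pairs.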

13. 【2606.16893】Symbolic Informalization: Fluent, Productive, Multilingual

Link: https://arxiv.org/abs/2606.16893

Authors: Aarne Ranta

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Symbolic informalization enables, Symbolic informalization, enables a reliable, reliable conversion, informalization enables

Abstract:Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

14. 【2606.16890】Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

Link: https://arxiv.org/abs/2606.16890

Authors: Sanjay Basu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Aggregate accuracy benchmarks, electronic health record, inferential steps produce, steps produce disproportionately, language models fail

Comments: 20 pages, 5 figures. Code: [this https URL](https://github.com/sanjaybasu/compositional-depth-clinical-ehr)

Abstract: Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

15. 【2606.16874】Understanding Scam Trends and Rail Paths from Reddit Self-Disclosure Narratives

Link: https://arxiv.org/abs/2606.16874

Authors: Yangjun Zhang, Mirko Bottarelli, Mark Hooper, Carsten Maple

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)

Keywords: lifecycle includes temporally, includes temporally ordered, Online scam behavior, temporally ordered rails, Online scam

Comments: 6 pages, International Conference on AI and the Digital Economy (CADE) 2026

Abstract:Online scam behavior is inherently multi-stage, and the lifecycle includes temporally ordered rails and events rather than isolated signals. Existing works analyze characteristics of scam types and rails, but they do not track scam trends across years. Moreover, the work on the relations between rails is hampered due to the lack of open-source datasets with annotations and coverage of different scam types. To address these gaps, we build a dataset to analyze the yearly trend of scam characteristics and rail paths using Reddit self-disclosure narratives from 2023 to 2025. We collect 21,304 posts from scam-related subreddits with at least one rail among identity, communication, platform, and payment for trend analysis by heuristic annotation. Then, we label 1,800 posts containing explicit or recoverable scam chains by an LLM-assisted method for scam path analysis. The method is evaluated with human annotation. Lastly, we run a topic model on the comments of the posts to analyze the community support behavior. The results reveal that scam processes are predominantly multi-rail. Across years, different scam types and rail components dominate. Different scam types vary systematically in path complexity. Reddit support behaviors have become more detailed over time. This work supports synthetic scam chain data simulation and AI-related scam risk assessment, though findings may not generalise to other platforms.

16. 【2606.16867】Revisiting the Systematicity in Negation in the Era of In-Context Learning

Link: https://arxiv.org/abs/2606.16867

Authors: Hitomi Yanaka, Taisei Yamamoto

Categories: Computation and Language (cs.CL)

Keywords: large language models, negated sentences remains, large language, language models, meaning of negated

Comments: Accepted to the 6th Workshop Natural Language Meets Logic and Machine Learning (NALOMA2026) at ESSLLI2026

Abstract:Understanding the meaning of negated sentences remains one of the challenges for language models, even in the era of large language models (LLMs). We analyze systematicity regarding LLM understanding of negation from two perspectives: behavioral systematicity and representational systematicity. For behavioral systematicity, we confirm that through demonstrations and in-context learning, LLMs can recognize negation expressions and scope within sentences to some extent, but they fail to achieve perfect performance. In particular, the difficulty of the negation scope recognition for models varies depending on the output format. For representational systematicity, we analyze the extent to which function vectors can be robustly constructed from in-context examples for tasks that are essential to understanding negation. The experiments suggest that while function vectors can be composed for negation cue extraction tasks, extracting function vectors for recognizing scope is more challenging.

17. 【2606.16847】Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

Link: https://arxiv.org/abs/2606.16847

Authors: Yizhen Yao, Qinglin Zhu, Runcong Zhao, Xiangxiang Dai, Yanzheng Xiang, Yulan He, Lin Gui

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Diffusion Large Language, Diffusion Large, Language Models, Large Language

Abstract: Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: Error Propagation, where new tokens absorb toxic information from erroneous context, and Local Error Reinforcement, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted Anchor Tokens, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4% while accelerating inference throughput by up to 7.2x.

18. 【2606.16845】Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

Link: https://arxiv.org/abs/2606.16845

Authors: Ankit Bhattacharjee,Krityapriya Bhaumik

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, literal semantic interpretations, Large Language, Language Models, making zero-shot irony

Comments: 11 pages total, 10 figures

Abstract:Large Language Models (LLMs) natively default to literal semantic interpretations, making zero-shot irony detection a persistent challenge. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set (N=734), RDS achieves 78.1% accuracy and a Macro F1 of 0.777, matching the absolute performance ceiling of the fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters 22.5% of out-of-distribution hallucinations, yielding a zero-shot Macro F1 of 0.6726 and Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles. A statistical ablation confirms this structural synergy: adding the symbolic prior to the neural baseline yields no significant gain (p = 0.242), and the marginal benefit of adding the CoT pipeline to that prior is heavily compressed (p = 0.149). Only the complete, concurrent fusion of all three signals achieves a statistically validated improvement over the baseline (p = 0.005).

19. 【2606.16843】Data-Driven Decoding of Russell's Circumplex Model of Affect

Link: https://arxiv.org/abs/2606.16843

Authors: Amdjed Belaref,Samir Sadok,Zineb Noumir,Renaud Seguier

Categories: Computation and Language (cs.CL)

Keywords: Affective computing increasingly, high-dimensional black boxes, computing increasingly relies, Affective computing, remain opaque

Comments: This work has been submitted to the IEEE for possible publication

Abstract:Affective computing increasingly relies on deep learning to represent emotions, yet latent spaces often remain opaque, high-dimensional black boxes. This paper investigates whether Transformers' embeddings recover the geometric regularities of Russell's circumplex model. We unify two complementary experiments testing the hypothesis that, after training models on text and speech, their resulting latent spaces encode a topology consistent with valence-arousal and reproduce human-like neighborhood relations. Specifically, we evaluate deep representations extracted from Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, along with a multimodal Transformer fusion architecture, across naturalistic datasets like MSP-Podcast and controlled LLM-generated stimuli. Our analysis reveals that multimodal fusion of text and audio yields perfect topological alignment with Russell's primary emotion ordering. Furthermore, in a zero-shot setting using generic text embeddings, projected fine-grained emotion terms fall close to their established human-mapped coordinates. Our contribution is a novel, data-driven framework for validating emotion models, demonstrating that Russell's circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, thereby bridging the gap between psychological theory and representation learning.

20. 【2606.16836】Does Traversal Order Matter? A Systematic Study of Tree Traversal Methods in Transformer Grammars

Link: https://arxiv.org/abs/2606.16836

Authors: Zongru Liu,Pengyu Ji,Pengcheng Wang,Kewei Tu

Categories: Computation and Language (cs.CL)

Keywords: syntactic tree structures, Transformer Grammars, incorporating syntactic tree, enhance language modeling, task-aware Transformer Grammars

Comments:

Abstract:Transformer Grammars (TGs) enhance language modeling by incorporating syntactic tree structures. Despite the potentially significant impact on model performance of how syntactic trees are linearized in TGs, existing studies rely solely on Depth-First Traversal (DFT) for linearization. In this paper, we expand the traversal design space by exploring Breadth-First Traversal (BFT) and a novel hybrid traversal strategy, Production-Rule Traversal (PRT), which combines the structural lookahead of BFT with the early lexical generation of DFT. We integrate these traversal methods with varying tree configurations and masking strategies, and empirically evaluate their performance on language modeling, syntactic generalization and summarization. We reveal the inherent trade-offs between nested composition and global lookahead, providing actionable recommendations for designing task-aware Transformer Grammars.

21. 【2606.16825】Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

Link: https://arxiv.org/abs/2606.16825

Authors: Martin Jaggi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, efficiently scale Large, scale Large Language, architectures efficiently scale, full parameter count

Comments: Code available at [this https URL](https://github.com/epfml/looped-moe)

Abstract:Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count - dominated by the expert parameters - must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.
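
To make the memory argument concrete, the following is a rough parameter-count sketch, not the authors' code. It assumes experts are tied in groups of two consecutive layers, and every number in it (layer count, expert count, expert size, non-expert size) is hypothetical. It illustrates why sharing expert weights while keeping routing and attention per-layer gives a reduction of "almost 2x" rather than exactly 2x.

```python
def total_params(num_layers, experts_per_layer, params_per_expert,
                 non_expert_params, tie_group=1):
    """Total parameters when each run of `tie_group` consecutive layers shares one expert set."""
    # Ceiling division: number of distinct expert sets actually stored.
    distinct_expert_sets = -(-num_layers // tie_group)
    expert_params = distinct_expert_sets * experts_per_layer * params_per_expert
    # Attention, routers and embeddings stay per-layer, so they are not reduced.
    return expert_params + non_expert_params


# Hypothetical configuration, not taken from the paper.
NUM_LAYERS = 16
EXPERTS_PER_LAYER = 64
PARAMS_PER_EXPERT = 3 * 2048 * 1024   # three weight matrices of a gated FFN
NON_EXPERT_PARAMS = 300_000_000       # attention, routers, embeddings

untied = total_params(NUM_LAYERS, EXPERTS_PER_LAYER, PARAMS_PER_EXPERT,
                      NON_EXPERT_PARAMS)
tied = total_params(NUM_LAYERS, EXPERTS_PER_LAYER, PARAMS_PER_EXPERT,
                    NON_EXPERT_PARAMS, tie_group=2)
reduction = untied / tied
print(f"untied: {untied:,}  tied: {tied:,}  reduction: {reduction:.2f}x")
```

Because the untied non-expert parameters remain in both totals, the ratio stays below 2 and approaches it only as the expert parameters dominate the count.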

22. 【2606.16821】How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation

Link: https://arxiv.org/abs/2606.16821

Authors: Yimeng Chen,Zhe Ren,Firas Laakom,Yu Li,Dandan Guo,Jürgen Schmidhuber

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Keywords: Large language model, agents synthesize open-web, Large language, synthesize open-web content, behalf of users

Comments: 23 pages, 3 figures

Abstract:Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode attack taxonomy, and multiple output-level metrics. We evaluate 13 LLM backends on 308 cases each. Results show that vulnerability patterns vary across backends: overall attack success rate (ASR) ranges from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, the strongest attack mode differs by model family, and the same deployment scaffold could amplify or decrease ASR on different backends. An auxiliary agent-skill probe, where endorsement becomes an install command, exposes a sharp split among otherwise robust backends: Claude over-rejects while GPT over-trusts. These findings argue for treating recommendation reliability under adversarial search content as a first-class dimension of backend safety evaluation.

23. 【2606.16817】Understanding the Behaviors of Environment-aware Information Retrieval

Link: https://arxiv.org/abs/2606.16817

Authors: Ruifeng Yuan,Chaohao Yuan,David Dai,Yu Rong,Hong Cheng,Hou Pong Chan,Chenghao Xiao

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Recent retrieval-augmented generation, demonstrated strong capability, current research overlooks, Recent retrieval-augmented, query formulation strategies

Comments: ACL 2026 Main

Abstract:Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting that strategies learned for one retriever are ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at this https URL.

24. 【2606.16811】Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

Link: https://arxiv.org/abs/2606.16811

Authors: Keizo Kato,Chenhui Chu,Yugo Murawaki,Sado Kurohashi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: shown remarkable progress, Large language models, generating pseudo intermediate, recent approaches, remarkable progress

Comments: LREC 2026. Section 3.3 is updated

Abstract:For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.
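
The entropy-based filtering step described in this abstract can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: it treats the verifier as a binary classifier that outputs a probability `p_valid`, and the function names, the 0.5-bit threshold, and the rule of keeping only traces judged valid are all hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_pseudo_labels(traces, max_entropy=0.5):
    """Keep traces that the verifier confidently judges to be valid.

    traces: list of (trace_text, p_valid) pairs, where p_valid is the
    classifier's probability that the reasoning trace is correct.
    max_entropy: hypothetical confidence threshold in bits.
    """
    kept = []
    for text, p_valid in traces:
        confident = binary_entropy(p_valid) <= max_entropy
        if confident and p_valid >= 0.5:
            kept.append(text)
    return kept


# Toy verifier scores (made up for illustration).
scored = [("trace A", 0.97), ("trace B", 0.55), ("trace C", 0.04), ("trace D", 0.90)]
selected = select_pseudo_labels(scored)
print(selected)
```

In this toy run, trace B is dropped because the verifier is uncertain, and trace C is dropped because the verifier is confident it is invalid; only the remaining traces would be used for fine-tuning.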

25. 【2606.16807】Connecting Speech to Words through Images

Link: https://arxiv.org/abs/2606.16807

Authors: Gabriel Pirlogeanu,Dan Oneata,Horia Cucu,Herman Kamper

Categories: Computation and Language (cs.CL)

Keywords: explicit textual supervision, learn the mapping, absence of explicit, explicit textual, written words

Comments: Accepted at EUSIPCO 2026 - 5 pages, 3 figures, 2 tables

Abstract:How can we learn the mapping between written words and their spoken counterparts in the absence of explicit textual supervision? We present a visually grounded method for building a vocabulary of spoken words using only images and their spoken descriptions. First, image captioning systems are used to build a vocabulary of written words representing salient visual concepts in the images. For each word, we then find utterances whose image captions contain that word. Then we use an unsupervised word discovery technique to align these utterances to locate instances of the target word. The result is spoken word segments that are linked to written words -- all accomplished without any text supervision. In spoken word retrieval and keyword spotting experiments, the proposed approach outperforms a strong neural baseline while being more interpretable. These results demonstrate the feasibility of the approach in English and motivate future work on low-resource languages without transcripts.

26. 【2606.16806】LLM-based Visual Code Completion for Aerospace Geometric Design

Link: https://arxiv.org/abs/2606.16806

Authors: Hau Kit Yong,Robert Marsh,Edmar A. Silva,András Sóbester,Stuart E. Middleton

Categories: Computation and Language (cs.CL)

Keywords: Vision Language Models, Original Equipment Manufacturers, Large Language Models, Language Models, aerospace Original Equipment

Comments:

Abstract:Recent advances in both Large Language Models (LLMs) and Vision Language Models (VLMs) have seen a step change in their ability to perform visual code completion, but the aerospace industry, which prioritizes safety and explainability over rapid LLM adoption, currently has no publicly announced LLM-based geometric design copilot systems in commercial use by aerospace Original Equipment Manufacturers (OEMs). This paper presents an LLM-based visual programming copilot application for aerospace engineering design tasks, using a visual programming variant of the ReAct methodology and GPT 5.4. In addition to the copilot, we describe Wingbuilder, a new Grasshopper plugin library with custom components for aerospace-specific geometry abstraction, and an associated Aerospace Visual Programming Dataset (AVPD) with 18 aerospace expert designed tasks at different levels of difficulty alongside ground truth solutions. We evaluate our copilot application with a user trial involving two experienced aerospace engineers from a large aircraft manufacturing company. We find our copilot visual programming ReAct methodology was successful in generating suggestions that participants found helpful, but slow ReAct inference times limit its usefulness to more complex time-consuming tasks where waiting for a good copilot solution suggestion was worthwhile. Participants reported they liked the tool and would be willing to use it in the future.

27. 【2606.16801】The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models

Link: https://arxiv.org/abs/2606.16801

Authors: Chen Chen,Xiang Gao,Xianshun Wang,Chengran Li,Shengyu Xia,Xueluan Gong,Linru Zhang,Qian Wang,Kwok-Yan Lam

Categories: Computation and Language (cs.CL)

Keywords: train Large Language, Large Language Models, Large Language, offloading computation-intensive layers, train Large

Comments: 19 pages, 5 figures

Abstract:Split learning provides a practical paradigm for resource-constrained users to train Large Language Models (LLMs) by offloading computation-intensive layers to a server while keeping raw data local. However, existing privacy-preserving split learning methods still face a difficult trade-off among utility, privacy, efficiency, and stability. Specifically, these methods often suffer from substantial utility degradation, remain vulnerable to advanced data reconstruction attacks, incur prohibitive computational and communication overhead, or exhibit unstable performance across different tasks. In this paper, we propose MIXGUARD, a novel mixup-based privacy-preserving split learning framework for LLMs. MIXGUARD introduces token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation mechanisms, which operate jointly to preserve useful learning signals while preventing privacy leakage to the server. Technically, MIXGUARD first constructs a lightweight calibration model on a public dataset to refine the approximated target representation, and then applies this model during privacy-preserving fine-tuning on private data. We conduct extensive experiments on four classification tasks and four text generation tasks across multiple LLM families, model sizes, architectures, and fine-tuning strategies. The results show that MIXGUARD preserves model utility comparable to non-split training baselines, consistently achieves stronger privacy protection than existing split learning defense methods against state-of-the-art data reconstruction attacks, and remains robust under adaptive attack settings.

28. 【2606.16774】OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

Link: https://arxiv.org/abs/2606.16774

Authors: Tianyi Lin,Chuanyu Sun,Jingyi Zhang,Changxu Wei,Huanjin Yao,Shunyu Liu,Xikun Zhang,Liu Liu,Jiaxing Huang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Equipping Large Language, Large Language Model, Equipping Large, Large Language, solving complex tasks

Comments: 13 pages, 2 figures

Abstract:Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs a structured, diverse and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive set of skill trees along with skill-augmented training data, enabling models to effectively learn and utilize skills. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use and generalization over challenging benchmarks.

29. 【2606.16753】P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs

Link: https://arxiv.org/abs/2606.16753

Authors: Rafael Ferreira,Inês Vieira,Inês Calvo,James Furtado,Iago Paulo,Diogo Tavares,Diogo Glória-Silva,David Semedo,João Magalhães

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: capturing regional linguistic, Large Language Models, regional linguistic variation, everyday communication, capturing regional

Comments: Accepted at MeLLM Workshop at ACL 2026

Abstract:As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated language variety agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

30. 【2606.16748】MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Link: https://arxiv.org/abs/2606.16748

Authors: Lawrence Keunho Jang,Andrew Keunwoo Jang,Jing Yu Koh,Ruslan Salakhutdinov

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Current benchmarks, computer-use agents evaluate, Current, agents evaluate models, impersonal environments

Comments:

Abstract:Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at this https URL.

31. 【2606.16710】Misinformation Propagation in Benign Multi-Agent Systems

Link: https://arxiv.org/abs/2606.16710

Authors: Jonas Becker,Jan Philip Wahle,Terry Ruas,Bela Gipp

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: multiple large language, legal analysis, large language model, agents solve problems, medical diagnosis

Comments: 20 pages, 8 figures, 1 table

Abstract:Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.

32. 【2606.16700】Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

Link: https://arxiv.org/abs/2606.16700

Authors: Yanming Zhang,Yihan Bian,Jingyuan Qi,Yuguang Yao,Lifu Huang,Tianyi Zhou

Categories: Computation and Language (cs.CL)

Keywords: Mask Diffusion Models, fully sequential generation, relies on fully, fully sequential, Diffusion Models

Comments: 22 pages, 6 figures, 5 tables

Abstract:While reasoning on autoregressive (AR) models is often performed by chain-of-thought reasoning and reflection, their refinement of previous outputs still relies on fully sequential generation, even when only local edits are needed. In contrast, the masking mechanism in Mask Diffusion Models (MDMs) naturally supports explicit local edits on previous outputs, allowing selective refinement without discarding previous answers and generating another from scratch. While this property more closely aligns with how humans correct mistakes by iterative local refinement, existing MDMs do not support multi-turn masking and denoising. We propose Reflective Masking (RM), which elicits such an intrinsic reasoning capability in MDMs via lightweight post-training. RM provides a native test-time scaling, where an MDM iteratively revisits and revises its prior outputs based on evolving context. To exploit insights from previous turns like AR reasoning, we further introduce History Reference, a parameter-free mechanism that leverages intermediate denoising states during revision. Our approach requires no architectural changes and is easily applicable to existing MDMs. Across diverse tasks and modalities, including text generation, Sudoku, and image editing, Reflective Masking consistently outperforms standard masking-based baselines and demonstrates strong generality, positioning RM as a fundamental primitive for reasoning on MDMs.

33. 【2606.16687】From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text

Link: https://arxiv.org/abs/2606.16687

Authors: Sadia Noor,Seemab Latif,Raja Khurram Shahzad,Mehwish Fatima

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Modeling dimensional affect, Modeling dimensional, text requires distinguishing, requires distinguishing current, affective change

Comments:

Abstract:Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait-State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and r=0.658 for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

34. 【2606.16684】Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

Link: https://arxiv.org/abs/2606.16684

Authors: Jinghan Wang,Gaoliang Peng,Yanjun Chen,Wei Zhang,Wentao Wu,Tianchen Liu

Categories: Computation and Language (cs.CL)

Keywords: diagnosis requires resolving, transient signal fidelity, local transient signal, Vibration-based bearing fault, underlying fault physics

Comments:

Abstract:Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.

35. 【2606.16682】Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

Link: https://arxiv.org/abs/2606.16682

Authors: Zewen Liu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: systematic biases emerge, feedback loop, systematic biases, biases emerge, agents use language

Comments: 19 pages, 0 figures

Abstract:When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across four evaluator configurations (N=53 total independent repetitions, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_{T-V}=1.176, gamma_{V-T}=1.089, Delta=-0.088, p=0.575, Cohen's d=0.29); high round counts (DashScope, 50 rounds) cause collapse to single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity -- 97% of runs (N=30, DeepSeek-chat) yield exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.

36. 【2606.16661】SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG

Link: https://arxiv.org/abs/2606.16661

Authors: Nathanaël Langlois

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Fixed-length chunking, boundary fragmentation, split across segments, degrading retrieval recall, chunking in Retrieval-Augmented

Comments: 5 pages, 1 figure

Abstract:Fixed-length chunking in Retrieval-Augmented Generation (RAG) often leads to boundary fragmentation, where critical evidence is split across segments, degrading retrieval recall. While static windowing and parent retrieval improve recall, they introduce significant token overhead. We propose SCAR (Semantic Continuity-Aware Retrieval), an adaptive retrieval policy that selectively expands neighboring chunks by weighing query-neighbor relevance against a structural continuity penalty. SCAR uses a relative expansion threshold tied to each retrieved chunk's own query-relevance, yielding an approximately scale-invariant decision rule that transfers across embedding models without recalibration. Across four diverse corpora (RFC, GDPR, a 10-K report, and a Merger agreement; N=320 queries; 160 boundary-fragmented), SCAR achieves 92.8% recall on boundary-fragmented queries with only 7.84 chunks, a 22.9% reduction compared to static windowing (10.16 chunks). Paired bootstrap tests (B=10,000) confirm the chunk reduction is highly significant (p<0.0001, Cohen's d=-1.49, large effect), with a small recall difference (Cohen's d=-0.33). The policy transfers across three embedding models (text-embedding-3-large, BGE-large-en-v1.5, zembed-1) using the same single hyperparameter setting, and downstream RAGAS evaluation on the 10-K corpus confirms SCAR preserves generation faithfulness while reducing context tokens by 27.1%.
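The abstract does not give SCAR's actual scoring formula, so the following is only a minimal sketch of what a relative expansion threshold could look like. The names `Neighbor`, `should_expand`, `rel_threshold`, and `penalty_weight`, as well as the multiplicative form of the continuity penalty, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    chunk_id: int
    query_sim: float   # similarity between the query and this neighboring chunk
    continuity: float  # 0..1, how strongly the neighbor continues the retrieved chunk


def should_expand(anchor_sim: float, nb: Neighbor,
                  rel_threshold: float = 0.8,
                  penalty_weight: float = 0.2) -> bool:
    """Hypothetical rule: expand when the penalized neighbor score reaches
    a fixed fraction of the retrieved (anchor) chunk's own query similarity."""
    # Assumed multiplicative penalty: low continuity shrinks the neighbor's score.
    penalized = nb.query_sim * (1.0 - penalty_weight * (1.0 - nb.continuity))
    return penalized >= rel_threshold * anchor_sim


def expand(anchor_id: int, anchor_sim: float, neighbors: list[Neighbor]) -> list[int]:
    """Return the anchor chunk plus every neighbor that passes the rule."""
    kept = [anchor_id]
    kept += [nb.chunk_id for nb in neighbors if should_expand(anchor_sim, nb)]
    return sorted(kept)


# Chunk 9 continues the anchor and is kept; chunk 11 does not and is dropped.
selected = expand(10, 0.8, [Neighbor(9, 0.7, 1.0), Neighbor(11, 0.7, 0.0)])
```

Because the threshold in this sketch is a fraction of the anchor's own similarity, multiplying all similarities by the same factor leaves the decision unchanged. That is one way a rule could carry over between embedding models whose similarity scores sit on different scales.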

37. 【2606.16659】FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

Link: https://arxiv.org/abs/2606.16659

Authors: Y. H. Zhou, Z. M. Ma, Y. J. Zhou, Y. T. Li, H. X. Xiang, Y. M. Cheng, T. L. Chen, K. J. Zhang, Z. H. Nan, J. H. Ni, Z. Wu, Q. Y. Pan, S. Zhang, S. Cheng, M. Y. Luo

Categories: Computation and Language (cs.CL)

Keywords: requested user action, final risk depends, SMS claim aligns, user action, requested user

Abstract:SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to rely on reputation shortcuts. To address this gap, we introduce FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. FraudSMSWalker contains 699 bilingual chains, including 332 fraudulent and 367 benign cases, across ten service scenarios. The model-visible input consists of the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark further includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible under the service context but also appear in scam flows. We evaluate nine web agents under masked browser-agent protocols and conduct URL-visibility ablations. The results show that current agents can detect suspicious cues, but struggle to preserve benign recall and often produce positive predictions that are weakly supported by the observed evidence. These findings position FraudSMSWalker as a benchmark for measuring whether web agents can make fraud judgments that remain both accurate and evidence-grounded when direct reputation shortcuts are suppressed. The associated code and dataset are accessible at the anonymous link (this https URL).

38. 【2606.16629】Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

Link: https://arxiv.org/abs/2606.16629

Authors: Mohammed Amine Mouhoub

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, knowledge-intensive question answering, Islamic, including religious

Abstract:Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.

39. 【2606.16617】Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

Link: https://arxiv.org/abs/2606.16617

Authors: Ferdinand M. Schessl

Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

Keywords: boundaries remains low, construct boundaries remains, Sycophancy in LLMs, remains low, LLMs is documented

Comments: 12 pages, 3 figures. Code, data, and pre-registrations: https://github.com/FerdinandSchessl/sycophancy-note-companion

Abstract:Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$) -- a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.

40. 【2606.16603】VeriGraph: Towards Verifiable Data-Analytic Agents

Link: https://arxiv.org/abs/2606.16603

Authors: Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: data-intensive analytical tasks, demonstrated strong capabilities, linear text trajectories, text trajectories makes, analytical tasks

Comments: 10 pages

Abstract:LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at this https URL.
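The reduction of structural traceability to graph reachability can be illustrated with a plain breadth-first search. This is a sketch of the general idea only, not VeriGraph's implementation: the function names, the node labels, and the edge-list representation are all hypothetical.

```python
from collections import deque


def reachable_from(sources, edges):
    """Return every node reachable from any source along directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def untraceable_claims(raw_sources, claims, edges):
    """List claims that no path connects back to a raw data source."""
    grounded = reachable_from(raw_sources, edges)
    return [claim for claim in claims if claim not in grounded]


# Hypothetical evidence graph: raw file -> interpreter variable -> result -> claim.
edges = [
    ("sales.csv", "df"),
    ("df", "mean_q3"),
    ("mean_q3", "claim_a"),
]
missing = untraceable_claims(["sales.csv"], ["claim_a", "claim_b"], edges)
```

Here `claim_a` is traceable because a path leads to it from `sales.csv`, while `claim_b` has no incoming evidence and is reported as untraceable. Reachability checks structure only, so whether the evidence actually supports a claim would still need the separate claim-level evaluation the abstract mentions.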

41. 【2606.16596】How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

Link: https://arxiv.org/abs/2606.16596

Authors: Wafaa Mohammed, Kata Naszadi, Vlad Niculae

Categories: Computation and Language (cs.CL)

Keywords: Existing machine translation, Existing machine, discourse-focused evaluations primarily, evaluations primarily assess, primarily assess translation

Abstract:Existing machine translation (MT) metrics and discourse-focused evaluations primarily assess translation quality intrinsically, without measuring the downstream consequences of translation errors. In this work, we focus on extrinsic discourse evaluation of machine translation under two distinct regimes: static and interactive. Under the static regime, we propose an entity counting task as a probe of referential consistency in discourse. We show that high intrinsic MT quality does not reliably predict downstream discourse success and strong MT systems still produce referential inconsistencies. For the interactive regime, we study the goal-oriented multi-agent Welfare Diplomacy game as a probe of long-horizon communication and coordination. We find that interaction-specific translation failures impact downstream coordination. Our results highlight goal-oriented environments as a viable framework for discourse-sensitive extrinsic MT evaluation.

42. 【2606.16591】SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

Link: https://arxiv.org/abs/2606.16591

Authors: Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

Categories: Computation and Language (cs.CL)

Keywords: Large language model, realistic digital environments, agents increasingly rely, Large language, language model

Abstract:Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.

43. 【2606.16583】Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?

Link: https://arxiv.org/abs/2606.16583

Authors: Arnisa Fazla, Alberto Testoni, Ameen Abu-Hanna, Barbara Plank, Iacer Calixto

Categories: Computation and Language (cs.CL)

Keywords: reliable uncertainty estimation, requires reliable uncertainty, requires reliable, trusted or escalated, clinical vision-language models

Comments: 17 pages, 4 figures

Abstract:Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when predictions should be trusted or escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where the model performance is weakest, and therefore where reliability is most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet, we find that uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.

44. 【2606.16576】Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

Link: https://arxiv.org/abs/2606.16576

Authors: Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky

Categories: Computation and Language (cs.CL)

Keywords: propose agentic automata, agentic automata learning, tool-calling LLM agents, uncover hidden environments, propose agentic

Abstract:We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
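The two query types described in the abstract can be sketched as a small oracle around a hidden DFA. This is an illustrative reconstruction rather than the authors' testbed: the class and method names are hypothetical, DFAs are assumed to be complete over a shared alphabet of single-character symbols, and the equivalence check returns a shortest counterexample via breadth-first search over the product automaton.

```python
from collections import deque


class DFA:
    def __init__(self, alphabet, transitions, start, accepting):
        self.alphabet = alphabet        # iterable of single-character symbols
        self.transitions = transitions  # dict: (state, symbol) -> state
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Answers membership and equivalence queries about a hidden target DFA."""

    def __init__(self, target):
        self.target = target
        self.membership_calls = 0
        self.equivalence_calls = 0

    def membership(self, word):
        self.membership_calls += 1
        return self.target.accepts(word)

    def equivalence(self, hypothesis):
        """Return None if equivalent, otherwise a shortest counterexample."""
        self.equivalence_calls += 1
        t, h = self.target, hypothesis
        start = (t.start, h.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (t_state, h_state), word = queue.popleft()
            if (t_state in t.accepting) != (h_state in h.accepting):
                return word
            for symbol in t.alphabet:
                nxt = (t.transitions[(t_state, symbol)],
                       h.transitions[(h_state, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Hidden target: strings over {a, b} with an even number of 'a'.
even_a = DFA("ab", {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, [0])
# A naive hypothesis that accepts every string.
accept_all = DFA("ab", {(0, "a"): 0, (0, "b"): 0}, 0, [0])

oracle = Oracle(even_a)
counterexample = oracle.equivalence(accept_all)
```

The call counters give one simple way to measure interaction efficiency, since an agent and a classic learner can be compared by how many queries each needs before the equivalence check returns `None`.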

45. 【2606.16568】Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

Link: https://arxiv.org/abs/2606.16568

Authors: Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: spoken dialogue systems, Reliable turn-taking, dialogue systems, essential for spoken, spoken dialogue

Abstract:Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

46. 【2606.16560】The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

Link: https://arxiv.org/abs/2606.16560

Authors: Afnan Aloraini, Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

Categories: Computation and Language (cs.CL)

Keywords: Automatic semantic change, semantic change, lexical semantic change, Automatic semantic, semantic change detection

Abstract:Automatic semantic change detection aims to identify how word meanings shift over time, offering insights into both linguistic and societal change. Despite recent progress in computational lexical semantic change (LSC), existing benchmarks and methods struggle to capture bi-directional semantic change, particularly cases where words simultaneously gain and lose senses. This problem is especially challenging for words that have both slang and standard meanings. To address these gaps, we introduce two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these benchmarks, we systematically evaluate models across different methodological families: unsupervised clustering using contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. Among the evaluated systems, the few-shot GPT-4o model achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy; however, Macro-F1 scores near 0.5 across all systems show that rare slang senses remain difficult, which we identify as the central open challenge.

47. 【2606.16545】Can LLM Coding Agents Reason About Time Series?

Link: https://arxiv.org/abs/2606.16545

Authors: Filip Rechtorík, Ondřej Dušek, Zdeněk Kasner

Categories: Computation and Language (cs.CL)

Keywords: Large language models, automated decision-making systems, Large language, systems in finance, environmental monitoring

Comments: 17 pages, 7 figures

Abstract:Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.

48. 【2606.16527】DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

Link: https://arxiv.org/abs/2606.16527

Authors: Xuanyu Yin, Yilin Jiang, Jun Zhou, Kai Chen, Zhengfu Cao, Xiaolei Dong

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: large language models, black-box jailbreak defense, important practical problem, language models, user-facing systems

Comments: 25 pages, 5 figures

Abstract:As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to express and execute it, thereby evading safety alignment while remaining recoverable during generation. Motivated by this observation, we propose DoubtProbe, a dual-branch inference-time defense framework that combines structural verification with semantic auditing and formulates black-box jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation from the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. We evaluate DoubtProbe against representative black-box defenses on jailbreak and benign-request benchmarks, and further test backbone transfer from Qwen2.5-72B to Llama-3.1-70B. Results show that DoubtProbe achieves a stronger and more stable defense-utility trade-off: on Qwen2.5-72B, it reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while maintaining false positive rates of 0.022 and 0.016 on AlpacaEval and OR-Bench; the same pattern remains stable on Llama-3.1-70B. These findings show that structural inconsistency signals provide a practical and generalizable basis for black-box jailbreak defense, especially when combined with semantic auditing.

49. 【2606.16523】SkillWiki: A Living Knowledge Infrastructure for Agent Skills

Link: https://arxiv.org/abs/2606.16523

Authors: Dingcheng Huang, Yuda Ding, Bingshuo Liu, Qingbin Liu, Xi Chen, Jiang Bian, Hongliang Sun, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

Categories: Computation and Language (cs.CL)

Keywords: managed through Wikipedia, Wikipedia and software, software through GitHub, agent skills, Wikipedia

Abstract:While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at this https URL.

50. 【2606.16497】daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

Link: https://arxiv.org/abs/2606.16497

Authors: Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: GPU kernel optimization, kernel optimization represents, GPU kernel, optimization represents, represents a paradigm

Abstract:GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, this http URL-14B.

51. 【2606.16496】REFLEX: Reflective Evolution from LLM Experience

Link: https://arxiv.org/abs/2606.16496

Authors: Pan Wang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large multimodal language, multimodal language models, Large multimodal, multimodal language, emerged as powerful

Abstract:Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a train-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.

52. 【2606.16494】Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

Link: https://arxiv.org/abs/2606.16494

Authors: Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Wikipedia-scale knowledge base, Knowledge-based visual question, vision-language systems answer, Wikipedia-scale knowledge, visual question answering

Comments: 15 pages, 9 figures. Under review at EMNLP 2026

Abstract:Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

53. 【2606.16472】From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

Link: https://arxiv.org/abs/2606.16472

Authors: Che Hyun Lee, Heeseung Kim, Sungroh Yoon

Categories: Computation and Language (cs.CL)

Keywords: spoken dialogue systems, multi-round conversations remains, remains a challenge, multi-round conversations, conversations remains

Comments: Interspeech 2026 Main Track

Abstract:Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
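Contrasting output distributions with and without a piece of context is commonly written as a logit adjustment of the form (1 + alpha) * logits_with - alpha * logits_without. The sketch below shows that generic text-domain form only. The abstract does not specify how the audio-adapted variant selects key rounds or sets the weight, so `alpha`, the function names, and the toy logits are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(with_ctx, without_ctx, alpha=0.5):
    """Amplify what the key context contributes to each token's logit.

    alpha = 0 recovers ordinary decoding with the context present.
    """
    return [(1 + alpha) * a - alpha * b for a, b in zip(with_ctx, without_ctx)]


# Toy vocabulary of three tokens. Token 1 is only supported by the key context.
with_ctx = [2.0, 1.0, 0.0]
without_ctx = [2.0, 0.0, 0.0]

plain = softmax(with_ctx)
adjusted = softmax(contrastive_logits(with_ctx, without_ctx, alpha=0.5))
```

In this toy case the probability of token 1 rises after the adjustment, because it is the only token whose logit changes when the context is removed. Tokens driven mainly by the model's prior score the same in both passes and so lose ground to it.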

54. 【2606.16432】ACCORD: Action-Conditioned Contextual Grounding for Language Agents

Link: https://arxiv.org/abs/2606.16432

Authors: Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: User instructions, underspecified because humans, humans rely, rely on implicit, implicit assumptions

Abstract:User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent's trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

55. 【2606.16429】Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Link: https://arxiv.org/abs/2606.16429

Authors: Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: faster long-context inference, full softmax attention, Hybrid linear attention, linear attention models, attention models offer

Notes: 24 pages, 9 figures

Abstract:Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.

56. 【2606.16428】LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Link: https://arxiv.org/abs/2606.16428

Authors: Jaward Sesay, Yue Yu, Siwei Dong, Yemin Shi, Guangyao Chen, Börje F. Karlsson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Effective personalized AI-assisted, accurate learner-specific educational, AI-assisted learning demands, learning demands systems, generate accurate learner-specific

Notes: none

Abstract:Effective personalized AI-assisted learning demands systems that can not only generate accurate learner-specific educational materials, but also dynamically adapt their instruction to diverse learners. However, existing educational agents have primarily focused on lecture content automation and simulations, which often fall short of modelling multimodal and embodied instructional methods tailored for the individual learner. To this end, we propose LectūraAgents - a multi-agent framework that enables personalized learning through end-to-end adaptive embodied teaching. At its core, LectūraAgents mirrors a professor-student relationship, in which a ProfessorAgent leads a collaborative team of specialized subordinate agents through research, planning, review, and embodied delivery of lecture contents that adapt to a learner's needs. The framework offers three main contributions: (1) a hierarchical multi-agent architecture for end-to-end personalized learning; (2) an adaptive embodied teaching mechanism, wherein the ProfessorAgent executes visible and pedagogically motivated teaching actions (e.g., handwrite, highlight, underline, etc.) over contents in a teaching environment; and (3) a Teaching Action-Speech Alignment (TASA) algorithm that employs salience-based heuristics and temporal semantic segmentation to generate coherent teaching action sequences aligned with learner profiles. We evaluate LectūraAgents on diverse courses at high school, undergraduate, and graduate levels using sample-specific rubric-based analysis; with generated lecture materials and teaching actions assessed and validated by expert educators. Experimental results show consistent gains in lecture content quality, embodied teaching quality, assessment, and personalization over existing approaches, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.

57. 【2606.16409】PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.16409

Authors: Bo Wang, Heyan Huang, Yaolin Li, Wei Tang, Yuan Zhang, Wenbo Li, Mingze Gao, Ge Shi, Chong Feng

Categories: Computation and Language (cs.CL)

Keywords: complex information networks, trains language-model agents, efficiently navigating complex, navigating complex information, GraphRAG trains language-model

Notes: none

Abstract:Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.

58. 【2606.16407】A Mechanistic Understanding of Pronoun Fidelity in LLMs

Link: https://arxiv.org/abs/2606.16407

Authors: Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Faithful and robust, models largely fail, coherent generations, large language models, language models largely

Notes: none

Abstract:Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behavioural approaches, which may not reflect a model's internal workings. Therefore, we provide a mechanistic, model-internal perspective on pronoun fidelity, testing whether three mechanisms -- group entity binding (G), recency bias (R), and stereotypical bias (S) -- are causally implemented across several SOTA language models. Using Boundless Distributed Alignment Search, we find all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but a combination of the three consistently accounts for 91-99.5%. An attention head analysis further reveals two competing copying routes; group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency uses a distributed token-level route that repeats surface forms. In sum, pronoun fidelity arises from competition between simultaneously active causal subspaces.

59. 【2606.16383】Surpassing Scale by Efficiency: A Compact 135M Parameter Foundational LLM Natively Adapted for the Bangla Language

Link: https://arxiv.org/abs/2606.16383

Authors: Rabindra Nath Nandi

Categories: Computation and Language (cs.CL)

Keywords: decentralized local hardware, remains computationally prohibitive, non-Latin scripts remains, NLP landscape, multi-billion parameter architectures

Notes: Submitted to a Workshop

Abstract:While the NLP landscape is dominated by multi-billion parameter architectures, their deployment in low-resource, non-Latin scripts remains computationally prohibitive for edge configurations, mobile systems, and decentralized local hardware. This paper presents bangla-smollm-135m, a highly compact 135-million parameter decoder-only foundational model engineered explicitly for high-efficiency language modeling in the Bangla script. By leveraging a deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M, the model overcomes subword script fragmentation without destabilizing early pretrained parameter states. In zero-shot multi-task benchmark evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), bangla-smollm-135m matches or outperforms models twice its size (Gemma-3-270m) and achieves parity with models in the 1B parameter tier. The model is available at rnnandi/bangla-smollm-135m

60. 【2606.16368】Evaluating LLM Personalization via Semantic Constraint Verification

Link: https://arxiv.org/abs/2606.16368

Authors: Xuran Li, Guanqin Zhang, Imran Razzak, Hakim Hacid, Eleanna Kafeza, Hao Xue, Flora D. Salim

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, Natural Language Inference, Current evaluation paradigms, brittle surface-matching metrics, Large Language

Notes: none

Abstract:Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets to verify personalization constraints via a Natural Language Inference (NLI) model. Moving beyond binary scoring, NLICV categorizes LLM behaviors into four distinct modes: personalization, generalization, sycophancy, and failure. Extensive experiments demonstrate that NLICV aligns closely with human annotations while drastically reducing the latency and token costs associated with LLM judges (up to 2100x inference speedup). Finally, through an ablation-based procedure, NLICV pinpoints the exact sentences driving the constraint verification, yielding faithful, understandable evidence for its evaluations.

61. 【2606.16360】Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

Link: https://arxiv.org/abs/2606.16360

Authors: Hanyu Lin, Min Cai, Jiawei Wen, Haodi Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, externalizing intermediate computation, language models, inference overhead, large language

Notes: website: [this https URL](https://typed-latent-reasoning.github.io)

Abstract:Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead. Latent reasoning offers a promising alternative by carrying part of the computation in continuous representations. However, existing methods typically predefine when latent computation is invoked and how it is allocated during decoding, leaving a key problem unresolved: when to invoke latent computation, what type of computation to perform, and how much budget to allocate. We propose Typed Latent Reasoning (Tyler), a typed and budget-aware framework for latent reasoning during autoregressive decoding. Tyler learns a policy that, at each decoding step, chooses between emitting a text token and switching to a latent computation module specialized for a particular reasoning function. Once invoked, an operator maps the current reasoning state into latent tokens that support global planning, local state updates, or reusable procedural abstraction. Across extensive experiments on three backbone LLMs, Tyler improves accuracy by up to 14.49 points over CoT and by up to 4.30 points over the strongest competing baseline. It further generalizes across diverse reasoning domains and achieves the best final-stage performance with the lowest forgetting.

62. 【2606.16351】TMASC: Transmasculine Attitude and Speech Corpus

Link: https://arxiv.org/abs/2606.16351

Authors: Sidney Wong

Categories: Computation and Language (cs.CL)

Keywords: Attitudes and Speech, including questionnaire responses, transmasculine individuals, Transmasculine Attitudes, Speech Corpus

Notes: Accepted to Interspeech 2026 Main Track

Abstract:We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the vocal health of transmasculine individuals. The audio recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. This paper outlines the development of this corpus and the data collection procedures. To illustrate the utility of this corpus, we present three case studies demonstrating how this crowd-sourced multimodal corpus can be used to support transmasculine individuals. These include the integration of perceptual and acoustic data, the identification of group-level characteristics, and the calibration of acoustic measurements.

63. 【2606.16344】Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection

Link: https://arxiv.org/abs/2606.16344

Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Asher Ali

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: large language model, Travelers increasingly, making these systems, property visibility, increasingly ask large

Notes: 32 Pages

Abstract:Travelers increasingly ask large language model (LLM) assistants which hotel to book, making these systems gatekeepers of property visibility -- yet what moves their recommendations is undocumented. We conduct a pre-specified algorithm audit using a randomized choice-based conjoint: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. We estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0), reproducing human valence-and-price primacy but over-weighting eco-certification and ignoring management response. List position -- a content-free artifact -- shifts recommendations causally, worth about $12 per night. Stated reasons track revealed weights imperfectly. The findings ground generative engine optimization and the accountability of AI infomediaries in causal evidence.
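When attributes are independently randomized, an average marginal component effect can be estimated as a simple difference in mean selection rates between an attribute level and its baseline. The sketch below shows that textbook estimator only; the function name `amce`, the record layout, and the toy data are hypothetical and are not taken from the paper, which may use a regression-based estimator with clustering.

```python
def amce(profiles, attribute, level, baseline):
    """Difference in mean selection rate between `level` and `baseline`.

    `profiles` is a list of dicts, one per hotel profile shown to the
    assistant, each with the attribute values and a 0/1 `chosen` flag.
    """
    def selection_rate(value):
        chosen = [p["chosen"] for p in profiles if p[attribute] == value]
        if not chosen:
            raise ValueError(f"no profiles with {attribute}={value!r}")
        return sum(chosen) / len(chosen)

    return selection_rate(level) - selection_rate(baseline)


# Toy data for illustration only (not results from the audit).
toy_profiles = [
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 1},
    {"rating": "low", "chosen": 1},
    {"rating": "low", "chosen": 0},
]

effect = amce(toy_profiles, "rating", level="top", baseline="low")
```

Multiplying the returned value by 100 gives the effect in percentage points, the unit used in the abstract.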

64. 【2606.16322】PaperJury: Due-Process Review for Bounded LaTeX Revision

Link: https://arxiv.org/abs/2606.16322

Authors: Yiran Wang, Ruixuan An, Biao Wu, Wenhao Wang

Categories: Computation and Language (cs.CL)

Keywords: human-authored LaTeX computer, LaTeX computer science, requires adversarial whole-paper, adversarial whole-paper review, bounded artifact-safe revision

Notes: 10 pages, 5 figures

Abstract:Pre-submission hardening of human-authored LaTeX computer science papers differs from drafting assistance because it requires adversarial whole-paper review, explicit no-fix outcomes, and bounded artifact-safe revision. Existing writing assistants, critique generators, and judge-centered loops lack durable issue identity across rounds, deterministic routing from critique to adjudication, and manuscript control that can reject invalid concerns or defer author-dependent ones. We present PaperJury, a closed-loop review-verdict-revise-verify system built on a deterministic-versus-semantic split: deterministic orchestration manages decomposition, a frozen claim spine, a durable ledger, routing, stopping, and exact-once patch application, while semantic agents are limited to bounded review, judgment, and repair. PaperJury combines bounded holistic review, contestability-based routing, a due-process trial, and risk-proportional guard chains for anchor-bounded edits, yielding terminal outcomes of invalid-drop, valid-fixable, and author-required. In a two-arm expert-review evaluation on held-out Vision, natural language processing, and machine learning papers against four baselines, we assess issue quality, verdict and routing quality, edit safety, convergence behavior, and cost, supporting the thesis that load-bearing safety and completion logic should reside in deterministic orchestration rather than model discretion. PaperJury is available at this https URL.

65. 【2606.16310】QK-Normed MLA: QK normalization without full key caching

Link: https://arxiv.org/abs/2606.16310

Authors: Yizhou Han, Yao Zhao, Jun Zhou, Longfei Li, Ruoyu Sun

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Multi-head Latent Attention, normalization stabilizes attention, compatible with Multi-head, stabilizes attention, Multi-head Latent

Notes: 13 pages, 5 figures, conference-style manuscript

Abstract:Query-key (QK) normalization stabilizes attention by controlling the scale of queries and keys before the dot product, but is not immediately compatible with Multi-head Latent Attention (MLA). MLA achieves efficient decoding by caching low-dimensional latent states instead of full keys, whereas post-projection QK RMSNorm appears to require the fully projected key for every cached token. We show this apparent incompatibility is an implementation artifact, not an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic. The static key-side weight can be absorbed into the MLA query-side projection; the dynamic key statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA's latent decode path. In our 400M runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, while H800 decode benchmarks show less than 2% latency overhead up to 256k context. These results make QK normalization a practical stabilization option for MLA models without requiring full-key caching.
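The decomposition described in the abstract can be checked numerically on a toy example. The sketch below covers the key side only, for a single head and a single cached token: the static RMSNorm weight `g` and a key up-projection `W_uk` are folded into the query, and the key's dynamic statistic is kept as one cached scalar `inv_rms_k`. All names, shapes, and the `eps` placement are assumptions for illustration, not the paper's implementation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def rms(v, eps=1e-6):
    return math.sqrt(sum(x * x for x in v) / len(v) + eps)


def explicit_score(q, W_uk, c, g):
    """Reference path: materialize the full key, RMS-normalize it, then dot with q."""
    k = matvec(W_uk, c)
    scale = 1.0 / rms(k)
    k_normed = [gi * ki * scale for gi, ki in zip(g, k)]
    return dot(q, k_normed)


def absorbed_score(q, W_uk, c, g, inv_rms_k):
    """Latent path: fold g and W_uk into the query, reuse the cached scalar."""
    q_scaled = [gi * qi for gi, qi in zip(g, q)]
    W_uk_T = [list(col) for col in zip(*W_uk)]
    q_latent = matvec(W_uk_T, q_scaled)
    return dot(q_latent, c) * inv_rms_k


rng = random.Random(0)
d_head, d_latent = 4, 2
W_uk = [[rng.uniform(-1, 1) for _ in range(d_latent)] for _ in range(d_head)]
g = [rng.uniform(0.5, 1.5) for _ in range(d_head)]
q = [rng.uniform(-1, 1) for _ in range(d_head)]
c = [rng.uniform(-1, 1) for _ in range(d_latent)]

# Computed once when the token is written to the cache, alongside the latent c.
inv_rms_k = 1.0 / rms(matvec(W_uk, c))

gap = abs(explicit_score(q, W_uk, c, g) - absorbed_score(q, W_uk, c, g, inv_rms_k))
```

The two scores agree up to floating-point rounding, while the decode path touches only the latent `c` and one scalar per cached token. Query-side normalization is omitted here because the query is computed fresh at each step and can be normalized directly.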

66. 【2606.16307】State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

Link: https://arxiv.org/abs/2606.16307

Authors: Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Training tool-augmented LLM, tool-augmented LLM agents, LLM agents requires, agents requires large, tool-grounded conversational data

Notes: 9 pages, 5 figures, 6 tables, 1 algorithm

Abstract:Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.

67. 【2606.16295】VisualClaw: A Real-Time, Personalized Agent for the Physical World

Link: https://arxiv.org/abs/2606.16295

Authors: Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Vision language models, Vision language, complex multimodal tasks, serving as general-purpose, general-purpose interfaces

Notes: H. T. and J. C. contribute to this project equally

Abstract:Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% versus the offline uniform 8-frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

68. 【2606.16285】HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

Link: https://arxiv.org/abs/2606.16285

Authors: Jiangze Yan, Yi Shen, Wenjing Zhang, Jieyun Huang, Zhaoxiang Liu, Ning Wang, Kai Wang, Shiguo Lian

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: compress interaction history, downstream tool failures, credit assignment challenge, distinct credit assignment, Long-horizon agents rely

Notes: Preprint. 2 figures

Abstract:Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.

69. 【2606.16281】Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

Link: https://arxiv.org/abs/2606.16281

Authors: Heecheol Yun, Joonhyung Park, Joowon Kim, Eunho Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Masked Diffusion Language, Diffusion Language Models, Masked Diffusion, Diffusion Language, Language Models

Notes: preprint

Abstract:Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose TIE (Trajectory-based Iterative Ensembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.

70. 【2606.16246】Data Augmentations for Data-Constrained Language Model Pretraining

Link: https://arxiv.org/abs/2606.16246

Authors: Michael K. Chen, Xikun Zhang, Zhen Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: high-quality text generation, compute capacity outpaces, language model pretraining, demands productive multi-epoch, text generation

Notes: none

Abstract:As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at this https URL
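Two of the augmentation categories named in the abstract, token-level noise and target offset prediction, can be sketched on plain lists of token ids. The function names, the noise rate `p`, and the choice to corrupt the input sequence as a whole are assumptions for illustration; the abstract does not specify how the noise interacts with the training targets.

```python
import random


def random_replace(tokens, vocab_size, p, rng):
    """Replace each token with a uniformly sampled id with probability p."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def mask_tokens(tokens, mask_id, p, rng):
    """Replace each token with a dedicated mask id with probability p."""
    return [mask_id if rng.random() < p else t for t in tokens]


def offset_pairs(tokens, i):
    """Pair the token at position t with the target at position t + i.

    i = 1 is ordinary next-token prediction; larger i gives offset targets.
    """
    return [(tokens[t], tokens[t + i]) for t in range(len(tokens) - i)]


rng = random.Random(0)
sequence = [10, 11, 12, 13, 14]

noisy = random_replace(sequence, vocab_size=100, p=0.15, rng=rng)
masked = mask_tokens(sequence, mask_id=0, p=0.15, rng=rng)
pairs = offset_pairs(sequence, i=2)
```

Resampling the noise every epoch means the model sees a different corrupted view of the same fixed corpus each time, which is the regularizing effect the abstract relies on.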

71. 【2606.16243】LiFT: Local Search via Linear Programming for Overfitting-Controlled Transformers

Link: https://arxiv.org/abs/2606.16243

Authors: Abhishek Shukla, Anikeit Khanna, Ankur Sinha, Faiz Hamid

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Linear Programming, paper proposes, explicit control, pretrained transformer models, Programming

Notes: 22 pages, 6 figures, published in The 20th Learning and Intelligent Optimization Conference (LION 2026)

Abstract:This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.

72. 【2606.16242】Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

Link: https://arxiv.org/abs/2606.16242

Authors: David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal, Chawin Sitawarin

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: continuously improves jailbreak-detection, Rapid Response generates, Rapid Response, including Anthropic, improves jailbreak-detection classifiers

Comments: Spotlight at ICML 2026

Abstract:The Rapid Response (RR) framework, deployed in production systems, including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.

73. 【2606.16240】Creative Collision: Directorial Persona Steering and Competition in Large Language Models

Link: https://arxiv.org/abs/2606.16240

Authors: Subramanyam Sahoo, Justin Shenk

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, prior work injects, inference time, powerful tool, tool for shaping

Comments: Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

Abstract:Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer setting in which two semantically opposing steering vectors are superimposed -- a regime we call Creative Collision. Concretely, we construct directorial persona vectors for Steven Spielberg (optimistic, redemptive moral valence) and Martin Scorsese (dark, morally ambiguous) via mean-difference activation contrast on curated screenplay-derived corpora, then interpolate between them with a scalar mixing parameter $\alpha \in [0,1]$ and a steering coefficient $\lambda$. Across five evaluation axes -- moral valence, generation coherence, surface style, directional dominance, and vector geometry -- three principal findings emerge: (i) Spielberg's representational signature exhibits robust directional dominance, suppressing Scorsese's moral influence across almost the entire interpolation range; (ii) intermediate collision points paradoxically improve generation coherence relative to pure single-director steering at high $\lambda$; and (iii) both personas localise maximally to layer 28 of a 40-layer decoder-only transformer, revealing a shared moral-tone substrate. These results illuminate the geometry of competing semantic directions in transformer residual streams and have direct implications for controllable creative generation and value-aligned narrative synthesis.
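
As a rough illustration of the collision setup described in this abstract, the sketch below builds two mean-difference vectors and superimposes them with a mixing parameter and a steering coefficient. The abstract does not give the exact combination rule, so the convex form `lam * (alpha * v_a + (1 - alpha) * v_b)`, all function names, and the toy three-dimensional activations are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference(persona_acts: List[Vector], neutral_acts: List[Vector]) -> Vector:
    """Mean-difference contrast: mean(persona) - mean(neutral)."""
    p, q = mean_vector(persona_acts), mean_vector(neutral_acts)
    return [a - b for a, b in zip(p, q)]


def collide(v_a: Vector, v_b: Vector, alpha: float, lam: float) -> Vector:
    """Assumed combination rule: lam * (alpha * v_a + (1 - alpha) * v_b)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [lam * (alpha * a + (1.0 - alpha) * b) for a, b in zip(v_a, v_b)]


def steer(hidden: Vector, steering: Vector) -> Vector:
    """Add the steering vector to one residual-stream activation."""
    return [h + s for h, s in zip(hidden, steering)]


# Toy activations (made up; real ones would come from a model's hidden states).
v_spielberg = mean_difference([[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]], [[0.0, 0.0, 0.0]])
v_scorsese = mean_difference([[-2.0, 0.0, 1.0]], [[0.0, 0.0, 0.0]])
mixed = collide(v_spielberg, v_scorsese, alpha=0.5, lam=2.0)
steered = steer([0.0, 0.0, 0.0], mixed)
```

With `alpha = 1` only the first persona vector is applied and with `alpha = 0` only the second; intermediate values correspond to the collision points the abstract refers to.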

74. 【2606.16215】PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

Link: https://arxiv.org/abs/2606.16215

Authors: Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Multi-turn tool-use agents, call tools, Multi-turn tool-use, interaction turns, adapt to observations

Comments: Project page: [this https URL](https://zhenbangdu.github.io/pact-project-page/)

Abstract:Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

75. 【2606.16211】Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

Link: https://arxiv.org/abs/2606.16211

Authors: Xingyu Tan, Shiyuan Liu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang

Categories: Computation and Language (cs.CL)

Keywords: Biomedical question answering, increasingly requires reasoning, question answering, increasingly requires, interacting entities

Comments: none

Abstract:Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks mainly focus on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings, covering shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count renderings. To support this benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average. Moreover, BioWeave consistently improves different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.

76. 【2606.16206】Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

Link: https://arxiv.org/abs/2606.16206

Authors: Junyi Yao, Zihao Zheng, Baichuan Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: Large language models, necessarily imply stronger, stronger task-solving ability, Large language, imply stronger learning

Comments: none

Abstract:Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
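
The solving-versus-pedagogy diagnostic in this abstract can be sketched as a per-model gap, a rank shift, and a correlation between the two composites. The scores below are made up for illustration and are not MathTutorBench results; the function names are likewise hypothetical.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models from 1 (best) downward by score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def rank_shift(solving: Dict[str, float], pedagogy: Dict[str, float]) -> Dict[str, int]:
    """Solving rank minus pedagogy rank; positive means the model ranks better on pedagogy."""
    rs, rp = ranks(solving), ranks(pedagogy)
    return {m: rs[m] - rp[m] for m in solving}


# Hypothetical composite scores for three models.
solving = {"model_a": 0.80, "model_b": 0.60, "model_c": 0.40}
pedagogy = {"model_a": 0.50, "model_b": 0.70, "model_c": 0.45}

names = list(solving)
gap = {m: solving[m] - pedagogy[m] for m in names}
shift = rank_shift(solving, pedagogy)
r = pearson([solving[m] for m in names], [pedagogy[m] for m in names])
```

A large positive gap flags a model that solves well but scores comparatively poorly on pedagogy, which is the kind of misalignment the abstract reports.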

77. 【2606.16183】LLM-Powered Virtual Population for Demand Simulation and Pricing

Link: https://arxiv.org/abs/2606.16183

Authors: Chengpiao Huang, Kaizheng Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM-powered virtual population, virtual population model, develop an LLM-powered, LLM-powered virtual, virtual population

Comments: 18 pages, 7 figures

Abstract:We develop an LLM-powered virtual population model that simulates demand for pricing decisions, in settings where products are described by rich unstructured information, such as text descriptions and images, and where decision makers need not only mean-demand predictions but also uncertainty estimates for counterfactual prices. Our model represents exposed customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM elicits a persona-level purchase probability using both structured persona information and unstructured product information. These probabilities are aggregated through calibrated mixture weights to form a predictive distribution of aggregate demand. The resulting simulator can evaluate counterfactual prices under various pricing objectives, including expected revenue and risk-aware criteria such as conditional value at risk. We test the framework on an online H&M fashion dataset with product descriptions and images. The calibrated LLM-based simulator achieves the best overall predictive performance among the models considered, and supports sample-efficient pricing decisions. Our framework provides a practical way to use LLMs as demand simulators for products with limited historical demand data but rich product information. By producing a full predictive demand distribution rather than only a point forecast, it enables managers to compare candidate prices, quantify demand uncertainty, and choose prices that target either average-case revenue or risk-aware objectives.
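
The aggregation step in this abstract (persona probabilities, mixture weights, a predictive demand distribution, then revenue and CVaR) can be sketched as follows. This is a minimal sketch under strong assumptions: customers are treated as independent draws so that aggregate demand is binomial, the persona purchase probabilities are hard-coded where the paper would query an LLM, and every name and number here is hypothetical.

```python
from math import comb
from typing import List


def mixture_purchase_prob(weights: List[float], persona_probs: List[float]) -> float:
    """Probability that one exposed customer buys, averaged over personas."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * p for w, p in zip(weights, persona_probs))


def demand_distribution(n_customers: int, p_buy: float) -> List[float]:
    """Binomial pmf over aggregate demand 0..n (assumes independent customers)."""
    return [
        comb(n_customers, k) * p_buy**k * (1 - p_buy) ** (n_customers - k)
        for k in range(n_customers + 1)
    ]


def expected_revenue(price: float, pmf: List[float]) -> float:
    """Price times expected demand."""
    return price * sum(k * q for k, q in enumerate(pmf))


def cvar_revenue(price: float, pmf: List[float], level: float) -> float:
    """Mean revenue over the worst `level` fraction of outcomes (lower tail)."""
    remaining, total = level, 0.0
    for k, q in enumerate(pmf):  # demand ascending means revenue ascending
        take = min(q, remaining)
        total += take * price * k
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / level


# Hypothetical inputs: two personas, purchase probabilities at a price of 10.
weights = [0.5, 0.5]
persona_probs_at_10 = [0.8, 0.2]  # stand-ins for LLM-elicited probabilities
p = mixture_purchase_prob(weights, persona_probs_at_10)
pmf = demand_distribution(n_customers=2, p_buy=p)
mean_rev = expected_revenue(10.0, pmf)
tail_rev = cvar_revenue(10.0, pmf, level=0.5)
```

Repeating the last four lines for each candidate price gives one expected-revenue value and one CVaR value per price, which is enough to compare prices under either objective.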

78. 【2606.16158】Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

Link: https://arxiv.org/abs/2606.16158

Authors: Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, perceive fine-grained details

Comments: none

Abstract:While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at this https URL.
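
One way to picture the routing step in this abstract is an uncertainty score computed from the first-token distribution, compared against a threshold calibrated on held-out difficult samples. The sketch below is an assumption-laden illustration, not the LazyMCoT procedure: it uses Shannon entropy as the first-token statistic and a split-conformal style lower quantile as the threshold, and all names and numbers are hypothetical.

```python
import math
from typing import List


def first_token_entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of the first-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def calibrate_threshold(hard_scores: List[float], delta: float) -> float:
    """Lower conformal quantile of uncertainty scores from known-difficult samples.

    Routing every sample whose score is >= the returned value keeps at least
    a (1 - delta) fraction of exchangeable difficult samples, in expectation.
    """
    n = len(hard_scores)
    k = math.floor(delta * (n + 1))
    if k == 0:
        return float("-inf")  # too few calibration points: route everything
    return sorted(hard_scores)[k - 1]


def needs_grounding(probs: List[float], tau: float) -> bool:
    """Route to the expensive grounding path only when uncertainty is high."""
    return first_token_entropy(probs) >= tau


# Hypothetical calibration scores from samples that needed grounding.
hard_scores = [0.9, 0.5, 0.7, 0.3, 1.1, 0.6, 0.8]
tau = calibrate_threshold(hard_scores, delta=0.25)

uncertain = needs_grounding([0.5, 0.5], tau)    # near-uniform first token
confident = needs_grounding([0.99, 0.01], tau)  # peaked first token
```

Confident samples skip the grounding module entirely, which is where the latency saving described in the abstract would come from.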

79. 【2606.16151】GRACE: Step-Level Benchmark for Faithful Reasoning over Context

Link: https://arxiv.org/abs/2606.16151

Authors: Hoang Pham, Dong Le, Anh Tuan Luu

Categories: Computation and Language (cs.CL)

Keywords: document-grounded question answering, reasoning tasks require, tasks require models, input context, rule-based deduction

Comments: none

Abstract:Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

80. 【2606.16140】VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Link: https://arxiv.org/abs/2606.16140

Authors: Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: technical report introduces, strictly small-model regime, report introduces, technical report, developed to investigate

Comments: none

Abstract:This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

81. 【2606.16137】XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16137

Authors: Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Speech deepfake detection, systems require trustworthy, Speech deepfake, require trustworthy explanations, deepfake detection

Comments: Accepted at Interspeech 2026

Abstract:Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation approaches mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals that are tightly coupled with model decisions and harder for humans to understand than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

82. 【2606.16127】AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

Link: https://arxiv.org/abs/2606.16127

Authors: Andreas Einwiller, Max Klabunde, Florian Lemmerich

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: users' everyday lives, increasing central role, extent specific models, everyday lives, attitudes and characteristics

Comments: v1, 50 pages

Abstract:The worldwide surge of authoritarianism, combined with the increasingly central role of LLMs in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We introduce AuAu, a comprehensive benchmark that aims to assess the risk of LLMs generating responses with authoritarian tendencies. This benchmark combines three evaluation approaches: (i) psychometric questions from an extensive pool of 15 human-validated instruments; (ii) contextual behavior vignettes probing intended actions in concrete situations; and (iii) responses to realistic user prompts. Unlike prior work, AuAu evaluates not only a general closeness towards authoritarianism but also the established sub-concepts Authoritarian Aggression, Authoritarian Submission, and Conventionalism. Evaluating 17 models from China, the EU, Russia, and the USA, we find that all tested models exhibit substantial authoritarian response rates under the psychometric evaluation, though rates drop significantly in increasingly realistic downstream tasks. We further find that an authoritarian system prompt easily manipulates 15 out of 17 models to promote increased authoritarianism. Our results underscore the need for continued, systematic auditing of LLM-based AI systems to detect and ultimately mitigate undesired authoritarian tendencies in generated output. Our code and data are available at: this https URL

83. 【2606.16118】Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

Link: https://arxiv.org/abs/2606.16118

Authors: Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, Leilani H. Gilpin

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Large Language Models, Large Language, approximation remains unclear, heuristic approximation remains, achieve strong performance

Comments: 10 pages, submitted to COLM 2026 (under review, average score of 6.25 across 4 reviewers) and accepted by the AI4Law workshop at ICML. This is the version where we already addressed most of the reviews from the COLM reviewers

Abstract:Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.

84. 【2606.16111】Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

Link: https://arxiv.org/abs/2606.16111

Authors: Junyi Li, Xiaowei Qian, Yingyi Zhang, Wenlin Zhang, Guojing Li, Sheng Zhang, Xiao Han, Yichao Wang, Xiangyu Zhao

Categories: Computation and Language (cs.CL)

Keywords: Recent advances, tool-integrated language agents, solve complex reasoning, advances in tool-integrated, agents have significantly

Comments: ICML 2026 Spotlight Paper

Abstract:Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
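
The second-stage idea in this abstract, ranking rollouts by Pareto dominance rather than by a scalarized reward, can be sketched with plain nondominated sorting. The mapping from front index to advantage used here (mean rank minus rank) is an assumption for illustration, as are the function names and the toy rollouts; the abstract does not specify ParetoPO's actual formula.

```python
from typing import List, Optional, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is >= on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points: List[Sequence[float]]) -> List[int]:
    """Front index per point: 0 for nondominated, 1 after peeling front 0, etc."""
    ranks: List[Optional[int]] = [None] * len(points)
    remaining = set(range(len(points)))
    front = 0
    while remaining:
        current = {
            i
            for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in current:
            ranks[i] = front
        remaining -= current
        front += 1
    return [r for r in ranks if r is not None]


def rank_advantages(ranks: List[int]) -> List[float]:
    """Assumed mapping: mean rank minus rank, so earlier fronts get positive credit."""
    mean = sum(ranks) / len(ranks)
    return [mean - r for r in ranks]


# Toy rollouts as (accuracy, -tool_calls); both objectives are maximized.
rollouts = [(1.0, -2), (1.0, -5), (0.0, -1), (0.0, -6)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(ranks)
```

The third rollout is wrong but cheap, so it sits on the first front next to the accurate two-call rollout; a fixed scalar weighting could rank it below the accurate but wasteful second rollout, which is the distinction dominance-based ranking preserves.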

85. 【2606.16100】Your "Pro" LLM Subscription May Actually Be "Free": Exposing Fingerprint Spoofing Risks in LLM Inference Services

Link: https://arxiv.org/abs/2606.16100

Authors: Jiahao Zhang, Xiuyu Li, Suhang Wang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, Large Language, users increasingly rely, advertised premium models, APIs become ubiquitous

Comments: none

Abstract:As Large Language Model (LLM) APIs become ubiquitous, users increasingly rely on black-box fingerprinting to verify that providers are serving the advertised premium models. However, these methods may overlook adversarial providers who manipulate model weights to cheat the fingerprint process. We introduce a novel threat termed fingerprint spoofing, where a malicious provider stealthily serves a weaker model that has been parameter-efficiently fine-tuned to mimic a stronger model, thereby evading user-side fingerprinting. We first formally prove that user-side resource constraints (i.e., finite query budgets and weak fingerprinting classifiers) make current fingerprinting vulnerable to fingerprint spoofing. Guided by this theoretical analysis, we propose GhostPrint, a cost-effective attack framework leveraging surrogate modeling, reward-ranked fine-tuning, and knowledge distillation. Extensive evaluations in both static and continual fingerprinting settings demonstrate that GhostPrint allows weak models to consistently bypass representative fingerprint methods while maintaining utility at a low fine-tuning cost, exposing a critical vulnerability in current LLM fingerprinting pipelines.

86. 【2606.16093】Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

Link: https://arxiv.org/abs/2606.16093

Authors: Kuzey Torlak, Hüseyin Arda Arslan, Anıl Dervişoğlu, Beyza Nur Deniz, Onur Boyar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: long-range dependencies remains, Modeling long-range dependencies, Gated State Spaces, long-range dependencies, dependencies remains

Comments: 16 pages, 9 tables, 4 figures

Abstract:Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while State Space Models (SSMs) scale linearly ($O(N)$) but suffer from a selective recall bottleneck, struggling to retrieve precise information from compressed states. This creates a fundamental tradeoff between efficiency and perplexity. To tackle these challenges, we propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA allows each branch to specialize: GSS captures global context, while attention performs selective retrieval, with FFN providing complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, which is comparable to the pure attention baseline while delivering 24% higher throughput and up to 40% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.
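
To make the "parallel branches fused by a learnable mixing mechanism" idea concrete, the sketch below mixes three branch outputs with softmax-normalized weights. The abstract does not describe the actual mixing mechanism, so the softmax gate, the per-layer scalar logits, and the toy branch outputs are all assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_branches(branch_outputs: List[List[float]], mixing_logits: List[float]) -> List[float]:
    """Weighted sum of branch outputs; weights are softmax(mixing_logits)."""
    weights = softmax(mixing_logits)
    dim = len(branch_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, branch_outputs))
        for i in range(dim)
    ]


# Toy outputs for one token from the GSS, attention, and FFN branches.
gss_out, attn_out, ffn_out = [3.0, 0.0], [0.0, 3.0], [0.0, 0.0]
mixing_logits = [0.0, 0.0, 0.0]  # would be trained parameters in a real model
fused = mix_branches([gss_out, attn_out, ffn_out], mixing_logits)
```

With equal logits each branch contributes one third; training the logits would let a layer lean on whichever branch is most useful.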

87. 【2606.16084】Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

Link: https://arxiv.org/abs/2606.16084

Authors: Mudit Sinha, Sanika Chavan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: lower-level units combine, combine into larger, Sperm Whale Project, Dominica Sperm Whale, larger sequences

Comments: 22 pages, 2 figures, 4 tables. Preprint

Abstract:Human language has often been described as combining structure at two levels: lower-level units combine into larger units, which then combine into larger sequences. We test for this design feature, duality of patterning, in sperm whale codas using 1,483 codas from the Dominica Sperm Whale Project. Because acoustic similarity can imitate symbolic structure, we treat the problem as computational-linguistic structure discovery from continuous audio rather than as a direct claim about language or meaning. We use a consensus of frozen audio encoders, held-out structural tests, per-statistic nulls, and acoustic-null recoverability gates. The evidence supports a narrow two-tier architecture. At the lower tier, clicks compose into codas not by a stable ordered rule, but by which clicks are present together with their inter-click rhythm. At the upper tier, coda tokens show bout-level sequential dependence, with an NSB second-order transfer-entropy lift of 0.132 bits (p = 0.002). Under tempo scaling, encoder-derived click identity is strongly rate-bound, while coda identity remains substantially more stable, yielding a measurable abstraction gradient across the click-to-coda step. Rhythm-only baselines recover substantial lower-tier structure but fail to reproduce the upper-tier sequential-dependence signal. We do not claim language, semantics, perception, or human-like phonemes. Instead, we report representation-level evidence for a duality-of-patterning-like architecture whose lower tier is rhythmic rather than segmental, and provide a portable null-controlled framework for testing combinatorial structure in induced acoustic token systems.

88. 【2606.16074】PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

Link: https://arxiv.org/abs/2606.16074

Authors: Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: patients' lived experiences, remains largely unstructured, patient-centered outcomes research, Patient-generated text, social context

Comments: none

Abstract:Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: this https URL

89. 【2606.16047】From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

Link: https://arxiv.org/abs/2606.16047

Authors: Jakub Bąba, Jarosław A. Chudziak

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, general reasoning capabilities, strong general reasoning, Language Models

Comments: Accepted for publication in the proceedings of KES 2026

Abstract:Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.
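
The confidence gating idea in this abstract can be illustrated with a minimal sketch. The function name `gated_predict`, the threshold value, and the stub debate are all hypothetical; the abstract does not say how confidence is computed or which threshold is used.

```python
from typing import Callable, Tuple

def gated_predict(
    initial: Tuple[str, float],
    debate: Callable[[], str],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Tuple[str, bool]:
    """Return (label, debated).

    Accept the initial label when its confidence clears the threshold;
    otherwise run the more expensive debate and use its verdict.
    """
    label, confidence = initial
    if confidence >= threshold:
        return label, False
    return debate(), True

# Stub debate that always answers "Attack" (illustration only).
confident = gated_predict(("Support", 0.95), lambda: "Attack")
uncertain = gated_predict(("Support", 0.40), lambda: "Attack")
print(confident, uncertain)
```

Only the uncertain pair triggers the debate, which mirrors the selective-debate setting the abstract reports as the strongest training-free variant.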

90. 【2606.16026】In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

Link: https://arxiv.org/abs/2606.16026

Authors: Isaac Hands, Bin Huang, Adam Spannaus, John Gounley, Heidi Hanson, Eric Durbin, Sally R. Ellingson

Categories: Computation and Language (cs.CL)

Keywords: supervised biomedical NLP, biomedical NLP models, hampers supervised biomedical, biomedical NLP, supervised pipeline designed

Abstract:We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.
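
To make the operating-point trade-off concrete, here is a small sketch of picking a decision threshold under a false-negative-rate cap. The helper names, the toy scores, and the selection rule (highest threshold that still meets the cap) are assumptions for illustration, not the paper's procedure. It also assumes the data contains at least one positive and one negative example.

```python
from typing import List, Optional, Tuple

def rates(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Return (FNR, FPR) when a report is flagged positive if score >= threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fn / positives, fp / negatives

def pick_threshold(
    scores: List[float], labels: List[int], max_fnr: float
) -> Optional[Tuple[float, float, float]]:
    """Highest candidate threshold whose FNR stays within max_fnr.

    A higher threshold flags fewer reports, so it lowers reviewer workload.
    Returns (threshold, fnr, fpr), or None if no candidate meets the cap.
    """
    best = None
    for t in sorted(set(scores)):
        fnr, fpr = rates(scores, labels, t)
        if fnr <= max_fnr:
            best = (t, fnr, fpr)
    return best

scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
choice = pick_threshold(scores, labels, max_fnr=0.0)
print(choice)
```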

91. 【2606.16019】Scaling Human and G2P Supervision for Robust Phonetic Transcription

Link: https://arxiv.org/abs/2606.16019

Authors: Alexander Metzger, Aruna Srivastava, Ruslan Mukhamedvaleev

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: Expert phonetic annotation, Expert phonetic, non-standard dialects, dialects and atypical, Expert

Comments: Accepted to Interspeech 2026

Abstract: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We study how automatic phonetic transcription performance scales with human and G2P supervision in English. Using a curated 80-hour benchmark spanning native, non-native and post-stroke speech, we identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond this threshold, it provides no significant benefit and can reduce cross-dialect robustness. What is effective after this threshold is ASR pretraining, which we use to achieve a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. These results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.

92. 【2606.16011】Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Link: https://arxiv.org/abs/2606.16011

Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner

Categories: Computation and Language (cs.CL)

Keywords: approach correct answers, closely large language, approach correct, LLMs stick, plausible counter-argument

Comments: Accepted to the non-archival workshops AI4Good and AIWILD at ICML 2026

Abstract:Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at this https URL and this https URL.
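
The flip-rate metric described here can be sketched as follows. The record layout `(answer_before, answer_after, gold)` is a hypothetical format chosen for illustration; the released protocol may store its challenge records differently.

```python
from typing import List, Tuple

def flip_rate(records: List[Tuple[str, str, str]]) -> float:
    """Fraction of initially correct answers that change after the challenge.

    Each record is (answer_before, answer_after, gold). Questions the model
    answered incorrectly before the challenge are excluded, matching the
    protocol's "after a model answers correctly" condition.
    """
    eligible = [(before, after) for before, after, gold in records if before == gold]
    if not eligible:
        return 0.0
    flips = sum(1 for before, after in eligible if after != before)
    return flips / len(eligible)

records = [
    ("A", "A", "A"),  # correct, held firm
    ("B", "C", "B"),  # correct, flipped
    ("D", "A", "C"),  # initially wrong, excluded
    ("A", "B", "A"),  # correct, flipped
]
rate = flip_rate(records)
print(f"{rate:.3f}")
```

A difference between two such rates, multiplied by 100, gives the percentage-point (pp) gaps quoted in the abstract.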

93. 【2606.16009】Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

Link: https://arxiv.org/abs/2606.16009

Authors: Claudio Fantinuoli

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: achieved remarkable progress, approaching human parity, systems approaching human, real-time branch, standard benchmarks

Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains far inferior to interpreter-mediated communication, revealing what we term the "accuracy illusion": systems that appear accurate on paper but fail in practice to support smooth, goal-oriented interaction. This paper defines MI as a distinct subfield of speech translation, with its own characteristics and the need for evaluation methods grounded in communicative effectiveness rather than isolated fidelity metrics. Drawing on insights from interpreting studies, we identify critical dimensions of professional interpreting practice that are overlooked by current systems, and consolidate them into three interdependent design priorities for future MI: agency (context-sensitive initiative and repair), grounding (multimodal and discourse-level situational awareness), and experience (adaptive improvement through real interaction). Together, these priorities chart a path toward closing the usability gap and enabling systems that can sustain authentic multilingual communication in real time.

94. 【2606.16000】GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

Link: https://arxiv.org/abs/2606.16000

Authors: Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasya Palienko

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Guarded Reward-guided Agent, Guarded Reward-guided, Reward-guided Agent Correction, Science for pre-deployment, Data Science

Abstract:We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

95. 【2606.15998】Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

Link: https://arxiv.org/abs/2606.15998

Authors: Utshab Kumar Ghosh, Shubham Chatterjee

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: assuming that semantically, entity, Entity Relevance, OER, non-relevant documents

Comments: ICTIR '26

Abstract: Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show this assumption is insufficient and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), whether an entity is topically related to a query, and Observable Entity Relevance (OER), whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($\kappa \approx 0$), while OER operationalizations agree substantially ($\kappa \approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.
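
For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa between two label lists. It assumes the abstract's $\kappa$ denotes Cohen's kappa, which the abstract does not state explicitly, and the toy labels are made up.

```python
from typing import List

def cohens_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators' labels.

    p_o is the observed agreement and p_e the agreement expected by chance
    from each annotator's label frequencies. Undefined when p_e equals 1.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    return (p_o - p_e) / (1 - p_e)

# Toy CER-style vs OER-style judgments for four entities (hypothetical data).
cer = [1, 1, 0, 0]
oer = [1, 0, 1, 0]
print(cohens_kappa(cer, oer))
```

In this toy case the two lists agree on half of the entities, which is exactly the chance level for balanced labels, so kappa comes out at zero.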

96. 【2606.15984】ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

Link: https://arxiv.org/abs/2606.15984

Authors: Andrei-Marius Avram, Aureliu-Valentin Antonie, Ştefan-Bogdan Badea, Andrei Florea, Robert-Nicolae Zaharoiu, Dumitru-Clementin Cercel

Categories: Computation and Language (cs.CL)

Keywords: proceedings faces significant, faces significant hurdles, significant hurdles due, parliamentary proceedings faces, Automated transcription

Abstract:Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. We address the inherent instability of adversarial objectives in generative architectures by introducing an exponential decay mechanism for the adversarial coefficients. Furthermore, we implement an LLM-guided decoding strategy with position-dependent weighting to facilitate morphological completion of truncated terminal words. Our results demonstrate that the proposed framework significantly reduces WER and achieves an F1-score of 96.6% in morphological reconstruction.
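
One plausible reading of the exponential decay mechanism is a per-step schedule on the adversarial weight. The sketch below assumes decay over training steps and a simple additive loss; the function names, the constants, and the way the adversarial terms enter the total loss are all illustrative assumptions rather than details from the paper.

```python
import math
from typing import List

def adversarial_weight(step: int, lambda0: float = 1.0, decay: float = 0.001) -> float:
    """Exponentially decayed coefficient: lambda0 * exp(-decay * step)."""
    return lambda0 * math.exp(-decay * step)

def total_loss(asr_loss: float, adversarial_terms: List[float], step: int) -> float:
    """ASR loss plus the decayed sum of adversarial terms.

    The adversarial terms are assumed to be already sign-adjusted penalties,
    one per demographic attribute (for example age, gender, dialect).
    """
    return asr_loss + adversarial_weight(step) * sum(adversarial_terms)

early = total_loss(2.0, [0.5, 0.3, 0.2], step=0)
late = total_loss(2.0, [0.5, 0.3, 0.2], step=5000)
print(early, late)
```

Under this schedule the adversarial objective dominates less and less as training proceeds, which is one way to damp the instability the abstract mentions.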

97. 【2606.15980】Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

Link: https://arxiv.org/abs/2606.15980

Authors: Evan Duan

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: deployment safety stacks, increasingly common layer, language model internal, model internal representations-are, safety stacks

Abstract: Activation monitors, lightweight probes trained on a language model's internal representations, are an increasingly common layer in deployment safety stacks. Deployed models, however, are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.

98. 【2606.15974】A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

Link: https://arxiv.org/abs/2606.15974

Authors: Weixiao Zhou, Gengyao Li, Xianfu Cheng, Junnan Zhu, Feifei Zhai, Zhoujun Li

Categories: Computation and Language (cs.CL)

Keywords: evaluation remains limited, sample sizes, significant advancement, remains limited, limited by insufficient

Comments: 21 pages, 18 figures

Abstract:Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning systems and efficient small models, or lack fine-grained, multi-dimensional assessments. To bridge these gaps, we propose OmniCSEval, a unified benchmark comprising 1,800 diverse conversations across six real-world scenarios, featuring context lengths ranging from 128 to 32k tokens. For fine-grained evaluation, we employ a bidirectional fact-checking framework that integrates key fact matching to assess completeness and conciseness, alongside summary fact verification to evaluate faithfulness. To ensure reliable assessment, we establish a human-LLM collaborative pipeline for key fact extraction and a multi-LLM consensus verifier for summary fact decomposition. Leveraging this framework, we evaluate 28 LLMs across four distinct categories grouped by reasoning capability and model scale. Our extensive empirical study reveals critical insights regarding the cross-scenario challenges current LLMs continue to face, the impacts of reasoning and scale, and the efficiency and adaptability of reasoning models. We also provide guidance for system selection in real-world deployments.

99. 【2606.15972】Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

Link: https://arxiv.org/abs/2606.15972

Authors: Ji Feng, Zhouxing Shi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: verify reasoning outputs, formal proof assistants, increasingly applied, machine-checkable rigor, enabling use cases

Comments: 15 pages, 1 figure. Code available at [this https URL](https://github.com/ucr-rai/base-and-edit)

Abstract: With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost, a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at this https URL.
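
The base-and-edit idea can be illustrated with a toy string-based stand-in. In the paper a trained rewriter (LEANSCRIBE) localizes the answer and produces the edit function; the sketch below simply assumes the answer expression is already known and occurs exactly once in the base statement. The Lean statement and the candidate answers are made up for illustration.

```python
from typing import Callable, List

def make_edit_function(base_statement: str, base_answer: str) -> Callable[[str], str]:
    """Build a reusable edit that swaps the answer expression in place.

    Requires the answer expression to occur exactly once, so the
    substitution is unambiguous.
    """
    if base_statement.count(base_answer) != 1:
        raise ValueError("answer expression must occur exactly once")
    prefix, suffix = base_statement.split(base_answer)
    return lambda candidate: prefix + candidate + suffix

# One (hypothetical) formalization of the base candidate...
base = "theorem problem (x : ℝ) (h : 2 * x + 3 = 11) : x = 4 := by sorry"
edit = make_edit_function(base, "x = 4")

# ...reused for every sampled candidate answer without re-formalizing.
candidates: List[str] = ["4", "5", "7/2"]
statements = [edit(f"x = {c}") for c in candidates]
for s in statements:
    print(s)
```

Only the base statement needs an autoformalizer call in this toy setup; the remaining statements come from a cheap text edit.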

100. 【2606.15971】SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

Link: https://arxiv.org/abs/2606.15971

Authors: Yuchao Wu, Junqin Li, XingCheng Liang, Yongjie Chen, Yinghao Liang, Linyuan Mo, Guanxian Li

Categories: Computation and Language (cs.CL)

Keywords: large language models, access external knowledge, offers an effective, Retrieval-Augmented Generation, effective approach

Abstract: Retrieval-Augmented Generation (RAG) offers an effective approach for large language models to access external knowledge. However, existing methods rely on dense similarity retrieval and face inherent limitations in handling structured constraints and multi-hop reasoning. Incorporating knowledge graphs partially alleviates these issues, but at the cost of semantic fragmentation, high maintenance overhead, and difficult incremental updates. This paper introduces SAG (SQL-Retrieval Augmented Generation), a structured architecture for retrieval and agent systems. Instead of pre-building a global static graph, SAG converts each chunk into one semantically complete event and a set of indexing entities, then uses SQL join queries to dynamically link events that share entities into local hyperedges, constructing, at query time, a dynamically instantiated local index structure. This design avoids the need for global graph rebuilding and ongoing maintenance; the system naturally supports incremental writes, concurrent processing, and continuous scaling through its reliance on standard database infrastructure. Across HotpotQA, 2WikiMultiHop, and MuSiQue, three standard multi-hop benchmarks, SAG achieves the best results on 8 out of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the benchmark with the highest multi-hop reasoning this http URL has also been deployed at a production scale of hundreds of millions of data items, with online retrieval latency kept within seconds. Project site and code are available at this https URL.
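
The query-time linking step can be sketched with SQLite from the Python standard library. The two-table schema (`events`, `event_entities`), the toy rows, and the `linked_events` helper are assumptions made for this illustration; the abstract does not describe SAG's actual schema or queries.

```python
import sqlite3

# In-memory database with one event per chunk and its indexing entities.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE event_entities (event_id INTEGER, entity TEXT);
""")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "Marie Curie won the Nobel Prize in Physics in 1903."),
    (2, "Marie Curie was born in Warsaw."),
    (3, "Warsaw is the capital of Poland."),
])
conn.executemany("INSERT INTO event_entities VALUES (?, ?)", [
    (1, "Marie Curie"), (1, "Nobel Prize"),
    (2, "Marie Curie"), (2, "Warsaw"),
    (3, "Warsaw"), (3, "Poland"),
])

def linked_events(connection: sqlite3.Connection, seed_entity: str) -> list:
    """Events mentioning seed_entity, plus events sharing any entity with them.

    The self-join groups events around shared entities at query time,
    so no global graph has to be built or maintained in advance.
    """
    rows = connection.execute("""
        SELECT DISTINCT b.event_id
        FROM event_entities AS seed
        JOIN event_entities AS a ON a.event_id = seed.event_id
        JOIN event_entities AS b ON b.entity = a.entity
        WHERE seed.entity = ?
        ORDER BY b.event_id
    """, (seed_entity,)).fetchall()
    return [row[0] for row in rows]

print(linked_events(conn, "Nobel Prize"))
print(linked_events(conn, "Warsaw"))
```

Because new chunks are just new rows, incremental writes need no rebuild step, which is the property the abstract attributes to relying on standard database infrastructure.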

101. 【2606.15963】PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity

Link: https://arxiv.org/abs/2606.15963

Authors: Muhammad Waseem, Nurbek Tastan, Andrej Jovanovic, Nicholas D. Lane, Nils Lukas, Karthik Nandakumar, Samuel Horvath

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: enables privacy-preserving adaptation, large language models, fine-tuning of large, large language, privacy-preserving adaptation

Abstract:Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.
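
The segment-wise aggregation rule in (i) can be illustrated on a single adapter factor. The sketch below treats each rank index as its own segment and represents each client's factor as a list of row vectors; the function name and data layout are hypothetical, and a real federated LoRA system would handle both factor matrices and client weighting.

```python
from typing import List

Matrix = List[List[float]]

def segment_wise_average(client_rows: List[Matrix]) -> Matrix:
    """Average each rank row only over the clients that actually train it.

    client_rows[c][r] is client c's row for rank index r. A rank-1 client
    contributes to row 0 only, so it does not dilute row 1 the way
    zero-padding followed by a plain average would.
    """
    max_rank = max(len(rows) for rows in client_rows)
    aggregated: Matrix = []
    for r in range(max_rank):
        contributors = [rows[r] for rows in client_rows if len(rows) > r]
        dim = len(contributors[0])
        aggregated.append(
            [sum(vec[j] for vec in contributors) / len(contributors) for j in range(dim)]
        )
    return aggregated

low_rank_client = [[2.0, 4.0]]                # rank 1
high_rank_client = [[4.0, 8.0], [6.0, 6.0]]   # rank 2
result = segment_wise_average([low_rank_client, high_rank_client])
print(result)
```

With zero-padding and a plain mean, the second row would shrink to `[3.0, 3.0]`; averaging over contributors keeps it at the high-rank client's value.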

102. 【2606.15949】FinBalance: A Multi-Document Accounting Reconciliation Benchmark

Link: https://arxiv.org/abs/2606.15949

Authors: Sasank Tumpati, Devansh Agarwal, Ayush Kedia, Arjun Neekhra, Murari Mandal, Krishna Garg, Yash Sinha, Suman Gupta, Dhruv Kumar

Categories: Computation and Language (cs.CL)

Keywords: Existing financial-NLP benchmarks, evaluate prepared artifacts, Existing financial-NLP, evaluate prepared, prepared artifacts

Comments: 18 pages, 12 figures. Code and data: [this https URL](https://github.com/Devansh1105/finbalance)

Abstract: Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a balance sheet, and checked for contradictions. We introduce FinBalance, a multi-document accounting reconciliation benchmark built from source-document bundles across eight industries, three period types, and five difficulty levels. Human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates are composed by a deterministic generator whose ledger produces journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, six contemporary LLMs reach at most 46% exact final-balance-sheet accuracy. Four models show a 26-41 pp gap between BS_exact, the model's reported balance sheet, and BS_recon, the balance sheet obtained by replaying its entries through our ledger. Models often recover numerically plausible entries but fail to bind them to supporting documents and aggregate them consistently. Citation-pressure prompting barely changes document-linking errors, while ledger-feedback ablations substantially improve reported balance sheets and expose inconsistency-detection trade-offs. Expert finance reviewers validate the benchmark design and labels.
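
The gap between a reported balance sheet and one obtained by replaying journal entries can be made concrete with a tiny double-entry ledger. The entry format `(account, debit, credit)`, the account names, and the amounts are hypothetical; the benchmark's ledger is certainly richer than this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = List[Tuple[str, int, int]]  # lines of (account, debit, credit), in cents

def replay(entries: List[Entry]) -> Dict[str, int]:
    """Replay journal entries into net balances (debits minus credits)."""
    balances: Dict[str, int] = defaultdict(int)
    for entry in entries:
        debits = sum(debit for _, debit, _ in entry)
        credits = sum(credit for _, _, credit in entry)
        if debits != credits:
            raise ValueError("unbalanced journal entry")
        for account, debit, credit in entry:
            balances[account] += debit - credit
    return dict(balances)

entries = [
    [("Cash", 100_000, 0), ("Equity", 0, 100_000)],   # owner investment
    [("Inventory", 30_000, 0), ("Cash", 0, 30_000)],  # inventory purchase
]
reconstructed = replay(entries)

# A model may report a sheet that disagrees with its own entries.
reported = {"Cash": 100_000, "Inventory": 30_000, "Equity": -100_000}
print(reconstructed, reported == reconstructed)
```

Here the entries themselves are fine, but the reported sheet forgot to reduce cash, which is the kind of aggregation inconsistency the BS_exact versus BS_recon comparison is meant to surface.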

103. 【2606.15932】Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Link: https://arxiv.org/abs/2606.15932

Authors: Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng

Categories: Computation and Language (cs.CL)

Keywords: real programming tasks, vector drawings, substantially advanced, interactive states, LLMs have substantially

Comments: Work completed in January 2026. Updating now

Abstract:While LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, geometry, data semantics, editability, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, execute, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move multimodal code generation from single-output imitation toward evidence-grounded executable systems.

104. 【2606.15914】Contaminated Collaboration: Measuring Gender Bias Transfer in LLM-Assisted Student Writing

Link: https://arxiv.org/abs/2606.15914

Authors: Ariyan Hossain, Kazi Kamruzzaman Rabbi, Farig Sadeque, S M Taiabul Haque

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: amplify stereotyped generations, model outputs, stereotyped generations, studied extensively, extensively in model

Comments: 18 pages, 7 pages

Abstract:Gender bias in LLMs has been studied extensively in model outputs, with biased prompts shown to amplify stereotyped generations. Whether such bias propagates into text produced by humans who use these systems, however, remains underexplored. We investigate whether gender bias in an LLM writing assistant transfers into career plan essays written by students. We first verify that a gender-biased prompt induces gender-differentiated language in LLM-generated essays, while a neutral prompt does not. We then recruited participants (N = 123) in a controlled environment to write career plan essays for paired biographical profiles differing only in gender under three conditions: no AI assistance, neutral LLM assistance, or gender-biased LLM assistance. Students in the biased condition produced essays with a significantly larger agentic gap and more gender-stereotypic occupation suggestions than those in the control and neutral conditions. Our results also reveal that this bias transfer is asymmetric: agency is suppressed in female-target essays while male-target writing remains largely unaffected. Our findings highlight the risk of bias propagation in AI-assisted writing, calling for fairness-aware design in educational AI tools.

105. 【2606.15911】Interactor: Agentic RL oriented Iterative Creation for Ad Description Generation in Sponsored Search

Link: https://arxiv.org/abs/2606.15911

Authors: Penghui Wei, Jiayu Wu, Chao Ye, Zhi Guo, Shuanglong Li, Lin Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: automatically generating informative, paper focuses, focuses on automatically, sponsored search, descriptions

Abstract: This paper focuses on automatically generating informative ad descriptions in sponsored search. Unlike ad titles, which are usually optimized to attract user click feedback, ad descriptions have a longer text span and possess the potential of incorporating world knowledge to address user search intents while presenting the fine-grained selling points of the ads. We propose Interactor, a multi-turn iterative creation framework optimized with agentic RL for ad description generation. The generation model acts as a policy that interacts with a customized environment consisting of multiple generative reward models (GenRMs). Given initial generations by the policy, the customized GenRMs evaluate multi-dimensional qualities including knowledge capacity and landing page consistency, providing both binary signals and reasoning feedback. The policy then iteratively refines the descriptions based on this feedback to ensure continuous improvement. Experiments on industrial datasets show that the Interactor framework significantly outperforms state-of-the-art approaches in generating knowledge-rich and faithful ad descriptions. Since May 2026, it has been deployed online in a leading search ads system, contributing to both ad revenue and user experience.

106. 【2606.15910】Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models

Link: https://arxiv.org/abs/2606.15910

Authors: Reza Khanmohammadi, Kundan Thind, Mohammad M. Ghassemi

Categories: Computation and Language (cs.CL)

Keywords: medical image fluently, language priors, vision-language model, image fluently, confidence

Abstract:A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer looks trustworthy and is not, and the only protection is a confidence score reliable enough to tell the system when to abstain. We ask a deployment question rather than an accuracy one: how much imaging work a model can safely handle alone, and which confidence signal makes that possible. We evaluate seven confidence estimators across five open-weight LVLMs and three medical visual-question-answering datasets spanning broad clinical imaging, radiology, and pathology, with every probe trained only on natural images and applied without adaptation. Recast as bounded selective prediction (automate a case only when confidence clears a threshold, defer the rest), the comparison is cautionary. The standard metrics are poor guides: discrimination barely separates the methods, and the weak calibration of a cheap self-report is cheaply removed by off-domain temperature scaling without changing deployable yield. What distinguishes a usable estimator is the high-confidence region a clinician acts on: the weakest baselines are confidently wrong on 41 to 45 percent of their errors against 1 to 4 percent for the best probe, and no estimator is reliably best across domains or models. Safe handoff is governed at two levels: base-model competence sets a ceiling, so a well-calibrated score recovers roughly a third of radiology cases at a 20 percent error tolerance but almost none of pathology; the confidence layer then decides how much of that ceiling is reachable. The usable role today is calibrated triage, not autonomy: automate the cases a calibrated score marks safe, route the rest to a clinician. We release all outputs, correctness judgments, and confidence scores, with code.
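
The bounded selective prediction framing (automate only above a confidence threshold, defer the rest) can be sketched as a coverage computation. The function name and toy data are hypothetical, and the sketch assumes distinct confidence values so that each prefix of the ranking corresponds to a threshold.

```python
from typing import List

def coverage_at_tolerance(
    confidences: List[float], correct: List[bool], max_error: float
) -> float:
    """Largest fraction of cases that can be automated within an error budget.

    Cases are taken in order of decreasing confidence; the result is the
    biggest prefix whose error rate stays at or below max_error, divided
    by the total number of cases. Everything else is deferred to a clinician.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    best = 0
    errors = 0
    for n, (_, is_correct) in enumerate(ranked, start=1):
        errors += 0 if is_correct else 1
        if errors / n <= max_error:
            best = n
    return best / len(ranked)

confidences = [0.9, 0.8, 0.7, 0.6, 0.5]
correct = [True, True, False, True, False]
strict = coverage_at_tolerance(confidences, correct, max_error=0.20)
loose = coverage_at_tolerance(confidences, correct, max_error=0.25)
print(strict, loose)
```

The same model yields very different automatable fractions depending on the tolerance and on how well confidence tracks correctness, which is the two-level picture the abstract describes.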

107. 【2606.15906】MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

Link: https://arxiv.org/abs/2606.15906

Authors: Yilong Zuo, Xunkai Li, Jing Yuan, Qiangqiang Dai, Hongchao Qin, Ronghua Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Multimedia (cs.MM)

Keywords: question answering requires, multimodal question answering, locate sparse evidence, Long-document multimodal question, evidence

Comments

Click to view abstract

Abstract:Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at this https URL.

108. 【2606.15903】Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

Link: https://arxiv.org/abs/2606.15903

Authors: Dongxu Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: retrieves stored facts, agent memory pipeline, inscribe-time LLM recovers, LLM recovers canonicalization, extensively benchmarked

Comments: 23 pages including appendices. Code, benchmark, and adapters released under MIT at [this https URL](https://github.com/deeplethe/lethe)

Click to view abstract

Abstract:Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.

109. 【2606.15893】BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Link: https://arxiv.org/abs/2606.15893

Authors: Ning Li, Zixuan Guo, Yan Xu, Wenbo Fei, Yifan Niu, Chang Luo, Yasheng Wang, Weiwen Liu, Yong Yu, Weinan Zhang

Categories: Computation and Language (cs.CL)

Keywords: deploying large language, large language models, hallucination mitigation, provided evidence, remain a major

Comments

Click to view abstract

Abstract:Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO's advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model--benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness--informativeness trade-off.

110. 【2606.15884】Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

Link: https://arxiv.org/abs/2606.15884

Authors: Eri Onami, Youmi Ma, Shuhei Kurita, Naoaki Okazaki

Categories: Computation and Language (cs.CL)

Keywords: applied domain tasks, reasoning in LLMs, presented a neuron-level, neuron-level analysis, analysis of legal-domain

Comments

Click to view abstract

Abstract:We presented a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we confirmed that suppressing the identified neurons collapses accuracy on the target task, whereas suppressing the same number of random neurons does not. We further found a small subset of neurons influential across all seven tasks; once these are removed, suppressing the remaining neurons degrades only the task they were identified from, revealing genuinely task-specific neurons in every model studied. Within the legal domain, the three benchmarks exhibit relatively high neuron overlap and tend to be affected jointly, suggesting the existence of legal-component neurons that span jurisdictions. The distribution of identified neurons in our experiments suggests that the hypothesis that influential neurons are concentrated in middle MLP layers may depend on the input format and content, rather than being a universal phenomenon.

111. 【2606.15883】Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Link: https://arxiv.org/abs/2606.15883

Authors: Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: downstream NLP applications, challenging downstream NLP, modified Perso-Arabic script, frequently omits diacritic, omits diacritic marks

Comments

Click to view abstract

Abstract:Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized-diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. Experimental results on a held-out test set achieve a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.

112. 【2606.15877】Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

Link: https://arxiv.org/abs/2606.15877

Authors: Alex Bogdan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: improves large language, large language models', language models' performance, improves large, large language

Comments: 64 pages, 6 figures

Click to view abstract

Abstract:Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects are documented; what has been missing is a principled account of which property decides the outcome. We argue it is meta-uncertainty: how unsure the model is about the reliability of its own evidence. When that uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. We prove that the policy minimizing expected free energy under uncertain precision stops integrating cues after a finite number of high-validity ones when the precision prior is heavy-tailed (Theorem 2.6.1), and under a Descending Dominance condition, is sample-wise identical to take-the-best (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are, then, two descriptions of the same computation. The prediction is that on high-meta-uncertainty items, longer CoT should degrade accuracy. We score the regime per item (simulate-and-recover rho 0.96), build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models (five open-weight 3B-32B, two frontier), five CoT lengths, and 7,875 responses. The gate, fixed before any data, required a negative interaction with posterior probability above 0.95 and an accuracy drop of more than 6 points. It held. The high-regime drop is 17.3 points (95% CI [7.7, 25.5]); matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, absent-to-reversed in the weakest. The framework answers when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not against Bayesian cognition.
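
For readers unfamiliar with take-the-best, the fast-and-frugal heuristic the abstract refers to, here is a minimal sketch of the classic pairwise version: cues are checked in descending order of validity and the first cue that discriminates decides. It illustrates the heuristic only, not the paper's free-energy derivation or its theorems; the function name and the binary cue encoding are assumptions made for this example.

```python
from typing import Optional, Sequence


def take_the_best(
    cues_a: Sequence[int], cues_b: Sequence[int], validities: Sequence[float]
) -> Optional[str]:
    """Choose between options "a" and "b" using the first discriminating cue.

    cues_a / cues_b hold binary cue values (1 = cue present, 0 = absent);
    validities holds one validity score per cue. Hypothetical encoding.
    """
    # Visit cues from highest to lowest validity.
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if cues_a[i] != cues_b[i]:
            # Stop at the first cue that separates the options; ignore the rest.
            return "a" if cues_a[i] > cues_b[i] else "b"
    # No cue discriminates, so the heuristic has no basis for a choice.
    return None


if __name__ == "__main__":
    print(take_the_best([1, 0, 1], [1, 1, 0], validities=[0.9, 0.6, 0.8]))
```

In the example call, the most valid cue does not discriminate, so the decision falls to the second most valid cue and the lowest-validity cue is never consulted.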

113. 【2606.15872】SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Link: https://arxiv.org/abs/2606.15872

Authors: Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

Categories: Computation and Language (cs.CL)

Keywords: systems fall short, large language models, commercial systems fall, scientific reasoning remains, expert-level performance

Comments

Click to view abstract

Abstract:Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach, producing diverse orchestration trajectories, extracting per-node single-turn samples, and optimizing the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.

114. 【2606.15833】When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

Link: https://arxiv.org/abs/2606.15833

Authors: Yongqi Kang, Yu Fu, Yong Zhao

Categories: Computation and Language (cs.CL)

Keywords: Incomplete Knowledge Graph, Graph Question Answering, Knowledge Graph Question, requires completing missing, Incomplete Knowledge

Comments

Click to view abstract

Abstract:Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness -- separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges -- correct or not -- are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from "is the edge correct?" to "admit or abstain under provenance uncertainty?" Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges -- at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.

115. 【2606.15821】The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

Link: https://arxiv.org/abs/2606.15821

Authors: Miso Choi, Seonga Choi, Mincheol Kwon, Woosung Joung, Jinkyu Kim, Jungbeom Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: share common foundational, Recent advances, common foundational LLMs, forming distinct model, large language models

Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at this https URL.

116. 【2606.15815】On Defining Erasure Harms for NLP

Link: https://arxiv.org/abs/2606.15815

Authors: Yu Lu Liu, Arnav Goel, Jackie Chi Kit Cheung, Alexandra Olteanu, Ziang Xiao, Su Lin Blodgett

Categories: Computation and Language (cs.CL)

Keywords: including representational harms, deployment of NLP, NLP systems, including representational, systems has raised

Comments

Click to view abstract

Abstract:The deployment of NLP systems has raised concerns about harms they might produce, including representational harms. Recent literature has begun to conceptualize and measure one such harm, the harm of erasure. Nevertheless, the field lacks a clear and cohesive conceptual foundation for identifying and measuring erasure. Existing conceptualizations of erasure are often broad -- making it difficult to identify what is needed to establish and measure erasure -- or else specific to particular settings -- facilitating measurement for those settings but potentially challenging to adapt to other settings. To address this gap, we develop and propose a structured definition of erasure that clarifies what components are necessary for establishing whether erasure has occurred, which practitioners need to explicitly articulate and operationalize in order to measure erasure.

117. 【2606.15783】da704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

Link: https://arxiv.org/abs/2606.15783

Authors: Tai Tran Tan, An Dinh Thien

Categories: Computation and Language (cs.CL)

Keywords: Narrative Story Similarity, Narrative Representation Learning, Story Similarity, Representation Learning, Narrative Story

Comments

Click to view abstract

Abstract:We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: this https URL.

118. 【2606.15778】DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

Link: https://arxiv.org/abs/2606.15778

Authors: Ali Sarabadani, Mahtab Tajvidiyan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: Large Language Models, Large Language, Language Models, struggle to incorporate, costly retraining

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) struggle to incorporate new knowledge without forgetting or costly retraining. We propose DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph where events are nodes and temporal relations are directed, timestamped edges. The graph serves as an external, updatable memory. At query time, DYNA retrieves relevant nodes via random walks and centrality measures, then augments the LLM's response. Evaluated on three temporal recall tasks, DYNA reduces catastrophic forgetting by ~7% compared to fine-tuning and improves temporal ordering by ~5% over standard RAG. Higher graph clustering coefficients correlate with better retrieval, showing that graph structure matters. Contributions: (1) episodic memory as temporal KG, (2) retraining-free LLM augmentation, (3) graph properties as predictors of retrieval performance.

119. 【2606.15770】da704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

Link: https://arxiv.org/abs/2606.15770

Authors: Tai Tran Tan, An Dinh Thien

Categories: Computation and Language (cs.CL)

Keywords: English question-answer pairs, question-answer pairs extracted, strategies in English, English question-answer, political evasion strategies

Comments

Click to view abstract

Abstract:This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two distinct paradigms: (1) Parameter-Efficient Fine-Tuning of Qwen3 models (4B-32B) using QLoRA, enhanced with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting of reasoning-capable API models, namely DeepSeek-V3.2 and Grok-4-Fast. Our evaluation demonstrates that structured CoT prompting of reasoning-enabled models substantially outperforms our baseline parameter-efficient fine-tuning implementation in absolute Macro F1. Our best system, Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieves a Macro F1 of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity), ranking 8th out of 33 teams on Subtask 2 and 13th out of 41 teams on Subtask 1 on the official leaderboard. Furthermore, our ablation studies reveal key insights into effective prompt design for evasion detection: presenting labels within a hierarchical taxonomy helps structure model reasoning, while few-shot exemplars provide task calibration. However, the strongest prompt variants are not statistically distinguishable in Macro F1, and explicitly enabling extended reasoning modes yields substantial performance gains by facilitating the multi-step pragmatic analysis required to detect evasive intent.

120. 【2606.15741】A Self Consistency Based Reranking for Narrative Question Answering

Link: https://arxiv.org/abs/2606.15741

Authors: Molham Mohamed, Ali Hamdi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Narrative question answering, long textual contexts, understand long textual, natural language processing, Narrative question

Comments

Click to view abstract

Abstract:Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers. To address this limitation, we propose a self-ensemble, self-consistency-based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection without requiring modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34%) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57%), highlighting the effectiveness of the proposed strategy.
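
The consensus-based selection step described in the abstract can be sketched roughly as below: each candidate answer is scored by its average similarity to the other candidates, and the one that agrees most with the rest is kept. The abstract does not specify the similarity measure, so token-level Jaccard overlap is used here purely as a stand-in assumption for semantic agreement, and both function names are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def rerank_by_agreement(candidates: List[str]) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate answer")

    def agreement(i: int) -> float:
        others = [jaccard(candidates[i], c) for j, c in enumerate(candidates) if j != i]
        return sum(others) / len(others) if others else 0.0

    # max() keeps the earliest candidate when agreement scores tie.
    best = max(range(len(candidates)), key=agreement)
    return candidates[best]


if __name__ == "__main__":
    sampled = ["he fled to Paris", "he fled to Paris", "she stayed home"]
    print(rerank_by_agreement(sampled))
```

In practice the candidates would come from sampling the generator several times for the same story-question pair; an embedding-based similarity could replace `jaccard` without changing the selection logic.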

121. 【2606.15735】EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

Link: https://arxiv.org/abs/2606.15735

Authors: Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: crucial clinical documents, patient readmission, ongoing care, hospital stay, diagnostic decision-making

Comments

Click to view abstract

Abstract:Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

122. 【2606.15734】Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

Link: https://arxiv.org/abs/2606.15734

Authors: Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Continual post-training enables, potentially causing catastrophic, post-training enables models, repeatedly updating shared, causing catastrophic forgetting

Comments

Click to view abstract

Abstract:Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.

123. 【2606.15733】Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Link: https://arxiv.org/abs/2606.15733

Authors: Zhenyu Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Instruction-tuned language models, causal-reasoning question differently, Instruction-tuned language, English variable, structural causal model

Comments

Click to view abstract

Abstract:Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

124. 【2606.15714】Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.15714

Authors: Hanyang Chen, Hongliang Li, Jiarui Cao, Yang Li, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

Categories: Computation and Language (cs.CL); Robotics (cs.RO)

Keywords: large-scale multimodal data, recently demonstrated promising, learning generalist robot, generalist robot policies, demonstrated promising capabilities

Comments

Click to view abstract

Abstract:Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.

125. 【2606.15696】Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

Link: https://arxiv.org/abs/2606.15696

Authors: Jason M Pittman, Yesenia Medina-Santos, Anton Phillips Jr., Brielle C. Stark

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Correct Information Units, Correct Information, Information Units, quantify communicative informativeness, quantify communicative

Comments: 5 tables, 4 figures

Abstract:Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.
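The evaluation metrics named in this abstract can be illustrated with a minimal sketch. The label vectors below are hypothetical toy data, not the study's annotations, and `binary_metrics` is an illustrative helper name; the sketch only shows how precision, recall, F1, and Cohen's kappa relate when a model over-labels tokens as CIUs.

```python
def binary_metrics(human, model):
    """Compare binary token labels (1 = CIU, 0 = non-CIU) against human labels."""
    n = len(human)
    tp = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)
    tn = sum(1 for h, m in zip(human, model) if h == 0 and m == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = (tp + tn) / n
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical toy labels: the model marks almost every token as a CIU.
human = [1, 1, 0, 0, 1, 0, 1, 1]
model = [1, 1, 1, 0, 1, 1, 1, 1]
metrics = binary_metrics(human, model)
print(metrics)
```

In this toy case recall is perfect while precision and kappa are noticeably lower, which is the pattern the abstract describes as systematic over-classification.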

126. 【2606.15652】MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization

Link: https://arxiv.org/abs/2606.15652

Authors: Yangjia Hu, Haodong Wang, Zicong Hong, Qianli Liu, Quanxin Shou, Jian Lin, Song Guo, Xiaowei Shen, Xiangjun Huang, Dian Wang, Jian Yang

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, quantization significantly reduces, language models, significantly reduces, reduces the memory

Comments: 17 pages

Abstract:4-bit quantization significantly reduces the memory footprint and accelerates the inference of large language models (LLMs). However, its limited bit-width representation struggles to faithfully capture both dense common values (inliers) and rare large-magnitude values (outliers), causing substantial accuracy degradation. Existing mixed-precision methods mitigate this by retaining outliers in high precision, but at the cost of breaking the uniformity of low-bit execution, introducing precision conversion and extra data movement that undermine practical speedup. We propose MosaicQuant, a unified 4-bit LLM quantization paradigm built on a novel principle of inlier-outlier disaggregation. Rather than elevating outlier precision, MosaicQuant quantizes the full weight matrix into a dense 4-bit base component, where inliers are captured faithfully while outliers are inevitably quantized. A sparse 4-bit residual component is then introduced to compensate for these quantization errors, selectively targeting the most error-critical weight blocks where output distortion is shown to be concentrated. However, a unified representation alone is insufficient, as naïvely executing the sparse residual as a separate kernel still breaks the unified low-bit inference pipeline. To bridge this gap, we introduce ZipperEngine, which fuses sparse block computation into the dense 4-bit GEMM kernel via an overlapped pipeline, unifying not only the representation but also the execution into a single coherent low-bit inference pipeline. Extensive experiments on LLaMA3 and Qwen3 demonstrate that MosaicQuant preserves near-FP16 accuracy while achieving up to $1.24\times$ speedup over the W16A16 baseline.

127. 【2606.15643】Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

Link: https://arxiv.org/abs/2606.15643

Authors: Gili Lior, Tzviel Frostig, Gabriel Stanovsky, Matan Eyal

Subjects: Computation and Language (cs.CL)

Keywords: exhaustive evaluation scales, evaluation scales linearly, items conflate general, automatic translation introduces, Multilingual benchmarks

Abstract:Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals. Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline, surfacing candidate translation errors distributed across all 28 non-English languages, whereas accuracy-based baselines concentrate detections in a few languages, and recovering culture-specific items that accuracy-based baselines miss.

128. 【2606.15641】Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

Link: https://arxiv.org/abs/2606.15641

Authors: Guy Rotman, Adi Kopilov, Danit Berger Zalmanson, Omri Allouche

Subjects: Computation and Language (cs.CL)

Keywords: In-context learning, remains largely unexplored, specialized domains remains, domains remains largely, largely unexplored

Comments: Accepted for publication in Findings of the Association for Computational Linguistics 2026

Abstract:In-context learning (ICL) is the standard method for low-resource classification, yet its efficacy in specialized domains remains largely unexplored. We address the challenge of classifying semantically complex, multi-party B2B conversations, where traditional ICL encounters significant limitations, especially as context length increases due to the concatenation of multiple few-shot examples. We introduce the Call Playbook dataset, featuring five classification tasks derived from real-world B2B conversations targeting core sales concepts. To bridge the gap between performance and practical utility, we propose novel knowledge extraction methods that distill verbose examples into compact, interpretable representations of structured classification criteria and precise task descriptions. Our approach achieves a 99% reduction in token usage and improves macro-averaged AUC by up to 7% over traditional ICL. Notably, it remains robust as context grows, unlike advanced token compression baselines which degrade by over 9 F1 points. Importantly, our framework enables direct refinement of classification logic, addressing critical needs for transparency, efficiency, and user interaction in real-world NLP applications.

129. 【2606.15621】Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation

Link: https://arxiv.org/abs/2606.15621

Authors: Nils Matteson

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: language-model rollout caused, Per-token counterfactual credit, substitute an alternative, compare outcomes, Per-token counterfactual

Comments: 10 pages, 3 figures. Code, per-pivot data, logs, and registration: [this https URL](https://github.com/thaw-ai/thaw) (benchmarks/, paper/refeed-drift/)

Abstract:Per-token counterfactual credit estimation asks which token in a language-model rollout caused the final answer to be right or wrong: cut the transcript at a pivot, substitute an alternative token, replay continuations, and compare outcomes. Published methods re-feed the transcript prefix as a fresh prompt, assuming this reproduces the state the model passed through during generation. We measure what that assumption costs on a stock inference engine, with a three-pass design: continuations resumed from the verified decode-time KV state, an identical second exact pass (a replica noise floor), and a re-feed pass. Across six configurations and three models (including a GRPO-trained checkpoint), at low-margin decision tokens, re-feeding changes the credit estimate at rates 14-28 percentage points above the replica floor (7-21pp under a treatment-independent conditioning; problem-clustered t = 2.9-6.4). Most changes are zero-boundary crossings of the quantized estimator rather than polarity reversals, and the perturbation is consistent with mean-zero, so averaged quantities are largely safe; but selection is not: a critical-token set chosen by thresholding $|\hat{A}_t|$ under re-feed overlaps the exact-resume selection at Jaccard 0.34-0.90, versus a 0.63-0.96 replica ceiling. A causal confirmation closes the loop: under vLLM's batch-invariant kernels all three passes are identical on every measured channel, with both disagreement rates exactly zero. Replica passes themselves disagree on 9-23% of eligible estimates: single-sample credit measurements at decision tokens are unreliable under any replay. Settings were fixed in advance; exact-pass cache hits in the second campaign are instrumented (100% hit rate, 3,434 pivots); total compute was under 10 USD. We recommend that counterfactual credit studies resume decoder state or use batch-invariant kernels, and report a replica floor.
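The selection-overlap measure mentioned in this abstract (Jaccard overlap between critical-token sets chosen by thresholding the absolute credit estimate) can be sketched as follows. The credit values, the threshold, and the helper names `select_critical` and `jaccard` are hypothetical illustrations, not the paper's data or code.

```python
def select_critical(credits, threshold):
    """Return token positions whose absolute credit estimate meets the threshold."""
    return {pos for pos, value in credits.items() if abs(value) >= threshold}


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical per-token credit estimates from two replay modes.
exact_credit = {0: 0.6, 1: -0.1, 2: 0.5, 3: 0.0, 4: -0.7}
refeed_credit = {0: 0.6, 1: -0.5, 2: 0.1, 3: 0.0, 4: -0.7}

exact_set = select_critical(exact_credit, threshold=0.4)
refeed_set = select_critical(refeed_credit, threshold=0.4)
overlap = jaccard(exact_set, refeed_set)
print(exact_set, refeed_set, overlap)
```

In this toy example the two perturbed estimates cross the threshold in opposite directions, so the selected sets diverge even though the perturbations roughly cancel on average, mirroring the abstract's point that averaged quantities can be safe while selection is not.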

130. 【2606.15610】LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

Link: https://arxiv.org/abs/2606.15610

Authors: Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda

Subjects: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: open-ended model evaluation, human preference annotation, model evaluation, annotation is costly, difficult to reproduce

Comments: 22 pages, 4 figures

Abstract:LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

131. 【2606.15591】Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

Link: https://arxiv.org/abs/2606.15591

Authors: Tirthankar Mittra

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Math Word Problem, high-quality Physics Word, Physics Word Problems, Generating high-quality Physics, Word Problems

Abstract:Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.

132. 【2606.15566】LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

Link: https://arxiv.org/abs/2606.15566

Authors: Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: social science, coding is central, central to social, expert annotation, Qualitative coding

Comments: 9 pages, 4 figures; Code and data: [this https URL](https://github.com/EyupEK/autoresearch_bayes)

Abstract:Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $\alpha$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $\alpha$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
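The combined reliability scores quoted in this abstract are described as harmonic means of ICC and $\alpha$. A short check, using only the figures reported above, reproduces both values after rounding; the helper name `harmonic_mean` is ours.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive scores."""
    return 2 * a * b / (a + b)


held_out = harmonic_mean(0.79, 0.74)  # ICC and alpha on the held-out set
deployed = harmonic_mean(0.80, 0.76)  # ICC and alpha on the deployed corpus
print(round(held_out, 2), round(deployed, 2))  # 0.76 0.78
```

The harmonic mean is pulled toward the lower of the two inputs, so a prompt cannot reach a high combined score by doing well on only one of the reliability measures.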

133. 【2606.15532】EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

Link: https://arxiv.org/abs/2606.15532

Authors: Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Rui Wang, Zequn Sun, Tao Ren, Weiyao Luo, Bingxue Qiu, Jieping Ye, Yongbin Li, Wei Hu

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, single-response dialogue generation, static understanding tasks, Language Models

Abstract:Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user's emotion, but also improve the user's emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator's per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.

134. 【2606.15521】Emergent retokenization symmetry in large language models: phenomenology and applications

Link: https://arxiv.org/abs/2606.15521

Authors: Kanishk Jain, Matthew Day, Tankut Can

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: introduces representational redundancy, fixed token vocabulary, valid token encodings, byte string admits, Tokenization introduces representational

Abstract:Tokenization introduces representational redundancy: under a fixed token vocabulary, every byte string admits many valid token encodings, or segmentations, that decode to the same surface string. However, given a prompt, most language model tokenizers break this representational symmetry by returning a canonical segmentation. Training only on canonical segmentations should influence inference behavior, and there is little reason to expect models to respect segmentation symmetry on downstream tasks. We find that this symmetry partially emerges during training. Here, we probe this emergent symmetry through experiments testing token compositional understanding, representation diversity, and task-focused benchmark performance. We primarily use retokenization -- replacing a prompt's canonical tokenization with an alternative segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs from the model using its next-token probability distribution, retokenization generates diversity from the model's internal computations through semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
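The redundancy this abstract describes, where one surface string admits several valid segmentations under a fixed vocabulary, can be shown with a small enumeration. This is a minimal sketch over characters rather than bytes, with a hypothetical toy vocabulary; it does not use any real tokenizer and is not the authors' procedure.

```python
def segmentations(text, vocab):
    """Enumerate every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary containing single characters and a few merges.
vocab = {"t", "o", "k", "e", "n", "to", "ken", "token"}
segs = segmentations("token", vocab)
for seg in segs:
    print(seg)
```

Every segmentation printed here concatenates back to the same string, so swapping one for another changes only the token sequence the model sees, which is the property that makes retokenization a clean perturbation.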

135. 【2606.15517】SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

Link: https://arxiv.org/abs/2606.15517

Authors: Viswonathan Manoranjan, Amogh Gupta, Anvesh Rao Vijjini, Thomas Hofweber, Snigdha Chaturvedi

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, Large language, Large, sensitive prompts, Abstract

Abstract:Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce SHARD, a self-reframing distillation method to improve safe-helpfulness. It first rewrites sensitive prompts to surface benign intent using philosophical guidelines, then reframes its original responses into safe, more helpful ones, and finally fine-tunes the model on its self-reframed responses. Across DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety. It also remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior elicited from their own. Warning: This paper contains content that may be offensive or harmful.

136. 【2606.15510】AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

Link: https://arxiv.org/abs/2606.15510

Authors: Nikolaos Lavidas, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, Evangelos Argyropoulos

Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: PROIEL Treebank Family, single PROIEL XML, Late Byzantine, Late Antique, Early Modern

Comments: 16 pages. Data paper for the v0.4 release of AthDGC. Concept DOI: [https://doi.org/10.5281/zenodo.20439182](https://doi.org/10.5281/zenodo.20439182) . Companion site: [this https URL](https://athdgc.github.io)

Abstract:AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: https://doi.org/10.5281/zenodo.20439182.

137. 【2606.15483】Evaluative Judgement in Teaching AI-based Translation: A Classroom Case Study of AI-Mediated Translation and Post-Editing

Link: https://arxiv.org/abs/2606.15483

Authors: Gokhan Dogru

Subjects: Computation and Language (cs.CL)

Keywords: fourth-year Machine Translation, BA-level translation programme, Machine Translation, elicit evaluative judgement, fourth-year Machine

Comments: Workshop on Teaching AI-based Translation and Technologies (TAITT 2026) - EAMT 2026

Abstract:Drawing on 23 anonymized student projects from a fourth-year Machine Translation and Post-editing course in a BA-level translation programme, this paper examines how structured comparison of general-purpose LLMs and online MT systems can elicit evaluative judgement in AI-mediated translation. Students translated short specialised English Wikipedia texts into Catalan or Spanish, generated four system outputs, evaluated them using automatic metrics and human adequacy/fluency assessment, selected one output for post-editing, and justified their decision in written reports. Descriptive counts are reported for all 23 projects, while qualitative interpretation is based on the 22 cases accompanied by written reports. Results show that students did not treat automatic metrics as final authority: final post-editing selections often diverged from metric rankings and were justified through adequacy, fluency, terminology, naturalness, and expected post-editing effort. The study therefore does not benchmark systems under controlled conditions; it analyses how students justified system choice within an authentic classroom assignment.

138. 【2606.15461】ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

Link: https://arxiv.org/abs/2606.15461

Authors: Pierre Dantas, Lucas Cordeiro, Waldir Junior

Subjects: Computation and Language (cs.CL); Hardware Architecture (cs.AR)

Keywords: PLCs execute safety-critical, execute safety-critical programs, industrial sectors, execute safety-critical, PLCs execute

Comments: 24 pages

Abstract:PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil graphics. This paper presents ESBMC-PLC, the first open-source formal verifier with native LD support (PLCopen XML format), implemented as a new ESBMC frontend. ESBMC-PLC translates LD rungs to GOTO IR, models the PLC scan cycle as a while(true) loop with nondeterministic inputs, and checks safety properties via SMT-based bounded model checking or k-induction. A five-property YAML language (mutual_exclusion, invariant, absence, response, reachability) avoids temporal logic. A survey of 22 studies (2020-2026) identifies four research gaps; ESBMC-PLC closes two of them. Evaluation on 13 benchmarks (6 domains, 3 sources - including deployed CONTROLLINO PLCs and MathWorks Simulink PLC Coder) shows correct classification across 61 properties: all 9 author-constructed programs (Categories A/B) as expected, all 4 vendor programs (Category C) correctly unlabeled, with 8 bugs found (actionable counterexamples), 7 unbounded k-induction proofs, all runs under 60ms on Apple Silicon. Feature comparison with PLCverif shows that ESBMC-PLC is the only open-source tool that combines native LD, k-induction, and SMT bit-vector semantics.

139. 【2606.15449】Transfer Learning for FHIR Questionnaire Terminology Binding

Link: https://arxiv.org/abs/2606.15449

Authors: Maxim Gorshkov

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: require FHIR Questionnaire, Electronic prior authorization, FHIR Questionnaire items, workflows require FHIR, Vinci CDS-Library lack

Abstract:Electronic prior authorization workflows require FHIR Questionnaire items to carry LOINC codes, yet most items in the HL7 Da Vinci CDS-Library lack these bindings. We treat this as a retrieval problem: given a Questionnaire item's text, find the correct LOINC code in a pool of 97,314 active codes. We compare six methods (TF-IDF, frozen MiniLM, BioBERT, BioLORD, contrastively fine-tuned MiniLM, and a TF-IDF+GPT reranker) on a 54-item evaluation set spanning three query styles (natural question, medium, and terse). No single method wins on every metric. BioLORD, a frozen encoder pre-trained on biomedical ontology definitions, has the best top-rank accuracy (R@1 = 0.185, MRR = 0.246) despite seeing no task-specific data, while a contrastive fine-tune on raw LHC-Forms pairs takes R@5 (0.389) and R@10 (0.426). A distribution-shift ablation shows why the fine-tune in our main table is not the strongest one: adding GPT-generated paraphrases to the raw pairs drops R@5 from 0.389 to 0.296, so the augmented union underperforms raw-only training on every metric except R@1. Performance peaks at 5k training pairs. Error analysis on BioLORD's R@1 failures shows that wrong-specificity and ambiguous-text cases together account for 59% of errors.
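The retrieval metrics reported in this abstract (R@k and MRR) follow standard definitions, sketched below. The ranked lists and code identifiers are hypothetical placeholders, not real LOINC codes or the paper's outputs, and the function names are ours.

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold code appears in the top-k retrieved codes."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold code; a query contributes 0 if it is not retrieved."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Hypothetical retrieval results for three queries (placeholder code IDs).
ranked_lists = [
    ["code-1", "code-2", "code-3"],
    ["code-4", "code-5", "code-6"],
    ["code-7", "code-8", "code-9"],
]
gold = ["code-1", "code-5", "code-0"]

r_at_1 = recall_at_k(ranked_lists, gold, k=1)
r_at_2 = recall_at_k(ranked_lists, gold, k=2)
mrr = mean_reciprocal_rank(ranked_lists, gold)
print(r_at_1, r_at_2, mrr)
```

Assuming each metric is computed over all 54 evaluation items, the reported R@1 of 0.185 would correspond to 10 items ranked correctly at the top (10/54 ≈ 0.185), which shows how coarse the metric is at this evaluation-set size.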

140. 【2606.15422】Pepti-Agent: An AI Agent for Peptide Design and Optimization

Link: https://arxiv.org/abs/2606.15422

Authors: Houxu Chen, Achuth Chandrasekhar, Amir Barati Farimani

Subjects: Computation and Language (cs.CL); Biomolecules (q-bio.BM)

Keywords: Therapeutic peptides occupy, development requires satisfying, nonspecific surface fouling, valuable design space, Therapeutic peptides

Abstract:Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.

141. 【2606.15419】Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Link: https://arxiv.org/abs/2606.15419

Authors: Zaifu Zhan, Shuang Zhou, Rui Zhang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: medical question answering, question answering, reasoning, Objective, peer-reviewed reasoning

Comments: Accepted by the Journal of the American Medical Informatics Association

Abstract:Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

DOI: https://doi.org/10.48550/arXiv.2606.15419 (arXiv-issued DOI via DataCite, pending registration)

Submission history: [v1] Sat, 13 Jun 2026 18:09:44 UTC (1,355 KB), from Zaifu Zhan

142. 【2606.15416】Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

Link: https://arxiv.org/abs/2606.15416

Authors: Guangyue Peng, Wei Li, Wen Luo, Houfeng Wang

Subjects: Computation and Language (cs.CL)

Keywords: involves detecting, usage of grammar, Grammatical Error Correction, detecting and correcting, correcting the wrong

Comments: 15 pages, 6 figures

Abstract:Grammatical Error Correction (GEC) involves detecting and correcting the wrong usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our $F_{0.5}$ scores surpass the baseline by up to a factor of 1.20. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.
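
Retrieving demonstrations by error pattern rather than by meaning can be pictured as nearest-neighbor search over error representations. The sketch below assumes each pool sentence already has a vector standing in for its GER; how the paper extracts GER from internal states is not shown here. The toy vectors, the cosine similarity measure, and the name `retrieve_demonstrations` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_demonstrations(query_vec, pool, k=2):
    """Return the k pool entries whose error vectors are closest to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


# Toy pool: the vectors are made-up stand-ins for error representations.
pool = [
    {"src": "She go to school.", "tgt": "She goes to school.", "vec": [0.9, 0.1, 0.0]},
    {"src": "I have a apple.", "tgt": "I have an apple.", "vec": [0.0, 1.0, 0.1]},
    {"src": "He walk home.", "tgt": "He walks home.", "vec": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for a subject-verb agreement error
top = retrieve_demonstrations(query_vec, pool, k=2)
print([d["src"] for d in top])
```

The two agreement-error examples rank first even though the three sentences share no topic, which is the behavior a semantically neutral error encoding is meant to provide.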

143. 【2606.15412】Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

Link: https://arxiv.org/abs/2606.15412

Authors: Jakob Mraz, Tomaž Curk, Blaž Zupan

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: transforming biomedical literature, Biomedical relation extraction, transforming biomedical, biomedical literature, structured knowledge

Comments: none

Abstract:Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.
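
The difference between micro-F1 and macro-F1 in an imbalanced setting is easy to see on a small example. The counts below are invented for illustration and are not taken from the paper; `micro_macro_f1` is a hypothetical helper name.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: {relation_type: (tp, fp, fn)} -> (micro_f1, macro_f1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pools all instances, so frequent types dominate
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # each type counts equally
    return micro, macro


# Toy counts: one frequent relation type handled well, one rare type handled poorly.
counts = {"frequent": (80, 20, 20), "rare": (1, 3, 3)}
micro, macro = micro_macro_f1(counts)
print(round(micro, 3), round(macro, 3))
```

Because macro-F1 weights every relation type equally, a system that does better on rare types can lead on macro-F1 while trailing on micro-F1, which is the pattern the abstract reports.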

144. 【2606.15405】T-Mem: Memory That Anticipates, Not Archives

Link: https://arxiv.org/abs/2606.15405

Authors: Weidong Guo, Dakai Wang, Zixuan Wang, Hui Liu, Yu Xu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: long-term conversational memory, sessions earlier, coherent across extended, commitments made, made many sessions

Comments: none

Abstract:Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). On this regime prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

145. 【2606.15396】CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

Link: https://arxiv.org/abs/2606.15396

Authors: Wenbo Yu, Bohua Wang, Hao Fang, Kuofeng Gao, Jingru Zeng, Xiaochen Yang, Tianyi Zhang, Xiaoxiao Ma, Jiawei Kong, Hao Wu, Bin Chen, Shu-Tao Xia, Min Zhang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Malicious content generated, large language models, pose severe safety, Malicious content, severe safety risks

Comments: none

Abstract:Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context and linguistic nuances, failing to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a 5-macro, 31-micro category fine-grained risk taxonomy for Chinese scenarios, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt engineering rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement of F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at this https URL.

146. 【2606.15390】Not All Skills Help: Measuring and Repairing Agent Knowledge

Link: https://arxiv.org/abs/2606.15390

Authors: Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: current systems entrust, LLM agents, accumulating natural-language skills, LLM judgment, LLM

Comments: 18 pages, 5 figures

Abstract:LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at this https URL.
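
Measuring a skill's causal contribution through randomized masking can be sketched as a simple difference in means: each trial randomly hides a subset of skills, and a skill's effect is estimated as the mean outcome with the skill present minus the mean outcome with it masked. This is only an illustration of the general idea, not ASSAY itself. The 0.5 masking probability, the function names, and the toy task with its fixed effect sizes are assumptions.

```python
import random
from statistics import mean


def estimate_skill_effects(skills, run_task, n_trials=2000, seed=0):
    """Estimate each skill's effect as mean(outcome | present) - mean(outcome | masked)."""
    rng = random.Random(seed)
    outcomes = {s: {True: [], False: []} for s in skills}
    for _ in range(n_trials):
        # Keep each skill independently with probability 0.5 (assumed masking rate).
        active = {s for s in skills if rng.random() < 0.5}
        result = run_task(active)
        for s in skills:
            outcomes[s][s in active].append(result)
    return {s: mean(outcomes[s][True]) - mean(outcomes[s][False]) for s in skills}


def toy_run_task(active):
    """Hypothetical stand-in for running an agent; returns a success score."""
    score = 0.5
    if "helpful" in active:
        score += 0.3
    if "harmful" in active:
        score -= 0.3
    return score


effects = estimate_skill_effects(["helpful", "harmful", "neutral"], toy_run_task)
print({s: round(e, 2) for s, e in effects.items()})
```

Running the same estimate separately per task type would surface the cancellation the abstract describes, where a skill helps on some task types and hurts on others while its aggregate effect looks near zero.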

147. 【2606.15378】Rethinking the Role of Efficient Attention in Hybrid Architectures

Link: https://arxiv.org/abs/2606.15378

Authors: Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Modern language models, recurrent sequence mixers, Modern language, language models increasingly, models increasingly adopt

Comments: 23 pages, 13 figures

Abstract:Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
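
For readers unfamiliar with sliding-window attention, the sketch below builds the boolean mask that separates it from full causal attention. It assumes a causal setting in which the window counts the current token; conventions differ between implementations, and `sliding_window_mask` is a hypothetical helper rather than anything from the paper.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal, and limited to the `window` most recent positions including i itself.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


swa = sliding_window_mask(seq_len=4, window=3)
full = sliding_window_mask(seq_len=4, window=4)  # window >= seq_len gives plain causal attention
for row in swa:
    print(["x" if allowed else "." for allowed in row])
```

In the SWA mask the last position can no longer see position 0, so in a hybrid model any retrieval from beyond the window has to be carried by the full-attention layers.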

148. 【2606.15367】S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

Link: https://arxiv.org/abs/2606.15367

Authors: Yao Dong, Xinglin Xiao, Liwei Dong, Xinlong Jin, Zhengbo Li, Heng Zhang, Duyun Wang, Nan Xu

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Deep research agents, Deep research, solve complex knowledge-intensive, research agents aim, research agents

Comments: none

Abstract:Deep research agents aim to solve complex knowledge-intensive tasks through long-horizon planning, evidence gathering, reasoning, and report generation. While recent progress in search agents has demonstrated strong capabilities in information retrieval and answer verification, most existing training datasets remain search-centric, focusing primarily on closed-ended question answering and information localization. As a result, they mainly train information-seeking behavior while providing limited coverage of key deep research capabilities, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation. In this work, we propose a unified trajectory construction paradigm for deep research agents that combines closed-ended QA and open-ended exploration. The proposed framework consists of graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification, enabling scalable synthesis of high-quality agentic trajectories spanning long-chain complex reasoning, deep research instruction following, report writing, file understanding and generation, and skills usage. Compared with existing search-oriented datasets, our synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning. S1-DeepResearch-32B achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions, including complex reasoning, instruction following, report generation, file understanding, and skills usage. On several challenging deep research benchmarks, it approaches the performance of leading proprietary frontier models. These results highlight the importance of jointly modeling information acquisition, knowledge synthesis, and planning-oriented agent behaviors for building effective deep research agents.

149. 【2606.15345】Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Link: https://arxiv.org/abs/2606.15345

Authors: Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: produce grounded answers, reason over retrieved, retrieved sources, increasingly evaluated, produce grounded

Comments: Preprint

Abstract:Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

150. 【2606.15335】Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

Link: https://arxiv.org/abs/2606.15335

Authors: Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: privacy leakage arises, agents exchange text, distributed agents exchange, vocabulary choices, organizational boundaries

Comments: none

Abstract: When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan (Disentangled Sanitization), a privacy-preserving sanitization framework and a built-in component of Intern-Shannon for multi-agent collaboration. DiSan uses a two-stream encoder to factorize text into a source-invariant role subspace that preserves task semantics and a source-identifying style subspace that remains local. Federated prototype alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by a factor of 20 while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.

151. 【2606.15333】Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

Link: https://arxiv.org/abs/2606.15333

Authors: Zirui Pang, Chenlong Zhang, Haosheng Tan, Zhuoran Jin, Jiaheng Wei, Zixin Zhong

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: preserving general utility, removing hazardous knowledge, LLM unlearning, general utility, cost-effective alternative

Comments: none

Abstract: LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as learning a refusal behavior, but their on-policy optimization repeatedly samples from the same forget and retain/boundary prompts throughout training. We identify a critical inefficiency in this process: easy cases quickly converge and provide little useful gradient signal, while hard cases near the forget/retain boundary continue to produce low-reward rollouts that are discarded after a single use. To address this issue, we propose ReRULE, an off-policy replay enhancement for reinforcement unlearning. ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, redirecting computation toward boundary cases that still require learning. Theoretically, we show that ReRULE yields a tighter hard-case convergence bound than pure on-policy RULE. Empirically, ReRULE improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5-11% training time across benchmarks. Its limited improvement on the simpler TOFU setting further supports the intended conditional behavior: replay is most beneficial when the hard/easy disparity is pronounced.
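
The replay mechanism described above (keep low-reward rollout groups, then reuse them later with an importance weight) can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the class name `HardCaseReplayBuffer`, the mean-reward threshold, the buffer capacity, and the clipped ratio are not taken from the paper.

```python
import math
from collections import deque


class HardCaseReplayBuffer:
    """Stores rollout groups whose mean reward falls below a threshold."""

    def __init__(self, capacity=1000, reward_threshold=0.5):
        self.groups = deque(maxlen=capacity)  # oldest groups are evicted first
        self.reward_threshold = reward_threshold

    def maybe_store(self, prompt, rollouts):
        """rollouts: list of dicts with 'reward' and 'logp_old' (behavior-policy log-prob)."""
        mean_reward = sum(r["reward"] for r in rollouts) / len(rollouts)
        if mean_reward < self.reward_threshold:
            self.groups.append({"prompt": prompt, "rollouts": rollouts})
            return True
        return False


def importance_weight(logp_new, logp_old, clip=2.0):
    """Clipped ratio pi_new / pi_old used to reweight a replayed rollout."""
    return min(math.exp(logp_new - logp_old), clip)


buffer = HardCaseReplayBuffer(reward_threshold=0.5)
stored_easy = buffer.maybe_store("easy prompt", [{"reward": 1.0, "logp_old": -2.0}] * 4)
stored_hard = buffer.maybe_store("boundary prompt", [{"reward": 0.1, "logp_old": -3.0}] * 4)
print(stored_easy, stored_hard, len(buffer.groups))
```

The importance weight corrects for the gap between the policy that generated a stored rollout and the current policy; the clip is one common way to keep that correction from producing very large updates.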

152. 【2606.15325】Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

Link: https://arxiv.org/abs/2606.15325

Authors: Rong Wang, Kun Sun

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, English learning, Large language, priors from pretraining, increasingly deployed

Comments: 12 pages, 2 figures

Abstract: Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining. This assumption is tested on 1,800 L2-Arctic utterances spanning six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to numeric acoustic features and raw audio. Each (utterance × model × condition × dimension) cell is scored on three metrics: Rating Accuracy (RA) against gold labels, Evidence Coherence (EC) assessing internal consistency without ground truth, and Grounded Correctness (GC) evaluated against gold evidence. Results show three findings across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all six L1 backgrounds and all evidence conditions. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range raises pitch-variation grounding from (0.18-0.19) to (0.45-0.62) across all three models, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The same audio waveform without textualised F0 values does not reproduce this improvement. These findings indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.

153. 【2606.15307】Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Link: https://arxiv.org/abs/2606.15307

Authors: Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: convey harmful intent, propagandistic memes exploit, Relative Policy Optimization, Group Relative Policy, exploit the interplay

Comments: none

Abstract:Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

154. 【2606.15300】CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

Link: https://arxiv.org/abs/2606.15300

Authors: Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: autonomous engineers, creating a growing, increasingly demonstrating, demonstrating the potential, potential to operate

Comments: Accepted at ICML 2026. 37 pages, 11 figures. Project page: [this https URL](https://coda-bench.github.io/) Code: [this https URL](https://github.com/ruc-datalab/CoDA-Bench) Data: [this https URL](https://huggingface.co/datasets/RUC-DataLab/CoDA-Bench)

Abstract:Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

155. 【2606.15266】Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

Link: https://arxiv.org/abs/2606.15266

Authors: Yuchen Song, Xi Chen, Mingze Li, Satoshi Nakamura

Subjects: Computation and Language (cs.CL)

Keywords: achieved impressive progress, speech naturalness, achieved impressive, impressive progress, progress in semantic

Comments: Accepted to Interspeech 2026

Abstract:Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains heavily underexplored, compounded by a lack of reliable automatic evaluation metrics for tonal languages like Chinese. We investigate English-to-Chinese S2ST stress transfer by constructing a stress-annotated Chinese dataset and an XLS-R-based Mandarin stress detector. Integrating this with the English EmphAssess system, we propose a novel objective metric for cross-lingual stress evaluation. Furthermore, we fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments demonstrate that our proposed S2ST architecture significantly outperforms existing systems in stress translation capability while maintaining competitive translation quality. Furthermore, our evaluation metric exhibits a strong correlation with human subjective judgments.

156. 【2606.15216】Spokes: Optimizing for Diverse Pretraining Data Selection

Link: https://arxiv.org/abs/2606.15216

Authors: Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: fixed data budgets, redundancy and repetition, plays a critical, critical role, budgets by reducing

Comments: 9 pages, 4 figures

Abstract:Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.
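
The abstract names exponentiated gradient as the optimizer. A generic exponentiated-gradient step on a probability vector looks like the sketch below; it shows only the update rule, not the G-Vendi objective or the SPOKES pipeline. The gradients are made-up numbers, and treating the step as ascent (increasing the objective) is an assumed sign convention.

```python
import math


def exponentiated_gradient_step(weights, grads, lr=0.1):
    """One multiplicative update on the simplex: w_i <- w_i * exp(lr * g_i) / Z."""
    unnormalized = [w * math.exp(lr * g) for w, g in zip(weights, grads)]
    z = sum(unnormalized)
    # Dividing by Z keeps the weights a valid probability distribution.
    return [u / z for u in unnormalized]


weights = [0.25, 0.25, 0.25, 0.25]  # uniform selection probabilities over four toy items
grads = [1.0, 0.0, 0.0, -1.0]       # hypothetical gradient of a diversity objective
updated = exponentiated_gradient_step(weights, grads, lr=0.5)
print([round(w, 3) for w in updated])
```

The multiplicative form keeps every weight positive and the vector normalized without an explicit projection, which is the usual reason for choosing this update when optimizing over sampling probabilities.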

157. 【2606.15191】AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

Link: https://arxiv.org/abs/2606.15191

Authors: Michelle Barbosa, Sebastian Padó, Franziska Weeber

Subjects: Computation and Language (cs.CL)

Keywords: Socio-cultural stereotypical bias, important consideration, development and deployment, Socio-cultural stereotypical, stereotypical bias

Comments: The 1st Workshop on Stereotypes Across Cultures in Language Technologies

Abstract:Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is however often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with a stronger Indian language coverage show higher bias for pan-Indian groups than hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.

158. 【2606.15161】Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

Link: https://arxiv.org/abs/2606.15161

Authors: Tao Jing, Ningxin Wu, Chen Kang, Dong Yu, Changliang Li, Pengyuan Liu

Subjects: Computation and Language (cs.CL)

Keywords: considerable layer-wise redundancy, standard pruning approach, large language models, established non-uniform sparsity, efficient compression

Comments: 10 pages, 4 figures, 4 tables. Submitted to EMNLP 2026

Abstract:The considerable layer-wise redundancy in large language models (LLMs) has established non-uniform sparsity allocation across layers as the standard pruning approach for efficient compression. Existing layer-wise allocation methods that estimate allocation strategy from local signals such as activation outliers or weight spectra mainly derive from local layer importance, whereas the final post-pruning performance is also influenced by the network's subsequent compensatory capacity. In this paper, we directly characterize this property through controlled perturbation experiments. We make the following empirical findings. First, layers exhibit highly heterogeneous responses to pruning-scale perturbations. In most cases, early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon. Under small perturbations the network exhibits amplification across all layers, and the transition to absorption occurs smoothly as perturbation magnitude grows to pruning scale. This enriches the linearized accumulation theory underlying related works. Building on these findings, we define an absorption coefficient per layer and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning by reducing perplexity by 7.13% and boosting zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.
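
The "relative L2 drift" quantity mentioned above can be read as the norm of the difference between perturbed and unperturbed hidden states, divided by the norm of the unperturbed state. The sketch below assumes that reading and uses tiny hand-written vectors in place of real hidden states; the paper's exact definition and its absorption coefficient may differ.

```python
import math


def relative_l2_drift(clean, perturbed):
    """||perturbed - clean||_2 / ||clean||_2 for one layer's hidden state."""
    diff = math.sqrt(sum((p - c) ** 2 for c, p in zip(clean, perturbed)))
    base = math.sqrt(sum(c * c for c in clean))
    return diff / base if base else float("inf")


def drift_profile(clean_states, perturbed_states):
    """Relative drift at each layer, in depth order."""
    return [relative_l2_drift(c, p) for c, p in zip(clean_states, perturbed_states)]


# Toy hidden states for three consecutive layers (not real model activations).
clean_states = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
perturbed_states = [[3.0, 5.0], [3.0, 4.5], [3.0, 4.1]]
profile = drift_profile(clean_states, perturbed_states)
print([round(d, 3) for d in profile])
```

A profile that shrinks with depth, as in this toy case, corresponds to later layers absorbing the perturbation; a growing profile would indicate amplification.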

159. 【2606.15152】Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Link: https://arxiv.org/abs/2606.15152

Authors: Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

Categories: Computation and Language (cs.CL)

Keywords: visible social signals, Social interaction depends, emotional shifts, language and visible, social signals

Abstract:Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce a benchmark for evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision settings reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at this https URL, and the dataset is available at this https URL.

160. 【2606.15144】PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

Link: https://arxiv.org/abs/2606.15144

Authors: Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: underlies word formation, Large language models, Large language, word formation, sequences of subword

Comments: Submitted to EMNLP 2026

Abstract:Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

161. 【2606.15121】When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

Link: https://arxiv.org/abs/2606.15121

Authors: Mengzhu Liu, Long Qin, Chuan Ai, Zhengqiu Zhu, Hongru Liang, Chen Gao, Yong Li, Xin Lu, Quanjun Yin

Categories: Computation and Language (cs.CL)

Keywords: Predicting individual panic, proactive emergency intervention, Predicting individual, individual panic emotional, emotional arousal

Abstract:Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address these issues, we introduce the PanicCognitivePath (PCP) framework. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.

162. 【2606.15088】When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Link: https://arxiv.org/abs/2606.15088

Authors: Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: piece Für Elise, piano piece Für, Für Elise, piece Für, Elise is calm

Abstract:A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

163. 【2606.15080】AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

Link: https://arxiv.org/abs/2606.15080

Authors: Dayeon Ki, Kevin Duh, Marine Carpuat

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Reasoning Models, show strong performance, Large Reasoning, show strong, fail to reason

Comments: 20 pages, 5 figures

Abstract:While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-offs in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.
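As a rough illustration of the idea in the abstract above, the sketch below combines a correctness reward with a language-match bonus whose weight ramps up over training, then applies the group-relative normalization that GRPO uses. The linear ramp, the additive reward, and all function names are assumptions for illustration only; the abstract does not give the actual form of the alignment factor.

```python
import statistics


def alignment_factor(step, total_steps, max_weight=1.0):
    """Assumed schedule: weight grows linearly from 0 to max_weight."""
    return max_weight * min(1.0, step / total_steps)


def shaped_rewards(correct, language_match, step, total_steps):
    """Correctness reward plus a language-match bonus scaled by the schedule.

    `correct` and `language_match` hold one 0/1 flag per sampled response
    in the group (hypothetical inputs).
    """
    weight = alignment_factor(step, total_steps)
    return [float(c) + weight * float(m) for c, m in zip(correct, language_match)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Early in training the bonus weight is near zero, so responses are ranked by correctness regardless of reasoning language; later the bonus separates otherwise equally correct responses, which matches the explore-then-exploit behaviour the abstract describes.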

164. 【2606.15079】Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Link: https://arxiv.org/abs/2606.15079

Authors: Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Liang, Chen Qian, Chengfu Tang, Chengyao Wen, Chilin Fu, Chunwei Wu, Cong Zhang, Cunyin Peng, Daixin Wang, Dalong Zhang, Deng Zhao, Dingnan Jin, Dingyuan Zhu, Donghao Zhang, Fan Yuan, Fangzheng Zhao, Fanzhuang Meng, Feifan Wu, Feng Xu, Fengbin Fang, Gangshan Wang, Guodong Yang, Hailin Zhao, Haitao Wang, Haitao Zhang, Hanxiao Zhang, Hanzi Wang, Hao Dai, Hao Liu, Hao Qian, Hao Wu, Haoxiong Liu, Haoyu Xu, Heng Zhang, Hong Liu, Hongliang Zhang, Hongrui Liu, Hongxun Li, Hongzhi Ruan, Huaidong Xiong, Huihuang Zheng, Huikang Tang, Jia Guo, Jia Li, Jia Liu, Jiameng Wang, Jiaming Liu, Jiannan Shi, Jianping Wei, Jiaolong Yang, Jiapeng Wang, Jie Gao, Jie Wang, Jiewei Wu, Jin Yang, Jinjin Li, Jinjing Huang, Jinquan Sun, Jinyao Chen, Juanhui Tu, Jun Liu, Jun Mei, Jun Xu, Jun Zhou, Junjie Ou, Junnan Sipan, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kuan Xu, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Chen, Lei Liang, Lei Xu, Li Tang, Liang Jiang, Liangcheng Fu, Lihui Zhang, Linfeng Shi, Lintao Ma, Liyuan Liu, Longfei Li, Longfei Zheng, Lu Liu, Lu Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: deliver both low-latency, intelligence requires models, strong reasoning capabilities, agentic intelligence requires, requires models

Abstract:Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

165. 【2606.15077】Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

Link: https://arxiv.org/abs/2606.15077

Authors: Kyle Gao, Joel Cumming, Jonathan Li, Linlin Xu, David A. Clausi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: natural language queries, retrieving remote sensing, remote sensing data, cloud-based geospatial catalogues, language queries

Comments: Accepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026

Abstract:We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.

166. 【2606.15070】Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

Link: https://arxiv.org/abs/2606.15070

Authors: Jiakai Li, Ke Qin, Rongzheng Wang, Yizhuo Ma, Qizhi Chen, Muquan Li, Shuang Liang

Categories: Computation and Language (cs.CL)

Keywords: test-time compute scaling, incorporating test-time compute, solve complex problems, compute scaling, problems through explicit

Comments: ICML 2026 Spotlight

Abstract:By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.

167. 【2606.15069】CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

Link: https://arxiv.org/abs/2606.15069

Authors: Qianyu Wang, Xiaoman Wang, Yuanyuan Liang, Xinyuan Li, Yunshi Lan

Categories: Computation and Language (cs.CL)

Keywords: Grammatical error correction, Grammatical error, GEC Mutual Information, GEC benchmarks, GEC

Abstract:Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models often fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can lead to the label flipping issue. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as the syntax of the original instances by altering the word-level and sentence-level contexts; (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* data. The code is released at this https URL.

168. 【2606.15059】A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Link: https://arxiv.org/abs/2606.15059

Authors: Yulin Xue, Siqi Ouyang, Lei Li

Categories: Computation and Language (cs.CL)

Keywords: real-time cross-lingual communication, enables real-time cross-lingual, continuous input, cross-lingual communication, real-time cross-lingual

Comments: Accepted to IWSLT 2026 Scientific Track

Abstract:Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.

169. 【2606.15044】Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Link: https://arxiv.org/abs/2606.15044

Authors: Kieron Seven Jun Wei Lee, Muhammad Reza Qorib, Andrew Ivan Soegeng, Hwee Tou Ng

Categories: Computation and Language (cs.CL)

Keywords: bridge discrete text, continuous neural representation, depend on subword, Multilingual large language, bridge discrete

Abstract:Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.
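One simple way to see the cost inflation the abstract above refers to is to count tokens for the same sentence in several languages and compare against a reference language. The sketch below is an assumed, minimal version of such a check; the `token_premium` helper, the toy whitespace tokenizer, and the choice of ratio are illustrative and are not the metrics used in the paper.

```python
def token_premium(tokenize, parallel_sentences, reference_lang):
    """Token count per language divided by the reference language's count.

    `tokenize` is any callable mapping a string to a list of tokens;
    `parallel_sentences` maps a language code to a translation of the same
    sentence. A value above 1.0 means that language needs more tokens than
    the reference for the same content.
    """
    counts = {lang: len(tokenize(text)) for lang, text in parallel_sentences.items()}
    reference_count = counts[reference_lang]
    return {lang: count / reference_count for lang, count in counts.items()}


# Toy usage with a whitespace tokenizer standing in for a real subword tokenizer.
example = {"en": "a b", "xx": "a b c d"}
premiums = token_premium(str.split, example, reference_lang="en")
```

With a real tokenizer, the same ratio averaged over a parallel corpus gives a first-order view of how unevenly inference cost is distributed across languages.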

170. 【2606.15037】ReportQA: QA-Based Radiology Report Evaluation

Link: https://arxiv.org/abs/2606.15037

Authors: Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: advancing automated report, automated report generation, essential for advancing, advancing automated, Radiology report

Abstract:Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Because CE metrics rely heavily on manual annotations, they are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework, supporting detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.

171. 【2606.15033】Cloze: An Open Research Platform for Studying Human-AI Conversations in Mental Health Contexts

Link: https://arxiv.org/abs/2606.15033

Authors: Matthew Flathers, Francesco Cipriani, John Torous

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: open-source web platform, conducting controlled, monitored studies, open-source web, Cloze

Comments: 7 pages, 2 figures. Cloze is released under AGPL-3.0

Abstract:Cloze is an open-source web platform for conducting controlled, monitored studies of human-AI conversation in mental health research contexts. Consumer large language model (LLM) products such as ChatGPT, Claude, and Gemini are built for individual productivity, and offer researchers little experimental control, inconsistent data export, and no shared safety scaffolding that holds across providers. Cloze gives research teams a single environment in which they configure which models participants converse with, how the AI is instructed, how conversations are scheduled over time, and which safety constraints apply unconditionally, while every message is captured with full provenance (model version, prompt configuration, timing). The platform currently supports OpenAI, Anthropic, Google, and locally hosted open-weight models served through Ollama behind a unified interface, and runs in the cloud or fully on premises so that participant data need never leave an institution. Cloze is research infrastructure for building an evidence base on human-AI interaction in mental health contexts. It is not a therapeutic product.

172. 【2606.15026】Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

Link: https://arxiv.org/abs/2606.15026

Authors: Desta Haileselassie Hagos, Saurav Keshari Aryal, Patrick Ymele-Leki, Anietie Andy, Legand L. Burge

Categories: Computation and Language (cs.CL)

Keywords: Temporal Convolutional Networks, Long Short-Term Memory, affective computing, important for health, health monitoring

Comments: Accepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: [this https URL](https://doi.org/10.1145/3807503.3819363)

Abstract:Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
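The two fusion strategies named in the abstract above can be contrasted in a few lines. This is a generic sketch, not the authors' pipeline: it assumes early fusion means concatenating per-window wrist and chest feature vectors, and that late fusion means averaging class probabilities across models (soft voting). The abstract does not state which combination rule the ensemble actually uses.

```python
def early_fusion(wrist_features, chest_features):
    """Concatenate the two sensor feature vectors before they reach a model."""
    return list(wrist_features) + list(chest_features)


def late_fusion(prob_vectors):
    """Average class probabilities from several models (assumed soft voting).

    `prob_vectors` holds one probability vector per model, all over the same
    classes. Returns the winning class index and the averaged probabilities.
    """
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    averaged = [
        sum(probs[k] for probs in prob_vectors) / n_models
        for k in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged


# Toy usage: three models (e.g. LSTM, TCN, Transformer) voting over three classes.
label, averaged = late_fusion([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.7, 0.2]])
```

Early fusion lets a single model learn cross-sensor interactions, while late fusion keeps the models independent and only merges their outputs, which is why the two can be combined as the abstract describes.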

173. 【2606.15017】Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

Link: https://arxiv.org/abs/2606.15017

Authors: Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier

Categories: Computation and Language (cs.CL)

Keywords: augment a base, Online web agents, Online web, base actor, Online

Abstract:Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.

174. 【2606.15007】Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Link: https://arxiv.org/abs/2606.15007

Authors: NVIDIA: Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Hybrid Mamba-Attention language, billion active parameter, Supervised Fine Tuning, Hybrid Mamba-Attention, Mamba-Attention language model

Abstract:We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

175. 【2606.14961】CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

链接https://arxiv.org/abs/2606.14961

作者:Juming Xiong,Weixin Liu,Kevin Guo,Congning Ni,Junchao Zhu,Chongyu Qu,Chao Yan,Katherine Brown,Avinash Baidya,Xiang Gao,Bradley Malin,Zhijun Yin

类目:Computation and Language (cs.CL)

关键词:improve LLM performance, poorly supported, plausible yet incomplete, incomplete or poorly, LLM performance

备注

Abstract:Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.

176. 【2606.14943】Simplifying the Modeling of Arbitrary Conditionals in Natural Language

链接https://arxiv.org/abs/2606.14943

作者:Yinhan Lu,Eric Elmoznino,Léo Gagnon,Sarthak Mittal,Tejas Kasetty,Guillaume Lajoie

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Transformers model sequences, conditional likelihood computation, Causal Transformers model, Causal Transformers, joint distribution

备注

Abstract:Causal Transformers model sequences through an autoregressive factorization of the joint distribution, which enables efficient left-to-right decoding and conditional likelihood computation. However, they cannot tractably sample from or evaluate arbitrary conditionals -- e.g., a block of text conditioned on past and future tokens. Recent work aims to solve this problem through novel architectures, but they often lead to sub-optimal modeling of such conditionals and degraded generations. We propose Arbitrary Conditionals GPT (AC-GPT) which introduces a simple modification to standard causal Transformers to enable evaluating and sampling from arbitrary conditionals -- including past, future, and mixed contexts -- within a single forward pass. Unlike prior approaches, our method preserves the standard left-to-right ordering and next-token prediction objective essential for both strong performance and efficient training on natural language. Crucially, this compatibility allows existing LLMs to be fine-tuned for arbitrary conditioning. Our empirical results indicate that our method outperforms baselines on modeling arbitrary conditionals, without degrading standard left-to-right performance.

177. 【2606.14922】An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

链接https://arxiv.org/abs/2606.14922

作者:Vinh Dang Quang,Huy Ngo Quang

类目:Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:couple of years, deep learning-based TTS, improved dramatically, deep learning, learning-based TTS systems

备注: 4 pages

Abstract:Over the last few years, speech synthesis has improved dramatically thanks to deep learning. A growing number of deep learning-based TTS systems produce voices with high intelligibility and naturalness. Controlling expressiveness, however, remains a challenge, and generating speech in different styles or manners has recently received considerable attention from the community. This paper presents our solutions to the emotional speech synthesis (ESS) task at VLSP 2022, which aims to generate humanlike, natural-sounding speech from a given input text with a desired emotional expression. By integrating a speaker embedding and a prosody bottleneck into FastSpeech 2, our systems show promise in generating emotional speech for a single speaker (Sub-task 1) and in transferring speaking styles from another speaker to a target speaker that has only neutral, non-expressive data while retaining the target speaker's identity (Sub-task 2).

178. 【2606.14885】Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

链接https://arxiv.org/abs/2606.14885

作者:Yi Lu,Zhuofeng Li,Ping Nie,Haoxiang Zhang,Yuyu Zhang,Kai Zou,Wenhu Chen,Jimmy Lin,Dongfu Jiang,Yu Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large corpora relies, scalable candidate discovery, Agentic search, candidate discovery, large corpora

备注: 25 pages, 4 figures, 22 tables

Abstract:Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On Browsecomp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.

179. 【2606.14875】Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

链接https://arxiv.org/abs/2606.14875

作者:Sisong Bei,Mikhail L. Arbuzov,Ziwei Dong,Dmitri Kalaev,Alexey Shvets

类目:Computation and Language (cs.CL)

关键词:multi-hop question answering, small language models, study context compression, Telegraph English, study context

备注

Abstract:We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.

180. 【2606.14867】Evaluating the Robustness of Proof Autoformalization in Lean 4

链接https://arxiv.org/abs/2606.14867

作者:Zhengtao Gui,Sheng Yang,Zhouxing Shi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:formal language, Proof autoformalization, Proof autoformalization aims, informal proof written, mathematical informal proof

备注: Preprint

Abstract:Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean 4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via this https URL.

181. 【2606.14832】PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

链接https://arxiv.org/abs/2606.14832

作者:Chenxin Li,Zhengyao Fang,Zhengyang Tang,Pengyuan Lyu,Xingran Zhou,Xin Lai,Fei Tang,Liang Wu,Yiduo Guo,Weinong Wang,Junyi Li,Yi Zhang,Yang Ding,Huawen Shen,Sunqi Fan,Shangpin Peng,Zheng Ruan,Anran Zhang,Benyou Wang,Chengquan Zhang,Han Hu

类目:Computation and Language (cs.CL)

关键词:increasingly expected, GUI, PhoneHarness, agents, PhoneHarness Bench

备注: Project Page: [this https URL](https://phoneharness.github.io/)

Abstract:Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

182. 【2606.14820】Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

链接https://arxiv.org/abs/2606.14820

作者:Yuxuan Chen,Haoyuan Yu,Peize He

类目:Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:phase fine structures, Recent spatial, localization tasks, raising questions, fine structures

备注: Accepted to INTERSPEECH 2026; 6 pages, 3 figures

Abstract:Recent spatial self supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference to evaluate this. Using an equalization cancellation baseline and a GCC PHAT positive control we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD confirming binaural specificity. Two general purpose binaural SSL models exhibit minimal phase sensitivity while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general purpose binaural SSL models rely on spectro temporal interference textures rather than cross channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.

183. 【2606.14782】Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

链接https://arxiv.org/abs/2606.14782

作者:Tianhao Chen,Yuheng Wu,Kelu Yao,Xiaogang Xu,Xiaobin Hu,Dongman Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large Language Models, Multimodal Large Language, Large Language, achieve strong vision-language, strong vision-language reasoning

备注

Abstract:Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%.

184. 【2605.28860】Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

链接https://arxiv.org/abs/2605.28860

作者:Jeanmely Rojas Nunez,Viraj Sawant,Nathan Allen,Nomgondalai Amgalanbaatar,Yannis Zongo,Vasu Sharma,Maheep Chaudhary

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:large language models, Fine-tuning large language, frequently induces catastrophic, prior capabilities, language models

备注

Abstract:Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy (Shenfeld et al., 2025). We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: this https URL.

185. 【2603.04592】From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

链接https://arxiv.org/abs/2603.04592

作者:Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Standard Large Language, Large Language Models, Standard Large, Language Models, Large Language

备注: Accepted by ACL 2026 Findings

Abstract:Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at this https URL.

186. 【2606.14823】Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation

链接https://arxiv.org/abs/2606.14823

作者:Victoria Paterson

类目:Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:approved drug targets, higher approval rate, Open Targets, enriched among approved, approved drug

备注

Abstract:Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.
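
The odds ratios quoted in this abstract compare approval odds between targets with and without genetic support. As a rough illustration of how such a figure and a Wald-style 95% confidence interval can be computed from a 2x2 table, here is a minimal Python sketch. The counts are hypothetical (the abstract does not report the underlying table), the function name is illustrative, and the paper may well use a different interval method, such as the bootstrap it mentions for the target-level analysis.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: genetically supported and approved
    b: genetically supported and not approved
    c: unsupported and approved
    d: unsupported and not approved
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
or_value, ci_low, ci_high = odds_ratio_with_ci(30, 70, 10, 90)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

A pair-level table like this treats every target-disease pair as independent, which is the assumption the abstract flags as inflating area-specific estimates.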

信息检索

1. 【2606.17041】Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

链接https://arxiv.org/abs/2606.17041

作者:Anzhe Xie,Weihang Su,Yujia Zhou,Yiqun Liu,Qingyao Ai

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:ECO-guided study selection, ECO-guided study, study selection, statistical aggregation, demanding form

备注: 13 pages, 7 figures, preprint for arXiv, dataset and code available at [this https URL](https://github.com/BFTree/MetaSyn)

Abstract:Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
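
The gap between the retrieval ceiling and the end-to-end result can be made concrete with stage-attributed recall. The Python sketch below is a toy illustration under assumed data: the document IDs, the cutoff, and the `recall` helper are hypothetical and are not taken from MetaSyn or its evaluation code.

```python
def recall(selected, relevant):
    """Fraction of the relevant items that appear in the selected items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(selected)) / len(relevant)


# Hypothetical ground truth: studies included in the meta-analysis.
included = {"s1", "s2", "s3", "s4"}

# Hypothetical ranked retrieval output; "n*" are hard negatives.
ranked = ["s1", "n1", "s2", "n2", "s3"]
k = 4
retrieved = ranked[:k]

# Hypothetical screening decision made over the retrieved pool.
screened = ["s1", "n1"]

retrieval_recall = recall(retrieved, included)   # ceiling set by retrieval
end_to_end_recall = recall(screened, included)   # what the pipeline delivers
screening_recall = recall(screened, set(retrieved) & included)  # loss in screening

print(retrieval_recall, screening_recall, end_to_end_recall)
```

Reporting the three numbers separately shows whether a low end-to-end score comes from retrieval or from screening, which a single aggregate score would hide.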

ACM classes: H.3.3; I.2.7; H.3.7

2. 【2606.16973】How Much Do Reviews Really Contribute? A Study on Text-Enriched Matrix Factorization for Recommendations

链接https://arxiv.org/abs/2606.16973

作者:Eduardo Ferreira da Silva,Mayki dos Santos Oliveira,Joel Machado Pires,Denis Dantas Boaventura,Frederico Araújo Durão

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Recommender System, Incorporating textual reviews, Incorporating textual, prominent strategy, strategy for enriching

备注: 14 pages, 4 figures, SBBD 2026 ISSN 2763-8979

Abstract:Incorporating textual reviews into a Recommender System has become a prominent strategy for enriching collaborative signals with semantic information. However, the actual contribution of review-derived representations remains an open question, particularly when strong collaborative baselines are employed. In this work, we systematically investigate the impact of textual information on Matrix Factorization by introducing and comparing three enrichment strategies over a common collaborative backbone. First, we propose a learnable gating mechanism that adaptively balances collaborative and textual signals during training. This mechanism is applied to two distinct review representations: (i) aggregated topic profiles extracted from user and item histories, and (ii) full text embedding representations derived from reviews. Additionally, we explore a cross-attention mechanism that identifies and emphasizes the most informative dimensions of the textual representation before fusion with collaborative factors. We evaluate six variants: pure, enriched with topic profiles and text via gating; enriched with topics and text via gating; and enhanced with cross-attention over textual features. Experiments across multiple review-based datasets reveal that although adaptive fusion mechanisms improve representation flexibility, the marginal contribution of textual signals remains limited compared to the collaborative backbone. These findings suggest that, under typical rating-prediction settings, collaborative information continues to dominate performance, raising important considerations for the effective integration of semantic review signals into recommendation models.

3. 【2606.16970】A Theoretical Framework for Risk Analysis of Stochastic Rankers

链接https://arxiv.org/abs/2606.16970

作者:Debasis Ganguly

类目:Information Retrieval (cs.IR)

关键词:stochastic ranking policies, top ranks, fair exposure, deterministic rankers, rankers that seek

备注

Abstract:Different from deterministic rankers that seek to maximize relevance at top ranks, stochastic ranking policies instead estimate distributions over permutations, from which rankings are sampled, towards obtaining diversified or fair exposure. Such policies are commonly evaluated in terms of expected effectiveness post-reranking. However, the randomness inherent in these policies gives rise to a fundamental but under-explored ex ante question: prior to applying stochastic reranking, how large can the induced variation in retrieval effectiveness be in the worst case? This paper presents a theoretical analysis of reranking risk, defined as the maximum absolute change in discounted cumulative gain (DCG) resulting from a permutation sampled from a stochastic reranking policy applied to a fixed retrieved list. We derive that this risk is governed by the distribution of the recall points in the initial retrieved list. We conduct experiments on submitted runs from the TREC Fairness 2022 track that employ stochastic reranking policies and empirically demonstrate that the effectiveness variations predicted by our theory closely approximate the observed changes in DCG.
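
To make the risk definition concrete, the following Python sketch brute-forces the maximum absolute DCG change over all permutations of a short retrieved list. It is an illustration only: it assumes binary gains, the common `gain / log2(rank + 1)` discount, and a policy whose support covers every permutation, none of which is confirmed by the abstract. The paper derives the quantity analytically rather than by enumeration.

```python
import itertools
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def worst_case_dcg_change(gains):
    """Largest absolute DCG change over all permutations of the list.

    Enumeration is factorial in the list length, so this is only
    practical for very short lists.
    """
    base = dcg(gains)
    return max(abs(dcg(p) - base) for p in itertools.permutations(gains))


# Hypothetical retrieved list: relevant documents at ranks 1 and 4.
initial = [1, 0, 0, 1, 0]
risk = worst_case_dcg_change(initial)
print(f"initial DCG = {dcg(initial):.4f}, worst-case change = {risk:.4f}")
```

In this toy list the worst case is the permutation that pushes both relevant documents to the bottom, so the risk depends on where the relevant documents sit in the initial ranking, which matches the abstract's point about recall points.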

4. 【2606.16838】OneRank: Unified Transformer-Native Ranking Architecture for Multi-Task Recommendation

链接https://arxiv.org/abs/2606.16838

作者:Jiakai Tang,Sunhao Dai,Kun Wang,Zhiluohan Guo,Yu Zhao,Cong Fu,Kangle Wu,Yabo Ni,Anxiang Zeng,Xu Chen,Jun Xu

类目:Information Retrieval (cs.IR)

关键词:diverse user feedback, enable complementary learning, user feedback, essential in recommender, recommender systems

备注: KDD 2026 Accepted

Abstract:Multi-task learning (MTL) is essential in recommender systems to enable complementary learning among diverse user feedback. While modern industrial practices have shifted from DNNs to Transformer-centric architectures to strengthen sequence modeling and scaling capacity, they still decouple feature encoding from multi-task prediction, treating the Transformer as a task-agnostic encoder. This design fundamentally limits the performance and scalability by (1) creating an information bottleneck under heterogeneous task objectives, (2) inducing gradient interference that leads to the seesaw phenomenon, and (3) forcing a dataflow transition in which attention-based, context-adaptive representation learning is converted to static feed-forward task prediction with incompatible information read-write dynamics. We propose OneRank, a Transformer-native multi-task ranking framework that eliminates encoder-predictor separation and introduces task-private channels for forward representation learning and backward optimization, enabling task-specialized learning while reducing inter-task interference. In the forward pass, OneRank learns task-specific representations bottom-up through task-conditioned information selection, candidate-aware contextualization, and controlled cross-task interaction. In the backward pass, cross-task gradient detachment isolates task-private parameter updates from shared knowledge extraction modules, preventing negative transfer. We further replace static task-specific MLP scorers with dynamic matching-based scoring for context-aware personalized ranking. By internalizing multi-task reasoning within the Transformer stack, OneRank establishes a unified and scalable architectural paradigm. Offline and online experiments on large-scale industrial datasets show that OneRank significantly outperforms state-of-the-art baselines while maintaining computational efficiency.

5. 【2606.16821】How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation

链接https://arxiv.org/abs/2606.16821

作者:Yimeng Chen,Zhe Ren,Firas Laakom,Yu Li,Dandan Guo,Jürgen Schmidhuber

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Information Retrieval (cs.IR)

关键词:Large language model, agents synthesize open-web, Large language, synthesize open-web content, behalf of users

备注: 23 pages, 3 figures

Abstract:Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode attack taxonomy, and multiple output-level metrics. We evaluate 13 LLM backends on 308 cases each. Results show that vulnerability patterns vary across backends: overall attack success rate (ASR) ranges from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, the strongest attack mode differs by model family, and the same deployment scaffold could amplify or decrease ASR on different backends. An auxiliary agent-skill probe, where endorsement becomes an install command, exposes a sharp split among otherwise robust backends: Claude over-rejects while GPT over-trusts. These findings argue for treating recommendation reliability under adversarial search content as a first-class dimension of backend safety evaluation.

6. 【2606.16817】Understanding the Behaviors of Environment-aware Information Retrieval

链接https://arxiv.org/abs/2606.16817

作者:Ruifeng Yuan,Chaohao Yuan,David Dai,Yu Rong,Hong Cheng,Hou Pong Chan,Chenghao Xiao

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Recent retrieval-augmented generation, demonstrated strong capability, current research overlooks, Recent retrieval-augmented, query formulation strategies

备注: ACL 2026 Main

Abstract:Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting strategies learned for one retriever ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at this https URL.

7. 【2606.16703】Harmonizing Semantic and Collaborative in LLMs: Reasoning-based Embedding Generator for Sequential Recommendation

Link: https://arxiv.org/abs/2606.16703

Authors: Qidong Liu,Mingyao Huang,Moranxin Wang,Wenxuan Yang,Haiping Zhu

Categories: Information Retrieval (cs.IR)

Keywords: Sequential Recommender Systems, Sequential Recommender, Recommender Systems, users' interaction histories, widely deployed

Comments: 11 pages, 5 figures

Abstract:Sequential Recommender Systems (SRS) predict the next item of interest based on users' interaction histories and have been widely deployed, but they are hindered by the long-tail problem. Large Language Models (LLMs), with strong semantic understanding and reasoning capabilities, offer a promising way to enrich item semantics and have recently been used as embedding generators. However, two fundamental gaps remain. First, current LLM-based embedding methods fail to exploit the model's inner reasoning capacity. Second, existing methods often inject collaborative signals implicitly via supervised fine-tuning, lacking explicit guidance for collaborative embedding alignment. In this paper, we introduce ReaEmb, a novel framework that resolves both issues via a Latent Reasoning-enhanced Contrastive Learning (LRCL) stage and a Collaborative Reward Reinforcement Learning (CRRL) stage. LRCL exploits the LLMs' inner reasoning capacity through a two-pass forward process with an additional attention module. CRRL then explicitly injects collaborative signals into the LLM via tailored reinforcement learning. Extensive experiments on three real-world datasets demonstrate the superior effectiveness of ReaEmb across multiple SRS models. To ease reproducibility, we release the code online.

8. 【2606.16661】SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG

Link: https://arxiv.org/abs/2606.16661

Authors: Nathanaël Langlois

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Fixed-length chunking, boundary fragmentation, split across segments, degrading retrieval recall, chunking in Retrieval-Augmented

Comments: 5 pages, 1 figure

Abstract:Fixed-length chunking in Retrieval-Augmented Generation (RAG) often leads to boundary fragmentation, where critical evidence is split across segments, degrading retrieval recall. While static windowing and parent retrieval improve recall, they introduce significant token overhead. We propose SCAR (Semantic Continuity-Aware Retrieval), an adaptive retrieval policy that selectively expands neighboring chunks by weighing query-neighbor relevance against a structural continuity penalty. SCAR uses a relative expansion threshold tied to each retrieved chunk's own query-relevance, yielding an approximately scale-invariant decision rule that transfers across embedding models without recalibration. Across four diverse corpora (RFC, GDPR, a 10-K report, and a Merger agreement; N=320 queries; 160 boundary-fragmented), SCAR achieves 92.8% recall on boundary-fragmented queries with only 7.84 chunks, a 22.9% reduction compared to static windowing (10.16 chunks). Paired bootstrap tests (B=10,000) confirm the chunk reduction is highly significant (p<0.0001, Cohen's d=-1.49, large effect), with a small recall difference (Cohen's d=-0.33). The policy transfers across three embedding models (text-embedding-3-large, BGE-large-en-v1.5, zembed-1) using the same single hyperparameter setting, and downstream RAGAS evaluation on the 10-K corpus confirms SCAR preserves generation faithfulness while reducing context tokens by 27.1%.
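
The abstract does not give SCAR's exact decision rule, so the following is only a minimal sketch of one way a relative expansion threshold could work. It assumes a neighbor is added when its query relevance, minus a weighted continuity penalty, reaches a fraction `tau` of the retrieved chunk's own query relevance. The names `should_expand`, `expand_hit`, `tau`, and `penalty_weight`, and the scoring formula itself, are hypothetical and not taken from the paper.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_expand(query_vec, chunk_vec, neighbor_vec, continuity_penalty,
                  tau=0.8, penalty_weight=1.0):
    """Assumed rule: expand when penalized neighbor relevance reaches
    tau times the retrieved chunk's own query relevance."""
    anchor_relevance = cosine(query_vec, chunk_vec)
    neighbor_relevance = cosine(query_vec, neighbor_vec)
    score = neighbor_relevance - penalty_weight * continuity_penalty
    return score >= tau * anchor_relevance


def expand_hit(query_vec, chunks, hit, penalties, tau=0.8):
    """Return the sorted indices of the retrieved chunk plus any adjacent
    chunks that pass the (assumed) relative threshold."""
    selected = {hit}
    for neighbor in (hit - 1, hit + 1):
        if 0 <= neighbor < len(chunks) and should_expand(
                query_vec, chunks[hit], chunks[neighbor], penalties[neighbor], tau):
            selected.add(neighbor)
    return sorted(selected)
```

Because the threshold is expressed relative to the retrieved chunk's own score instead of as an absolute similarity value, the same `tau` can plausibly be reused with embedding models whose similarity scales differ, which is the property the abstract describes as approximately scale-invariant.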

9. 【2606.16641】PIANO: Personalized Reranking via Information Aggregation Node for Music Search Optimization

Link: https://arxiv.org/abs/2606.16641

Authors: Weisheng Li,Chuqiao Huang,Pengcheng Li,Zhengchao Peng,Qiang Xiao,Zhongqian Xie,Qiang Huang,Chuanjiang Luo

Categories: Information Retrieval (cs.IR)

Keywords: Unlike short-video content, Unlike short-video, short-video content, tracks have long, long lifecycles

Comments: Accepted at ECML PKDD 2026. 18 pages, 4 figures

Abstract:Unlike short-video content, music tracks have long lifecycles and lasting value. Effective music search re-ranking must therefore align the user's current query with long-term preferences while jointly optimizing Click-Through Rate (CTR) and Conversion Rate (CVR). However, existing methods suffer from two limitations: (1) sequential methods rely on item-interaction history and therefore cannot use historical search queries to tell which past preferences match the user's current search intent; (2) most listwise models optimize a single objective (e.g., CTR only), and conventional multi-objective methods balance click and conversion at the item level, ignoring how these trade-offs play out across the whole ranked list. To address these limitations, we propose PIANO, a personalized listwise re-ranking framework with two key components: (i) the Query-Driven Interest Refiner (QDIR) uses cross-attention over historical queries to align past intents with the current one; (ii) the Information Aggregation Node (IAN), a learnable [CLS]-style token, aggregates the candidate list and predicts CTR/CVR at the list level. Extensive experiments on public and industrial datasets show consistent gains over strong baselines. In online A/B tests on NetEase Cloud Music, a leading music streaming platform, PIANO achieves statistically significant improvements in CTR (+0.62%) and CVR (+4.45%).

10. 【2606.16387】Leveraging Code-Mixed Product Metadata and User Feedback for Personalized Recommendation on Daraz Bangladesh

Link: https://arxiv.org/abs/2606.16387

Authors: KM Fahim A Bari,Muhammad Abdullah Adnan,Nafis Sadeq

Categories: Information Retrieval (cs.IR)

Keywords: Bengali Unicode, platforms host millions, Bangladeshi e-commerce platforms, Latin script, transcribed in Latin

Comments: none

Abstract:Bangladeshi e-commerce platforms host millions of product reviews written in Bengali Unicode, English, and Banglish, where Bengali is phonetically transcribed in Latin script. However, the impact of code-mixed reviews on recommendation performance remains largely unexplored. We present the first such benchmarking on product reviews from Daraz Bangladesh, evaluating six model families under a per-user chronological leave-last-out protocol. To address the severe long-tail sparsity of the dataset, where 59.3% of users have exactly one interaction, we conduct a systematic k-core threshold ablation across five density configurations. The results reveal that Item-based Collaborative Filtering remains stable across settings, Implicit Matrix Factorization degrades sharply with decreasing density, and Explicit Matrix Factorization uniquely improves at higher thresholds. To characterize the impact of code-mixing on recommendation quality, we perform a language-stratified evaluation of content-based filtering using character n-gram TF-IDF profiles. The results provide empirical evidence that fragmentation of the Banglish vocabulary reduces NDCG@10 by 46.8% relative to Bengali-script users, a degradation traceable to transliteration inconsistency across surface forms. This work establishes a reproducible evaluation foundation for recommendation research in code-mixed, low-resource e-commerce settings. The code is publicly available at this https URL.
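
The k-core ablation mentioned above can be illustrated with a short sketch of standard iterative k-core filtering on user-item pairs. This is the generic technique, not the authors' code; the function name `k_core` and the toy data are illustrative assumptions, and the abstract does not state which five thresholds were used.

```python
from collections import Counter


def k_core(interactions, k):
    """Iteratively drop (user, item) pairs until every remaining user and
    every remaining item has at least k interactions."""
    kept = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in kept)
        item_counts = Counter(item for _, item in kept)
        filtered = [
            (user, item)
            for user, item in kept
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # Stop once a full pass removes nothing.
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


# Toy data: u3 has a single interaction and is removed at k=2.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
dense = k_core(toy, 2)
```

The loop is needed because removing a sparse user can push an item below the threshold, and vice versa. On a dataset where most users have exactly one interaction, even a small `k` removes a large share of users.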

11. 【2606.16316】RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Link: https://arxiv.org/abs/2606.16316

Authors: Yongjia Lei,Nedim Lipka,Zhisheng Qi,Utkarsh Sahu,Koustava Goswami,Franck Dernoncourt,Ryan A. Rossi,Yu Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Retrieving external knowledge, solving real-world tasks, coding requiring deep, relevant knowledge involves, knowledge involves implicit

Comments: none

Abstract:Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit and complex reasoning beyond surface-level semantic or lexical matching (e.g., mathematical problems relying on the same theorem or coding requiring deep reasoning). Existing approaches primarily rely on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to perform reasoning over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of performing reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, enabling direct optimization of indexing decisions for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance, while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy across different retrieval systems.
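
As a rough illustration of the group-relative signal used by GRPO, the sketch below standardizes rewards within one group of sampled rationales for the same document. The reward values are made up, the function name `group_relative_advantages` is hypothetical, and the use of the population standard deviation is an assumption, since implementations differ on this detail. How RL-Index actually computes its retrieval-similarity reward is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps).

    Rationales that score above the group average get a positive advantage,
    those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical retrieval-similarity rewards for three candidate rationales.
advantages = group_relative_advantages([0.2, 0.4, 0.6])
```

Since the advantage depends only on how a rationale compares with others sampled for the same document, no separate value model is needed to provide a baseline.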

12. 【2606.16209】Viral Images: Identifying Reprintings within 1.5 Million Photographs in Chronicling America

Link: https://arxiv.org/abs/2606.16209

Authors: Bruno Buccalon,Yueran Sun,Benjamin Charles Germain Lee

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: digitized historic American, historic American newspapers, Chronicling America initiative, historic American, Chronicling America

Comments: 13 pages, 6 figures, 2 tables

Abstract:Within the millions of digitized historic American newspapers in the Chronicling America initiative are tens of millions of photographs, illustrations, cartoons, and advertisements. Much of this visual culture is shared across newspaper titles and issues. Just as reprinted texts within these newspapers speak to the virality of textual content, so too does this reprinted visual culture speak to newspapers as sites of constant information circulation and exchange. In this paper, we introduce Viral Images, a project to identify reprintings within 1.5 million photographs in Chronicling America. For our analysis, we adopt the Newspaper Navigator dataset of extracted photographs from over 16 million pages in Chronicling America. We introduce an unsupervised method of identifying reprintings by leveraging contrastive language-image pretraining (CLIP) to embed these 1.5 million photographs and applying clustering to identify re-printed content. We detail our public interface, this https URL, which we designed in order to enable humanists to interactively browse and study these identified clusters. In addition, we analyze the identified clusters, uncovering a diversity of photographs and advertisements that have been circulated across different newspapers over time.

13. 【2606.16010】Theorem-Grounded Execution Ontologies for Interpretable Machine Reasoning

Link: https://arxiv.org/abs/2606.16010

Authors: Raghu Anantharangachar

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large language models, achieved impressive performance, Large language, tasks spanning mathematics, spanning mathematics

Comments: none

Abstract:Large language models have achieved impressive performance on reasoning tasks spanning mathematics, science, programming, and commonsense inference. Despite these advances, their reasoning processes remain largely latent, making them difficult to interpret, verify, replay, debug, and transfer across domains. Existing approaches such as chain-of-thought, tree-of-thoughts, graph-of-thoughts, and tool-augmented reasoning expose intermediate reasoning artifacts but typically lack explicit execution semantics, formal state representations, and verifiable reasoning structures. We introduce Theorem-Grounded Execution Ontologies (TGEO), a framework that models reasoning as an executable state-transition process rather than a sequence of generated tokens. Given an input problem, TGEO identifies relevant theorem families, binds the problem to a domain ontology, discovers semantic objects, instantiates states and operators, constructs predicates and contracts, and synthesizes an executable reasoning graph. The resulting graph provides an interpretable, replayable, and auditable representation of reasoning in which every state transition, operator application, and validation step is explicitly represented. TGEO integrates five architectural components: (1) theorem-grounded reasoning priors, (2) executable ontologies, (3) operator-mediated state transitions, (4) predicate and contract-based execution validation, and (5) architectural auditing and failure localization. We evaluate TGEO on theorem-intensive reasoning tasks derived from mathematical benchmark domains and a curated Golden Execution Suite. Our findings demonstrate the value of executable reasoning representations for interpretable, verifiable, and reproducible AI reasoning systems.

14. 【2606.15998】Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

Link: https://arxiv.org/abs/2606.15998

Authors: Utshab Kumar Ghosh,Shubham Chatterjee

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: assuming that semantically, entity, Entity Relevance, OER, non-relevant documents

Comments: ICTIR '26

Abstract:Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show this assumption is insufficient and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), whether an entity is topically related to a query, and Observable Entity Relevance (OER), whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($\kappa \approx 0$), while OER operationalizations agree substantially ($\kappa \approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.
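
The agreement figures above are reported as $\kappa$ values. Assuming these are Cohen's kappa between two labelings of the same entities (the abstract does not name the variant), the statistic can be sketched as follows. The function name `cohens_kappa` and the toy labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected), where expected
    is the chance agreement implied by each labeling's marginal frequencies."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both labelings use one identical label; kappa is formally
        # undefined, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Two labelings that agree on exactly half the items, as chance would predict.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])
```

A value near 0 means two labelings agree no more often than their label frequencies alone would predict, which is how the near-chance agreement between CER and OER should be read.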

15. 【2606.15911】Interactor: Agentic RL oriented Iterative Creation for Ad Description Generation in Sponsored Search

Link: https://arxiv.org/abs/2606.15911

Authors: Penghui Wei,Jiayu Wu,Chao Ye,Zhi Guo,Shuanglong Li,Lin Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: automatically generating informative, paper focuses, focuses on automatically, sponsored search, descriptions

Comments: none

Abstract:This paper focuses on automatically generating informative ad descriptions in sponsored search. Unlike ad titles, which are usually optimized to attract user clicks, ad descriptions have a longer text span and can incorporate world knowledge to address user search intents while presenting the fine-grained selling points of the ads. We propose Interactor, a multi-turn iterative creation framework optimized with agentic RL for ad description generation. The generation model acts as a policy that interacts with a customized environment consisting of multiple generative reward models. Given initial generations by the policy, the customized GenRMs evaluate multi-dimensional qualities including knowledge capacity and landing page consistency, providing both binary signals and reasoning feedback. The policy then iteratively refines the descriptions based on this feedback to ensure continuous improvement. Experiments on industrial datasets show that the Interactor framework significantly outperforms state-of-the-art approaches in generating knowledge-rich and faithful ad descriptions. Since May 2026, it has been deployed online in a leading search ads system, contributing to both ad revenue and user experience.

16. 【2606.15906】MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

Link: https://arxiv.org/abs/2606.15906

Authors: Yilong Zuo,Xunkai Li,Jing Yuan,Qiangqiang Dai,Hongchao Qin,Ronghua Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Multimedia (cs.MM)

Keywords: question answering requires, multimodal question answering, locate sparse evidence, Long-document multimodal question, evidence

Comments: none

Abstract:Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at this https URL.

17. 【2606.15838】Intelligent Multimodal Retrieval and Reasoning for Geospatial Knowledge Discovery on the I-GUIDE Platform

Link: https://arxiv.org/abs/2606.15838

Authors: Yunfan Kang,Erick Li,Furqan Baig,Wei Hu,Alexander Michels,Anand Padmanabhan,Shaowen Wang

Categories: Information Retrieval (cs.IR)

Keywords: discovery increasingly requires, increasingly requires search, I-GUIDE Smart Search, knowledge discovery increasingly, Smart Search

Comments: none

Abstract:Geospatial knowledge discovery increasingly requires search across heterogeneous artifacts: datasets, maps, notebooks, software, publications, and the provenance links among them. Conventional geoportals support metadata and spatial filtering, but they rarely provide semantic retrieval, graph-aware provenance traversal, and conversational synthesis in one integrated system. This paper presents I-GUIDE Smart Search, a production multimodal geospatial retrieval-augmented generation (RAG) system embedded in the I-GUIDE Platform, and reports on its design, deployment, and evaluation. The system combines production-maintained OpenSearch keyword, vector, and spatial indexes with a Neo4j knowledge graph and an iterative RAG pipeline for memory-aware query augmentation, reasoning, retrieval-method routing, relevance grading, grounded generation, and hallucination and relevance checking. In a single-A100 RAG deployment, I-GUIDE Smart Search supports interactive use up to about 100 concurrent simulated users, reaching 4.4 requests per second with p50 latency near 25 seconds despite 20-50 LLM calls per query. For answer quality, we evaluate a four-category benchmark of 170 unique human-filtered user-facing queries, together with ten intent-specific probe sets generated from the deployed indexes and graph. Smart Search improves retrieved evidence coverage and judged answer quality over non-retrieval and naive-RAG baselines, with the clearest gains on exact-identifier, spatially constrained, simple-recommendation, and domain-specific factual queries requiring current indexed evidence. We distill transferable deployment lessons for spatial RAG systems, covering spatial metadata quality, graph provenance, retrieval routing, interface contracts, refusal-aware evaluation, latency-cost tradeoffs, and the role of the user interface in deployed geospatial cyberinfrastructure.

18. 【2606.15752】One Sequential Recommendation Model Pretrained from Synthetic Priors Predicts Multiple Datasets

Link: https://arxiv.org/abs/2606.15752

Authors: Woosung Kang,Jiwon Jeong,Jonghyeok Shin,Jeongwhan Choi,Noseong Park

Categories: Information Retrieval (cs.IR)

Keywords: Existing sequential recommendation, observed interaction distribution, Existing sequential, Prior-data Fitted Network, training data

Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract:Existing sequential recommendation models rely on dataset-specific training, where the learned parameters are fitted to the item catalog and the observed interaction distribution of the training data. This limits generalization to new domains, typically requiring retraining from scratch. In this work, we propose SRPFN, a Prior-data Fitted Network for sequential recommendation -- predicting the next item in a single forward pass without any gradient-based parameter updates in the target domain. SRPFN is pretrained offline on 25.6M sequences sampled from a synthetic prior that spans diverse item-to-item transition patterns, learning to produce posterior predictive next-item distributions. At inference time, SRPFN generates recommendations by conditioning on a support set of item-item transition examples from the target domain, adapting to domain-specific patterns without retraining. Extensive experiments on five benchmarks across 10 baselines show that SRPFN achieves the best or second-best performance across nearly all metrics and datasets, while being substantially more computationally efficient than trained baselines. These results establish that a single model pretrained on synthetic priors can generalize across diverse real-world domains, offering a framework for update-free sequential recommendation.

19. 【2606.15734】Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

Link: https://arxiv.org/abs/2606.15734

Authors: Weihang Su,Jiacheng Kang,Jingyan Xu,Qingyao Ai,Jianming Long,Hanwen Zhang,Bangde Du,Xinyuan Cao,Min Zhang,Yiqun Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Continual post-training enables, potentially causing catastrophic, post-training enables models, repeatedly updating shared, causing catastrophic forgetting

Comments: none

Abstract:Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.

20. 【2606.15449】Transfer Learning for FHIR Questionnaire Terminology Binding

Link: https://arxiv.org/abs/2606.15449

Authors: Maxim Gorshkov

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: require FHIR Questionnaire, Electronic prior authorization, FHIR Questionnaire items, workflows require FHIR, Vinci CDS-Library lack

Comments: none

Abstract:Electronic prior authorization workflows require FHIR Questionnaire items to carry LOINC codes, yet most items in the HL7 Da Vinci CDS-Library lack these bindings. We treat this as a retrieval problem: given a Questionnaire item's text, find the correct LOINC code in a pool of 97,314 active codes. We compare six methods (TF-IDF, frozen MiniLM, BioBERT, BioLORD, contrastively fine-tuned MiniLM, and a TF-IDF+GPT reranker) on a 54-item evaluation set spanning three query styles (natural question, medium, and terse). No single method wins on every metric. BioLORD, a frozen encoder pre-trained on biomedical ontology definitions, has the best top-rank accuracy (R@1 = 0.185, MRR = 0.246) despite seeing no task-specific data, while a contrastive fine-tune on raw LHC-Forms pairs takes R@5 (0.389) and R@10 (0.426). A distribution-shift ablation shows why the fine-tune in our main table is not the strongest one: adding GPT-generated paraphrases to the raw pairs drops R@5 from 0.389 to 0.296, so the augmented union underperforms raw-only training on every metric except R@1. Performance peaks at 5k training pairs. Error analysis on BioLORD's R@1 failures shows that wrong-specificity and ambiguous-text cases together account for 59% of errors.
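
The R@k and MRR figures above follow the usual definitions for retrieval with a single correct answer per query. A minimal sketch under that assumption is shown below; the function names and the placeholder codes "A" through "D" are illustrative and are not real LOINC codes.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose single gold code appears in the top k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings, gold):
    """Mean of 1/rank of the gold code; a query contributes 0 if the code is absent."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Three toy queries, each with gold code "A" (placeholder, not a real code).
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "D"]]
gold = ["A", "A", "A"]
r_at_1 = recall_at_k(rankings, gold, 1)
mrr = mean_reciprocal_rank(rankings, gold)
```

With only 54 evaluation items, each additional hit moves a recall figure by about 0.0185. For instance, R@1 = 0.185 is consistent with 10 of 54 items being ranked first, so small differences between methods correspond to only a few items.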

21. 【2606.15448】EventConnector: Mining Social Event Relations through Temporal Graphs

Link: https://arxiv.org/abs/2606.15448

Authors: Zijie Lei,Haofei Yu,Ge Liu,Jiaxuan You

Categories: Information Retrieval (cs.IR)

Keywords: Understanding and retrieving, retrieving related real-world, social analysis, retrieving related, fundamental challenge

Comments: none

Abstract:Understanding and retrieving related real-world events based on their temporal dynamics is a fundamental challenge in time-sensitive applications such as forecasting, information retrieval, and social analysis. Existing methods often rely on semantic similarity or global time-series alignment, which overlook the transient and directional dependencies that frequently underlie real-world correlations. In this work, we introduce EventConnector, a framework that constructs a temporal event graph capturing localized co-fluctuations and lead-lag relationships between events through their time-series trajectories. We further propose EC-Fusion, an adaptive retrieval mechanism that fuses EventConnector's graph-based scores with a complementary Granger-causal signal via a graph-quality-aware mixing weight. Across two real-world prediction market benchmarks (Polymarket and Kalshi) and nine forecasting architectures evaluated over three random seeds, EC-Fusion is the best non-oracle retrieval method on $17/18$ model-dataset cells, reducing RMSE by $6.87\%$ on average (up to $10.86\%$) over the strongest comparable retrieval baseline, with statistical significance at $p < 0.01$ after Holm-Bonferroni correction. These results highlight the effectiveness of temporally grounded graph modeling, augmented with causal-signal fusion, in capturing latent event relationships beyond what semantic similarity or traditional alignment techniques can offer.

22. 【2606.15380】Confidence-Based Stopping Methods for Systematic Reviews

Link: https://arxiv.org/abs/2606.15380

Authors: Aaron Fletcher,Mark Stevenson

Categories: Information Retrieval (cs.IR)

Keywords: Technology Assisted Review, Technology Assisted, Assisted Review stopping, Assisted Review, stopping methods aim

Comments: none

Abstract:Technology Assisted Review stopping methods aim to ensure that no more documents are screened than necessary. Most existing approaches focus on achieving a target recall, which does not consider whether an information need has been met. This paper introduces two heuristic stopping methods that instead monitor whether screened documents contain enough information to make a decision. Evaluation on a standard dataset of Diagnostic Test Accuracy Systematic Reviews demonstrates that the proposed approaches substantially reduce the number of documents that need to be examined while, in the majority of cases, maintaining conclusions that are consistent with all evidence available.

23. 【2606.15367】S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

Link: https://arxiv.org/abs/2606.15367

Authors: Yao Dong,Xinglin Xiao,Liwei Dong,Xinlong Jin,Zhengbo Li,Heng Zhang,Duyun Wang,Nan Xu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Deep research agents, Deep research, solve complex knowledge-intensive, research agents aim, research agents

Comments: none

Abstract:Deep research agents aim to solve complex knowledge-intensive tasks through long-horizon planning, evidence gathering, reasoning, and report generation. While recent progress in search agents has demonstrated strong capabilities in information retrieval and answer verification, most existing training datasets remain search-centric, focusing primarily on closed-ended question answering and information localization. As a result, they mainly train information-seeking behavior while providing limited coverage of key deep research capabilities, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation. In this work, we propose a unified trajectory construction paradigm for deep research agents that combines closed-ended QA and open-ended exploration. The proposed framework consists of graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification, enabling scalable synthesis of high-quality agentic trajectories spanning long-chain complex reasoning, deep research instruction following, report writing, file understanding and generation, and skills usage. Compared with existing search-oriented datasets, our synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning. S1-DeepResearch-32B achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions, including complex reasoning, instruction following, report generation, file understanding, and skills usage. On several challenging deep research benchmarks, it approaches the performance of leading proprietary frontier models. These results highlight the importance of jointly modeling information acquisition, knowledge synthesis, and planning-oriented agent behaviors for building effective deep research agents.

24. 【2606.15345】Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Link: https://arxiv.org/abs/2606.15345

Authors: Yuheng Lu,Qingcheng Zeng,Heli Qi,Puxuan Yu,Fuheng Zhao,Rui Yang,Hitomi Yanaka,Naoto Yokoya,Weihao Xuan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: produce grounded answers, reason over retrieved, retrieved sources, increasingly evaluated, produce grounded

Comments: Preprint

Abstract:Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

25. 【2606.15331】HoloRec: Holistic Encoding and Interleaved Reasoning for Generative Recommendation

Link: https://arxiv.org/abs/2606.15331

Authors: Shuqi Zhao, Jingsong Su, Xiang Liu, Xingzhi Yao, Yiming Qiu, Huimu Wang, Liang Lin, Pengbo Mo, Mingming Li, Jiao Dai, Jizhong Han, Songlin Hu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Generative recommendation models, traditional cascade architectures, objective fragmentation problem, requires expensive annotations, lacking hierarchical structure

Abstract:Generative recommendation models that formulate the task as sequence generation overcome the objective fragmentation problem of traditional cascade architectures, yet existing approaches still suffer from flat semantic representations lacking hierarchical structure for multi-step reasoning and an externally constructed chain-of-thought (CoT) that requires expensive annotations and remains disconnected from the generation objective. We propose HoloRec, an endogenous chain-of-thought recommendation mechanism that unifies representation, reasoning, and generation by constructing a hierarchical semantic encoding matrix via multi-granularity nested residual quantization optimized by a holistic reconstruction loss. HoloRec supports two inference modes: a non-thinking mode that uses lightweight multi-granularity supervised alignment for fast prediction, and a thinking mode that employs an interleaved reasoning scheme to generate CoT steps on the fly, directly embedding reasoning into the generation process without external data. Experiments on multiple public recommendation datasets demonstrate that HoloRec consistently outperforms baselines, with especially significant gains in sparse scenarios, and the thinking mode achieves better accuracy than the non-thinking mode with only modest inference overhead.
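
To make the "residual quantization" idea concrete, here is a minimal sketch of plain multi-level residual quantization: each level picks the nearest codeword and passes the remaining residual to the next, finer level. It is not the paper's multi-granularity nested variant and omits its holistic reconstruction loss; the function names and the toy codebooks are illustrative assumptions.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def residual_quantize(vector, codebooks):
    """Encode `vector` as one code index per level, coarse to fine.

    Returns the list of indices and the residual left after the last level.
    """
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword; the next level quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Toy, hand-written codebooks (hypothetical values, two levels, 2-D items).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes, leftover = residual_quantize([1.2, 0.8], toy_codebooks)
print(codes, leftover)
```

The sequence of indices forms a coarse-to-fine identifier for the item, which is the kind of hierarchical code a generative recommender can emit token by token.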

26. 【2606.15330】OneBar: An End-to-End Content-Grounded Generative Query Recommendation Framework for E-Commerce Video Feeds

Link: https://arxiv.org/abs/2606.15330

Authors: Yao Tang, Ying Yang, Ben Chen, Yufei Ma, Zihan Liang, Chenyi Lei, Wenwu Ou, Jian Liu

Categories: Information Retrieval (cs.IR)

Keywords: easily express content-induced, expose clickable search, express content-induced search, clickable search entries, search entries beneath

Comments: Any questions feel free to contact: benchen4395@gmail.com

Abstract:Short-video platforms now expose clickable search entries beneath the video player, enabling users to easily express content-induced search intent. However, conventional query recommendation systems on short-video platforms suffer from latency constraints and objective misalignment, while recent generative approaches struggle with noisy content-side metadata and preference drift. To address these issues, we propose OneBar, an end-to-end generative framework for real-time query recommendation in E-Commerce video feeds. OneBar features three key innovations: (1) a collaborative-multimodal intent grounding module that fuses multimodal video understanding and behavior-derived collaborative anchors; (2) a unified end-to-end architecture equipped with a prompt-compression mechanism for efficient online serving; and (3) a progressive preference learning strategy that internalizes hierarchical behavior preferences into the generative policy, eliminating the need for a separately trained reward model. Compared with the online baseline, OneBar increases Query Exposure by 16.91% and Query Click by 18.68%, while maintaining a slight Query CTR gain of 0.19%. The additional search traffic further contributes to 20.36% more guided orders and 21.67% higher GMV.

27. 【2606.15277】Guiding Federated Graph Recommendation with LLM-encoded knowledge

Link: https://arxiv.org/abs/2606.15277

Authors: Thi Minh Chau Nguyen, Hien Trang Nguyen, Duc Anh Nguyen, Van Ho-Long, Thanh Trung Huynh, Zhao Ren

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Keywords: Graph-based recommender systems, preserving user privacy, Graph-based recommender, user privacy, extracting collaborative signals

Comments: Technical Report

Abstract:Graph-based recommender systems are highly effective at extracting collaborative signals from user-item interactions, and federated learning (FL) allows these models to be trained while preserving user privacy. However, aggregating graph representations across distributed, non-IID clients remains a challenge; structural embeddings learned locally often misalign, and naive averaging fails to capture meaningful cross-client relationships. Most existing federated graph methods rely exclusively on structural aggregation, neglecting the rich, global semantic context available in large language models (LLMs). In this paper, we propose a novel framework that uses LLM-encoded knowledge to guide federated graph recommendation. Specifically, clients learn structural representations from local graphs while simultaneously summarizing their typical interaction patterns into compact semantic vectors via a frozen LLM. The central server then uses these LLM-encoded semantic signals to discover related preference patterns across clients, guiding the selective aggregation of their structural representations. This enables semantically informed cross-client collaboration without exposing raw data. Extensive experiments on standard benchmarks show that guiding structural alignment with LLM-encoded knowledge consistently improves recommendation accuracy over existing federated graph baselines.
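
One way to picture the server-side step is similarity-gated averaging: each client receives an average of the structural embeddings of clients whose semantic vectors are close to its own. The sketch below is only an illustration under that assumption; the cosine gate, the `threshold` parameter (assumed non-negative), and both function names are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def guided_aggregate(semantic, structural, threshold=0.5):
    """Similarity-weighted average of structural embeddings, per client.

    semantic   -- per-client semantic vectors (e.g. produced by a frozen LLM)
    structural -- per-client structural embeddings, in the same client order
    threshold  -- peers below this cosine similarity are ignored (assumed >= 0)
    """
    aggregated = []
    for i, s_i in enumerate(semantic):
        weights = []
        for j, s_j in enumerate(semantic):
            if j == i:
                weights.append(1.0)          # a client always keeps itself
            else:
                w = cosine(s_i, s_j)
                weights.append(w if w >= threshold else 0.0)
        total = sum(weights)
        dim = len(structural[i])
        aggregated.append(
            [sum(w * structural[j][d] for j, w in enumerate(weights)) / total
             for d in range(dim)]
        )
    return aggregated


# Clients 0 and 1 share a semantic direction; client 2 does not.
print(guided_aggregate([[1, 0], [1, 0], [0, 1]], [[2.0], [4.0], [10.0]]))
```

Only the compact semantic vectors and structural embeddings reach the server in this picture, which matches the abstract's point that raw interaction data is never exposed.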

28. 【2606.15252】Beyond Positive Signals: Unlocking Implicit Negative Behaviors for Enhanced Sequential User Modeling

Link: https://arxiv.org/abs/2606.15252

Authors: Zexuan Cheng, Yue Liu, Jun Zhang, Jie Jiang

Categories: Information Retrieval (cs.IR)

Keywords: modern click-through rate, User behavior sequence, User behavior, click-through rate, central component

Abstract:User behavior sequence modeling has become a central component in modern click-through rate (CTR) prediction. Over the past years, the community has invested substantial effort into improving how sequences are encoded, from target-aware attention and interest evolution networks to unified architectures that jointly process sequential and non-sequential features. However, a more fundamental question remains under-explored: what should constitute the behavior sequence? Current practice constructs sequences exclusively from positive interactions (clicks, purchases, completions), while the far more abundant implicit negative behaviors (skips, low engagement, scroll-past) are largely underutilized. As gains from longer positive sequences approach diminishing returns, we revisit this underutilized data source within the sequential modeling framework. In this paper, we demonstrate that mixed-polarity behavior sequences, which chronologically interleave positive and negative tokens within a fixed length budget, consistently outperform positive-only sequences across diverse model architectures with negligible additional computational overhead. We further identify a semantic indistinguishability problem inherent to naive polarity embeddings and propose Target-Aware Polarity Fusion (TAPF), a lightweight target-conditioned gating mechanism that provides additional gains by differentiating behavioral evidence. Notably, even the simpler polarity bias baseline captures the majority of improvement, underscoring that the primary contribution is the mixed-polarity data paradigm itself. Experiments on three public benchmarks demonstrate consistent improvements of +1.9% to +9.6% relative AUC across five architectures, which validate the practical value of our approach.
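
The mixed-polarity construction described above can be sketched in a few lines: merge positive and implicit-negative events by timestamp, keep the most recent events within the length budget, and tag each with a polarity flag. The event format `(timestamp, item_id)` and the function name are assumptions made for illustration; the paper's TAPF gating is not shown.

```python
def build_mixed_polarity_sequence(positives, negatives, max_len):
    """Merge (timestamp, item_id) events into one chronological sequence.

    Returns the most recent `max_len` events as (item_id, polarity) pairs,
    where polarity is 1 for positive and 0 for implicit-negative behavior.
    """
    tagged = [(ts, item, 1) for ts, item in positives]
    tagged += [(ts, item, 0) for ts, item in negatives]
    tagged.sort(key=lambda event: event[0])      # chronological order
    recent = tagged[-max_len:] if max_len > 0 else []
    return [(item, polarity) for _, item, polarity in recent]


clicks = [(1, "a"), (4, "d")]
skips = [(2, "b"), (3, "c")]
print(build_mixed_polarity_sequence(clicks, skips, max_len=3))
# [('b', 0), ('c', 0), ('d', 1)]
```

Because the budget is shared, adding negatives displaces older positives rather than lengthening the sequence, which is why the abstract reports negligible extra compute.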

29. 【2606.15225】Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call

Link: https://arxiv.org/abs/2606.15225

Authors: Weibo Gao, Qi Liu, Linan Yue, Zheng Zhang, Yichao Du, Fangzhou Yao, Ao Yu, Zhenya Huang, Shijin Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large-scale learner-task interaction, Large-scale learner-task, intelligent educational systems, crucial for intelligent, intelligent educational

Comments: LLM Agent, Educational Data Mining, Data Synthesis, Human Simulation

Abstract:Large-scale learner-task interaction data are crucial for intelligent educational systems but are costly to collect and constrained by privacy and learner engagement. Learner simulators play a critical role in simulating scalable learner behavior without the need for continuous involvement of real learners. However, existing methods are predominantly individual-centric, pairing a simulator with each learner to iteratively infer latent knowledge states from dense interaction histories, which is both data- and computation-intensive, and fragile in cold-start scenarios. We propose a cohort-aware roll-call simulation paradigm that first constructs cohort-level proficiency priors and refines individual learner states through a small number of targeted diagnostic queries. Based on this paradigm, we introduce Edu-Theater, an LLM-powered agent system that performs cohort-aware learner simulation via a teacher agent and retrospective roll-call probing over learner logs. Edu-Theater enables scalable future behavior simulation without the need for dense per-learner histories. Experiments on two real-world datasets demonstrate that Edu-Theater achieves higher simulation accuracy with significantly fewer LLM calls, producing synthetic data that enhances downstream applications such as adaptive testing.

30. 【2606.14958】MVEB: Massive Video Embedding Benchmark

Link: https://arxiv.org/abs/2606.14958

Authors: Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Massive Video Embedding, embeddings spanning classification, Video Embedding Benchmark, video-centric question answering, pair classification

Abstract:We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at this https URL.

31. 【2606.14932】Retrieval-as-a-Service: A System-Oriented Analysis of Industrial Retrieval Pipelines in Web Systems

Link: https://arxiv.org/abs/2606.14932

Authors: Fang Liu, Yuan Yuan, Yifan Dang, Xuncheng Zhang, Cuiqianhe Du

Categories: Information Retrieval (cs.IR)

Keywords: API discovery, foundational infrastructure component, modern Web services, advertising targeting, supporting applications

Abstract:Retrieval systems have become a foundational infrastructure component in modern Web services, supporting applications such as content recommendation, advertising targeting, and API discovery. In large-scale industrial environments, retrieval is increasingly deployed as an independent service layer, commonly referred to as Retrieval-as-a-Service (RaaS). This paper presents a system-oriented survey of industrial retrieval pipelines, focusing on architectural design and deployment trade-offs under real-world constraints. Unlike prior surveys that emphasize algorithmic developments, we analyze retrieval systems from an infrastructure perspective, highlighting how latency requirements, scalability constraints, and resource limitations shape system design in production environments. We introduce a unified RaaS pipeline abstraction that models retrieval as a multi-stage service, including high-efficiency candidate generation, embedding-based semantic matching, and resource-aware re-ranking. We further examine the integration of Large Language Model (LLM)-based retrieval mechanisms and analyze their impact on semantic performance, latency, and computational overhead. The results provide a system-level understanding of retrieval as a service-oriented infrastructure and offer practical guidelines for designing scalable, efficient, and QoS-aware retrieval architectures in large-scale Web systems.

32. 【2606.14821】Co-Scraper: query-aware DOM Pruning and Reusable Scraper Synthesis for Lightweight Web Data Extraction

Link: https://arxiv.org/abs/2606.14821

Authors: Shoupeng Wang, Jiantao Qiu, Wuyang Zhang, Conghui He

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: necessitates automated information, content necessitates automated, similar web pages, web pages offers, automated information extraction

Abstract:The abundant and heterogeneous nature of web content necessitates automated information extraction, and generating scrapers that can be reused across similar web pages offers an effective solution for scalable data extraction. In this work, we propose Co-Scraper, a two-stage framework capable of handling the hierarchical complexity of long HTML documents. By integrating a query-aware DOM pruning mechanism with stable extraction strategy induction, Co-Scraper effectively transforms web content into executable programmatic wrappers using a fine-tuned Qwen3-8B model. On the test set of SWDE, Co-Scraper achieves state-of-the-art performance with an F1 score of 94.78% and a reuse success rate of 90.39%. This framework significantly enhances the accuracy and resilience of data extraction, providing a highly efficient approach for web data acquisition tasks.

33. 【2606.14817】Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

Link: https://arxiv.org/abs/2606.14817

Authors: Sooyeon Kim, Piotr S. Maciąg

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, generating personalized reading, personalized reading content, Retrieval-Augmented Generation

Abstract:This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules (Input, RAG, Generation, and Judging) and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and, in particular, groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

34. 【2606.14770】An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

Link: https://arxiv.org/abs/2606.14770

Authors: Houssam El Mir

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Pedestrian Attribute Recognition, enabling forensic search, Pedestrian Attribute, Attribute Recognition, video surveillance

Abstract:Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching the BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.
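
For readers unfamiliar with the loss being ablated, the following is a minimal sketch of a per-attribute sigmoid focal loss in its commonly used form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), with the alpha=0.50 and gamma=2.0 values quoted above as defaults. The function name and toy inputs are illustrative assumptions, and the paper's exact formulation may differ.

```python
import math


def multilabel_focal_loss(logits, targets, alpha=0.50, gamma=2.0):
    """Mean sigmoid focal loss over the attributes of one sample.

    logits  -- raw scores, one per attribute
    targets -- 0/1 labels, one per attribute
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))             # sigmoid probability of "present"
        p_t = p if y == 1 else 1.0 - p              # probability given to the true label
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
        # (1 - p_t)^gamma shrinks the loss of already well-classified attributes.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(logits)


# A confident correct prediction contributes far less than a confident mistake.
print(multilabel_focal_loss([4.0], [1]), multilabel_focal_loss([-4.0], [1]))
```

With gamma set to 0 and alpha to 0.5 the expression reduces to half of ordinary binary cross-entropy, which makes the two hyperparameters easy to ablate against a BCE baseline.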

35. 【2512.10104】Phishing Email Detection Using Large Language Models

Link: https://arxiv.org/abs/2512.10104

Authors: Najmul Hasan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large Language Models, globally consequential vectors, deploy Large Language, phishing email, cyber intrusion

Comments: 7 pages

Abstract:Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Model (LLM) applications, these systems face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (GPT-4o, Claude Sonnet 4, and Grok-3) and a comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect phishing emails with over 90% accuracy, while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attacks, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

Computer Vision

1. 【2606.17053】Context-Aware RL for Agentic and Multimodal LLMs

Link: https://arxiv.org/abs/2606.17053

Authors: Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, Large language, answering requires identifying, requires identifying, identifying a small

Comments: 29 pages, 9 figures

Abstract:Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.

2. 【2606.17049】BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering

Link: https://arxiv.org/abs/2606.17049

Authors: Yi-Ruei Liu, Jie-Ying Lee, Zheng-Hui Huang, Yu-Lun Liu, Chih-Hao Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables numerous applications, including content creation, captured videos enables, videos enables numerous, autonomous driving simulation

Comments: Project page: [this https URL](https://shigon255.github.io/brdfusion-page/)

Abstract:Inverse rendering of urban scenes from captured videos enables numerous applications, including content creation and autonomous driving simulation. Physically-based rendering methods follow and control lighting physics, but suffer from reconstruction and rendering artifacts. While generative models produce realistic videos, they offer limited consistency and controllability. We present BRDFusion, a unified framework that combines two complementary models for inverse and forward rendering. Specifically, BRDFusion recovers explicit, consistent scene properties with physical modeling and alleviates optimization ambiguity with generative priors. During forward rendering, the physical model provides controllable rendering from the scene configuration, and the generative model denoises and fixes artifacts. Therefore, our method produces high-quality videos while allowing precise control, outperforming baselines in real and synthetic scenes. Moreover, BRDFusion supports novel-view relighting, night simulation, and dynamic object insertion/editing. Project page: this https URL

3. 【2606.17048】Exact Posterior Score Estimation for Solving Linear Inverse Problems

Link: https://arxiv.org/abs/2606.17048

Authors: Abbas Mammadov, Ozgur Kara, Kaan Oktay, Iskander Azangulov, Adil Kaan Akan, Hyungjin Chung, James Matthew Rehg, Yee Whye Teh

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: reverse Gaussian corruption, learn powerful data, powerful data priors, Diffusion and flow-based, flow-based models learn

Abstract:Diffusion and flow-based models learn powerful data priors by training a denoiser to reverse Gaussian corruption. To use this prior to solve a linear inverse problem, one needs to sample from the posterior, but the score that the prior provides is the unconditional score, not the posterior score. Existing methods either steer a fixed pretrained denoiser with approximate measurement-matching corrections, or train a conditional restoration model that abandons the denoising structure of the prior. We derive the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, and show that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot under an anisotropic noise covariance. We turn this identity into Exact Posterior Score (EPS), a denoising training objective that preserves the input/output structure of standard pretraining and can therefore be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the underlying backbone, with no likelihood gradients or projections. We evaluate EPS on five linear inverse problems across FFHQ and ImageNet, where it outperforms training-free and training-based baselines on fidelity, perceptual, and distributional metrics, while using roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.

4. 【2606.17046】Geometric Action Model for Robot Policy Learning

Link: https://arxiv.org/abs/2606.17046

Authors: Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Generalist robot policies, follow user instructions, Generalist robot, robot actions interact, robot policies

Comments: Project page: [this https URL](https://cvlab-kaist.github.io/Geometric-Action-Model/)

Abstract:Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

5. 【2606.17040】R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

Link: https://arxiv.org/abs/2606.17040

Authors: Xiuwei Xu, Haowen Sun, Angyuan Ma, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: typically requires scaling, diverse object poses, robot configurations, camera viewpoints, requires scaling demonstrations

Comments: Project page: [this https URL](https://r2rdreamer.github.io/)

Abstract:Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.

6. 【2606.17037】The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

Link: https://arxiv.org/abs/2606.17037

Authors: Alper Yıldırım

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Oppenheim and Lim, natural images stay, images stay recognizable, Fourier phase, showed that natural

Abstract:Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture-shape gap between CNNs and attention models.
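
The phase transplant at the heart of this test is easy to state in code: take the Fourier magnitude of one signal, the Fourier phase of another, and invert. The sketch below does this for short real-valued 1-D signals with a naive DFT from the standard library; it illustrates the classic input-space swap only, not the paper's hidden-layer intervention, and the function names are illustrative assumptions.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def transplant_phase(magnitude_donor, phase_donor):
    """Combine |F(magnitude_donor)| with the phase of F(phase_donor)."""
    mag_spec = dft(magnitude_donor)
    phase_spec = dft(phase_donor)
    hybrid = [
        cmath.rect(abs(a), cmath.phase(b)) for a, b in zip(mag_spec, phase_spec)
    ]
    # For real inputs whose spectra have no vanishing bins, the hybrid spectrum
    # is conjugate-symmetric, so the imaginary part here is numerical noise.
    return [value.real for value in idft(hybrid)]


signal_a = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 0.25, -2.0]
signal_b = [0.0, 1.0, 0.0, 0.0, 2.0, -1.0, 0.5, 0.3]
print(transplant_phase(signal_a, signal_b))
```

Transplanting a signal's phase onto its own magnitude returns the original signal, which is a convenient sanity check before comparing the hybrid against either donor.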

7. 【2606.17030】Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Link: https://arxiv.org/abs/2606.17030

Authors: Jie Zhang,Xiaoyue Chen,Anzhe Chen,Chenxu Lv,Deqing Li,Gengze Zhou,Hang Yin,Haoqi Yuan,Haoyang Li,Jiahao Li,Jiazhao Zhang,Jingren Zhou,Kaiyuan Gao,Kun Yan,Lihan Jiang,Ningyuan Tang,Pei Lin,Qihang Peng,Shengming Yin,Tianhe Wu,Tianyi Yan,Xiao Xu,Yan Shu,Yanran Zhang,Ye Wang,Yi Wang,Yilei Chen,Yixian Xu,Yiyang Huang,Yuxiang Chen,Zekai Zhang,Zhendong Wang,Zhixing Lei,Zhixuan Liang,Zihao Liu,Zikai Zhou,Xiong-Hui Chen,Chenfei Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: language-conditioned video world, video world model, introduce Qwen-RobotWorld, language-conditioned video, Embodied World Knowledge

Abstract:We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

8. 【2606.17027】MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

Link: https://arxiv.org/abs/2606.17027

Authors: Jianqi Chen,Jiraphon Yenphraphai,Xiangjun Tang,Sergey Tulyakov,Chaoyang Wang,Peter Wonka,Rameen Abdal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: directly reconstructs vertex, reconstructs vertex deformations, directly reconstructs, reconstructs vertex, feed-forward registration network

Comments: Project page: [this https URL](https://meshloom.github.io/)

Abstract:We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder-decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh's topology into its per-vertex features. This representation strengthens the network's understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing. Project page: this https URL.

9. 【2606.17020】FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Link: https://arxiv.org/abs/2606.17020

Authors: Jiaju Han,Ben Zhang,Xuemeng Sun,Qike Zhang,Yuxian Dong,Chengyin Hu,Fengyu Zhang,Yiwei Wei,Jiujiang Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: advanced Earth observation, existing work remains, work remains centered, Earth observation understanding, infrared data underexplored

Abstract:Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

10. 【2606.16996】ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

Link: https://arxiv.org/abs/2606.16996

Authors: Tran Dinh Tien,Zhiqiang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Segment Anything Model, small active subset, strong frozen backbone, entire dataset vocabulary, full-resolution decoding

Comments: Preprint. Code is available at [this https URL](https://github.com/VILA-Lab/ActiveSAM)

Abstract:Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at this https URL.
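
The active-set idea in this abstract (preview cheaply, keep only the classes that look present, then decode those in buckets) can be sketched as below. This is a toy under stated assumptions, not ActiveSAM's implementation: the presence scores are taken as given, and the threshold, the `min_keep` fallback, the bucket size, and both function names are hypothetical.

```python
def select_active_classes(presence_scores, threshold=0.5, min_keep=1):
    """Keep classes whose preview presence score clears the threshold, highest first.

    presence_scores maps class name -> score in [0, 1] from a cheap preview pass
    (assumed input). Falls back to the top `min_keep` classes if nothing clears
    the threshold.
    """
    ranked = sorted(presence_scores.items(), key=lambda item: item[1], reverse=True)
    active = [name for name, score in ranked if score >= threshold]
    if len(active) < min_keep:
        active = [name for name, _ in ranked[:min_keep]]
    return active


def bucket_prompts(class_names, bucket_size):
    """Split the retained classes into fixed-size buckets for batched decoding."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [
        class_names[i:i + bucket_size]
        for i in range(0, len(class_names), bucket_size)
    ]


if __name__ == "__main__":
    scores = {"road": 0.9, "car": 0.7, "boat": 0.1, "sky": 0.6, "cow": 0.05}
    active = select_active_classes(scores, threshold=0.5)
    print(active)
    print(bucket_prompts(active, bucket_size=2))
```

The saving comes from the second stage: full-resolution decoding runs over the retained classes only, rather than over the whole dataset vocabulary.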

11. 【2606.16993】DreamX-World 1.0: A General-Purpose Interactive World Model

Link: https://arxiv.org/abs/2606.16993

Authors: DreamX Team,Yancheng Bai,Rui Chen,Xiangxiang Chu,Rujing Dang,Hao Dou,Bingjie Gao,Qiwen Gu,Siyu Hong,Jiachen Lei,Geng Li,Jifan Li,Ruimin Lin,Qingfeng Shi,Bingze Song,Lei Sun,Jing Tang,Ruitian Tian,Jun Wang,Jiahong Wu,Pengfei Zhang,Shen Zhang,Jiashu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: general-purpose interactive text, controllable long-horizon generation, interactive text, general-purpose interactive, Unreal Engine rendering

Comments: Project page: [this https URL](https://amap-ml.github.io/DreamX_World) , Code: [this https URL](https://github.com/AMAP-ML/DreamX-World)

Abstract:DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.

12. 【2606.16991】A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT

Link: https://arxiv.org/abs/2606.16991

Authors: Mariam Elbakry,Aliaa Sayed Sheha,Salma Hassan Tantawy,Aya Yassin,Concetto Spampinato,Karim Lekadir,Xiaomeng Li,Marawan Elbatel

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: escalates acquisition burden, carries inherent risks, abdominal lesion characterization, Multiphasic contrast-enhanced, lesion characterization

Comments: Early Accept (top ~9%), MICCAI 2026

Abstract:Multiphasic contrast-enhanced CT (CECT) is widely used for abdominal lesion characterization, yet it carries inherent risks of contrast-induced nephropathy, escalates acquisition burden, and heavily contributes to radiologist workload. To address these challenges, we introduce a novel multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation, which learns to synthesize contrast-enhanced findings from single-phase non-contrast CT (NCCT). To support this, we curated a large-scale dataset of paired NCCT-CECT studies and their corresponding contrast-enhanced radiology reports from two centers, partitioned into internal sets and an external validation cohort. Under a unified evaluation protocol, we benchmarked five contemporary deep learning architectures encompassing chest-specific, abdomen-specific, and general-purpose multimodal domains. Extensive experiments demonstrate that NCCT retains diagnostic signals, achieving an average multi-organ AUC of 69.1% on the internal cohort and 63.1% on the external cohort, respectively. By releasing this dataset and standardized benchmark publicly, this study aims to catalyze future research into safer, resource-efficient, and globally accessible contrast-free abdominal imaging workflows. Code is available at: this https URL.

13. 【2606.16960】SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

Link: https://arxiv.org/abs/2606.16960

Authors: Shuai Yuan,Runxi Tang,Yuzhou Ji,Fudong Ge,Hanshi Wang,Yifei Wang,Xianming Zeng,Jianyun Xu,Xingliang Liu,Yanfeng Wang,Zhipeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: turn requires reliable, requires reliable multi-camera, Modern autonomous driving, understanding for perception, Modern autonomous

Abstract:Modern autonomous driving depends on accurate metric 3D understanding for perception, reconstruction, and planning, which in turn requires reliable multi-camera depth prediction. However, the outward-facing nature of vehicle-mounted surround-view camera rigs inherently limits visual overlap across views, challenging the correspondence-based assumptions that underpin conventional multi-view geometry. To bridge this gap, we present SurroundNEXO, named after the Spanish word nexo for a geometric link, a low-overlap multi-camera metric depth framework that grounds cross-view reasoning in ego-centric geometry rather than dense visual correspondences. Instead of directly enforcing early global fusion, SurroundNEXO first assigns image tokens globally comparable ego-frame viewing directions through Ego-Ray Positional Encoding, then uses sparse LiDAR measurements as metric anchors to propagate absolute scale cues, and finally expands feature interaction progressively from view-local modeling to decomposed spatio-temporal reasoning and global integration. This design enables metric-scale depth prediction with improved spatial consistency across weakly overlapping cameras. Across low-overlap autonomous driving benchmarks, including NuScenes, Waymo and DDAD, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and enhances metric reconstruction quality by 25.6% compared with SOTA methods. It further remains robust under extremely sparse depth prompts and exhibits strong zero-shot generalization to unseen camera layouts.

14. 【2606.16951】Simulation-Based Multi-Fillet Evaluation of Woody Breast Poultry Fillets

Link: https://arxiv.org/abs/2606.16951

Authors: Chirantan Sen Mukherjee,Seung-Chul Yoon,William J. Beksi

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: significant economic losses, modern broiler chickens, decreased meat quality, Woody breast, breast muscle

Comments: To be published in the 2026 International Conference on Automation Science and Engineering (CASE)

Abstract:Woody breast (WB) is a myopathy in modern broiler chickens that causes the breast muscle to become unusually stiff and fibrous, leading to decreased meat quality and significant economic losses. State-of-the-art automated WB detection relies on a side-view imaging system to analyze the bending behavior of a single fillet as it falls off a conveyor belt. While highly accurate, this approach is constrained by its single-fillet field of view, creating throughput bottlenecks on commercial processing lines. In this paper, we address this limitation via a novel multi-fillet detection architecture utilizing a top-down camera configuration. To validate our approach, we first develop a high-fidelity digital twin of an industrial conveyor system. Next, we synthesize a diverse dataset of 3D fillet meshes and model their viscoelastic bending dynamics using a physics-based simulation engine. Lastly, a continuous 2D shape deformation score is extracted from the top-down perspective as the simulated fillets traverse the roller precipice. Experimental results demonstrate that the top-down shape score effectively captures the contour changes of the fillets as they bend, providing a robust and scalable alternative to a side-view imaging system for simultaneous multi-fillet WB evaluation.

15. 【2606.16898】Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

Link: https://arxiv.org/abs/2606.16898

Authors: Dongbin Na,Chanwoo Kim,Giyun Choi,Dooyoung Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Detecting unanswerable user, queries remains essential, real-world embodied agents, remains essential, reliable deployment

Comments: 18 pages, 3 figures. Code and data: [this https URL](https://github.com/ndb796/SemanticFlip) ; project page: [this https URL](https://ndb796.github.io/SemanticFlip)

Abstract:Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence poses various task-dependent risks. The agent may provide misleading information to the user in Embodied Question Answering, or select an arbitrary coordinate and physically guide the user there in spatial reasoning for navigation. Despite these high stakes, only a few prior studies directly address when and how an embodied VLM should respond with "I do not know." This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip achieves an $F_1$ score of 0.9559. The source code and datasets are publicly available at this https URL.

16. 【2606.16870】Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

Link: https://arxiv.org/abs/2606.16870

Authors: Adrian Ramlal,Yuhao Chen,John S. Zelek

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Realistic visual simulation, requires accurate material, Realistic visual, manipulation requires accurate, accurate material parameters

Comments: Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 MetaFood Workshop

Abstract:Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Since different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10ms). Operating in a normalizing flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves 0.642 actual recovery when validated through the simulator, outperforming the original parameter space by 23%. A warm-start extension that initializes CMA-ES refinement from the policy's output further improves recovery to 0.828 with 540 evaluations. These findings provide a practical framework for inverse food physics and lay groundwork for vision-driven material identification from video observations of food manipulation.

17. 【2606.16868】Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

Link: https://arxiv.org/abs/2606.16868

Authors: Markus Bujotzek,Dimitrios Bounias,Stefan Denner,Ralf Floca,Maximilian Fischer,Peter Neher,Klaus Maier-Hein

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: centralizing sensitive data, cross-site label imperfections, enables collaborative medical, medical image segmentation, enables collaborative

Abstract:While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at this https URL.

18. 【2606.16866】Redirecting the Flow: Image Customization through Attention Distribution Shift

Link: https://arxiv.org/abs/2606.16866

Authors: Jie Li,Suorong Yang,Jian Zhao,Furao Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Subject-driven image customization, follow textual instructions, Subject-driven image, image customization aims, Conditional Attention Distribution

Abstract:Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.

19. 【2606.16861】An Open-Source Monitoring Framework for Data Exploration and Progress Tracking in Multi-Center Radiology Studies

Link: https://arxiv.org/abs/2606.16861

Authors: Markus Bujotzek,Jonas Scherer,Stefan Denner,Peter Neher,Benjamin Hamm,Lorenz Feineis,Uenal Akuenal,Andreas Bucher,Tobias Penzkofer,Klaus Maier-Hein

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial for advancing, study progress monitoring, study progress, Multi-center studies, monitoring

Abstract:Multi-center studies are crucial for advancing medical and radiological research. Data exploration, collaboration discovery, and study progress monitoring are essential for maximizing their potential. However, in practice these processes often rely on manual communication and shared tables, which quickly become outdated and hinder efficient coordination in large distributed studies. This highlights the need for dedicated monitoring solutions that provide transparent and up-to-date insights into study progress. We propose a lightweight, open-source monitoring architecture for multi-center studies based on the widely used Grafana-Prometheus stack. The framework collects aggregated monitoring metrics from distributed study sites and visualizes them through configurable dashboards. As a real-world deployment example, the framework is integrated into the medical imaging platform Kaapana and evaluated within a large multi-center research network. By deploying our solution within the Germany-wide RACOON consortium, we demonstrate its ability to enable privacy-preserving data exploration and study progress monitoring across all 38 German university clinics. The monitoring framework supports transparent coordination of distributed research activities and can facilitate more efficient management of large-scale multi-center studies. The source code and Kaapana integration are publicly available at this https URL.
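
To make the "aggregated monitoring metrics" step concrete, here is a minimal sketch of how per-site counts could be rendered in the Prometheus text exposition format so that a Prometheus server can scrape them and Grafana can chart them. It uses only the Python standard library; the metric name `study_cases_available`, the `site` label, and the sample values are hypothetical and not taken from the framework or from Kaapana.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric with a `site` label per sample.

    samples maps site name -> aggregated numeric value (hypothetical data).
    Only aggregates are exposed here, never per-patient records.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for site, value in sorted(samples.items()):
        lines.append(f'{name}{{site="{escape_label(site)}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauge(
        "study_cases_available",
        "Number of cases available per site.",
        {"site_a": 120, "site_b": 7},
    ))
```

Exposing only aggregates like these is one plausible way to keep monitoring compatible with the privacy-preserving goal the abstract describes.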

20. 【2606.16837】Robust Spoofed Speech Detection via Temporal Pyramid Modeling

Link: https://arxiv.org/abs/2606.16837

Authors: Mahtab Masoudi Nezhad,Nima Karimian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: Spoofed speech detection, cross-dataset generalization remaining, Spoofed speech, Temporal Pyramid Adapter, Temporal Pyramid

Abstract:Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. In this work, we propose a Temporal Pyramid Adapter that uses parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrate self-supervised XLS-R representations with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that the Temporal Pyramid model obtains an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several SOTA baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.
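
Since the headline numbers here are equal error rates (EER), a short sketch of how that metric is typically computed may help. It assumes higher scores mean "more likely the positive class" and labels are 1 (positive) or 0 (negative); it sweeps the observed scores as thresholds and reports the point where false-acceptance and false-rejection rates are closest. This is a generic illustration, not the evaluation code used in the paper, and `equal_error_rate` is an illustrative name.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted as positive when score >= threshold.
    FAR = share of negatives accepted, FRR = share of positives rejected.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative trial")

    best_gap, best_eer = None, None
    # The extra +inf threshold covers the "reject everything" operating point.
    for threshold in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= threshold for s in negatives) / len(negatives)
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    scores = [0.9, 0.6, 0.4, 0.8, 0.3, 0.1]
    labels = [1, 1, 0, 0, 1, 0]
    print(equal_error_rate(scores, labels))
```

With few trials the two error curves rarely cross exactly, which is why the sketch averages FAR and FRR at the closest threshold; evaluation toolkits often interpolate instead.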

21. 【2606.16799】Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

Link: https://arxiv.org/abs/2606.16799

Authors: Zijie Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Existing vision-language model, based AI-generated image, semantic-distortion dimensional conflict, monolithic representations optimized, low-level perceptual sensitivity

Comments: 11 pages, 2 figures. Accepted by ICME 2026 (spotlight)

Abstract:Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at this https URL.
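
The improvements above are reported in SRCC (Spearman rank correlation coefficient), which measures how well predicted quality scores preserve the ordering of human ratings. A minimal standard-library sketch follows, computing Pearson correlation on average ranks so that ties are handled; the function names are illustrative and the sample values are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one average rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted, target):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(target))


if __name__ == "__main__":
    predicted = [0.1, 0.4, 0.35, 0.8]   # hypothetical model scores
    human = [1.0, 3.0, 2.0, 4.0]        # hypothetical mean opinion scores
    print(srcc(predicted, human))
```

Because only ranks matter, SRCC is unchanged by any monotonic rescaling of the predicted scores, which is why it is a common choice for quality-assessment benchmarks.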

22. 【2606.16795】WaveDINO: Learning-Based Atmospheric Correction of Unwrapped InSAR Interferograms Validated by GNSS: Results at Laguna del Maule and Campi Flegrei Volcanoes

Link: https://arxiv.org/abs/2606.16795

Authors: Robert Popescu,Juliet Biggs,Tianyuan Zhu,Nantheera Anantrasirichai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Interferometric Synthetic Aperture, Synthetic Aperture Radar, Aperture Radar, enables effective monitoring, atmospheric phase delays

Comments: 11 pages, 6 figures

Abstract:Interferometric Synthetic Aperture Radar (InSAR) enables effective monitoring of volcanic deformation; however, the observed signals are often corrupted by atmospheric phase delays, seasonal surface changes, and decorrelation effects. Existing atmospheric correction methods, such as numerical weather model-based methods, can reduce these effects but do not consistently remove atmospheric artefacts and may introduce residual biases. To address these limitations, we propose a novel learning-based method for denoising unwrapped InSAR interferograms, using a hybrid training strategy that combines physically motivated synthetic deformation with real atmospheric noise. Specifically, we introduce WaveDINO, a wavelet-based multi-scale denoising framework conditioned on frozen DINOv3 foundation-model features and terrain information. Training uses synthetic magma-source deformation superimposed on short-term interferograms to expose the network to realistic atmospheric statistics while retaining known ground truth. Performance is evaluated on both controlled synthetic data and long-term real interferograms from Laguna del Maule (Chile) and Campi Flegrei (Italy), with independent GNSS measurements used for validation. WaveDINO consistently outperforms competing models, improving agreement with GNSS measurements, and reducing mean GNSS misfit by approximately 3% and 19% at two sites, respectively, while surpassing weather-model-based corrections.

23. 【2606.16794】LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

Link: https://arxiv.org/abs/2606.16794

Authors: Gyuyeon Na

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: domain-specific LLM-based Visual, facial skin disease, skin disease diagnosis, LLM-based Visual Explanation, disease diagnosis models

Comments: none

Abstract:This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models' decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.

24. 【2606.16783】Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Link: https://arxiv.org/abs/2606.16783

Authors: Zhiqiang Zhou, Junliang Dai, Xu ling

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, Multimodal large language, rely on text-based, large language, lacking interpretable visual

Comments: 12 pages, 5 figures

Abstract:Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework using expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects reasoning depth. Evaluations show Gen-VCoT improves spatial (25% better) and depth (50% better) questions, but may hurt simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing task-dependent optimal representations. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.

25. 【2606.16767】Text-Vision Co-Instructed Image Editing

Link: https://arxiv.org/abs/2606.16767

Authors: Chenxi Xie, Yuhui Wu, Qiaosi Yi, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing image editing, Existing image, image editing methods, generally categorized, image editing

Comments: none

Abstract:Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.

26. 【2606.16756】3D Classification of Paramagnetic Rim Lesions in Multiple Sclerosis via Asymmetric QSM-FLAIR Modeling

Link: https://arxiv.org/abs/2606.16756

Authors: Veronica Pignedoli, Giacomo Boffa, Nicoletta Noceti, Matilde Inglese, Francesca Odone, Matteo Moro

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Sclerosis, long-term disability progression, Paramagnetic rim lesions, inflammation in Multiple, Paramagnetic rim

Comments: 10 pages, 3 figures, accepted at MICCAI 2026. Github link: [this https URL](https://github.com/veronicapignedoli/FRODO)

Abstract:Paramagnetic rim lesions (Rim$^+$) identified on susceptibility-sensitive MRI have recently emerged as a specific biomarker of chronic active inflammation in Multiple Sclerosis (MS) and are associated with long-term disability progression. However, susceptibility imaging and expert interpretation remain limited to specialized centers, visual assessment is time-consuming and variable, and the low prevalence of Rim$^+$ lesions poses severe class imbalance challenges for automated analysis. We propose a 3D multimodal deep learning framework for lesion-level Rim$^+$/Rim$^-$ classification from Quantitative Susceptibility Mapping (QSM) and FLAIR MRI. The architecture explicitly models modality asymmetry by treating QSM as the primary susceptibility-driven signal and conditioning it with FLAIR-derived structural context. To improve robustness under limited data, we employ self-supervised multimodal pretraining followed by supervised fine-tuning with contrastive regularization. The method was evaluated on a clinically acquired cohort of 88 people with MS with expert lesion annotations as reference standard. Results highlight improved performance compared to prior architectures, supporting the effectiveness of asymmetric multimodal modeling for automated chronic active lesion identification.

27. 【2606.16749】Structure-aware Knowledge-guided Heterogeneous Mamba for Zygomaticomaxillary Suture Assessment

Link: https://arxiv.org/abs/2606.16749

Authors: Xiaoqi Guo, Birui Chen, Xinquan Yang, Chaoyun Zhang, Xuefen Liu, Mianjie Zheng, Kun Tang, Xuguang Li, Wen Ma, Yanhua Xu, Linlin Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: key circummaxillary structure, status directly influences, maturation status directly, Zygomaticomaxillary Suture, maxillary advancement

Comments: none

Abstract:The Zygomaticomaxillary Suture is a key circummaxillary structure that connects the zygomatic bone and the maxilla, which serves as a primary site of resistance during maxillary advancement, and its maturation status directly influences the timing and efficacy of orthopedic interventions. However, accurate staging of ZMS maturation remains challenging due to subtle high-frequency transitions in suture lines and the global semantic ambiguity between adjacent stages. To address this, we present the first public ZMS dataset, comprising 3,790 ZMS images covering the entire age range from 4 to 24 years. Based on this dataset, we propose SKMamba, a Structure-aware and Knowledge-guided Mamba-based multi-modal framework for automated ZMS maturation assessment. SKMamba adopts a decoupled dual-path architecture that mimics the hierarchical diagnostic process used by experienced orthodontists. We first introduce an Implicit Edge Extractor (IEE), which leverages structural pre-training to reduce trabecular noise and accentuate sutural boundaries. Complementarily, a Cross-Modal Semantic Alignment (CSA) module is designed to incorporate anatomical descriptions from a large language model (LLM). This module helps align local morphological cues with global semantic descriptions while ensuring that objective morphological evidence remains the primary basis for decisions. Extensive experiments on our ZMS dataset demonstrate that SKMamba achieves state-of-the-art performance compared to existing methods. Code is available at this https URL.

28. 【2606.16742】Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection

Link: https://arxiv.org/abs/2606.16742

Authors: Renxi Cheng, Jie Gui, Hongsong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: rapid advancement, video generation models, AI-generated video detection, videos, AI-generated

Comments: 13 pages, 5 figures

Abstract:With the rapid advancement of video generation models, distinguishing between AI-generated and authentic videos has emerged as a challenging endeavor. The majority of existing research endeavors concentrate on the development of detectors for identifying samples generated by generative adversarial networks. Nevertheless, the detection of AI-generated videos, particularly those produced by text-to-video models, still remains an uncharted territory. Although state-of-the-art text-to-video models can generate realistic visual content similar to real videos, they fall short of generating the details of the images and the changes in details within the videos. Inspired by this, we address AI-generated video detection from a novel perspective of bit-planes, which can effectively describe the details or noises in images or videos. To this end, we propose a simple yet effective approach called Noise Amplification. This approach first extracts noise signals based on bit-planes, then amplifies these noise signals, and finally feeds them into the discriminator networks for video fake classification. Noise amplification is comprehensively constructed by incorporating three aspects: pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation. To evaluate methods of AI-generated video detection in challenging scenarios, we also introduce a benchmark named HardGVD. Extensive experiments on both the large-scale dataset GenVidBench and HardGVD show that our simple approach significantly outperforms state-of-the-art methods.
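
To make the bit-plane idea in this abstract concrete, here is a minimal sketch of bit-plane extraction and a naive intensity amplification for one 8-bit grayscale frame. It is an illustration only, not the authors' Noise Amplification pipeline: the function names, the choice of the two lowest bits, and the linear stretch are all assumptions, and the region-level and frame-level steps the abstract mentions are not shown.

```python
def bit_plane(frame, k):
    """Return bit plane k (0 = least significant) of an 8-bit grayscale frame."""
    return [[(px >> k) & 1 for px in row] for row in frame]


def amplified_low_bits(frame, n_low=2):
    """Keep only the n_low least significant bits and stretch them to 0..255.

    Hypothetical stand-in for a pixel-level intensity enhancement step.
    """
    mask = (1 << n_low) - 1      # e.g. 0b11 when n_low == 2
    scale = 255 // mask          # map the largest masked value to 255
    return [[(px & mask) * scale for px in row] for row in frame]


# Tiny 2x2 "frame" with 8-bit intensities.
frame = [[200, 201], [54, 255]]
lsb_plane = bit_plane(frame, 0)
noise_map = amplified_low_bits(frame, n_low=2)
```

The low-order planes carry fine detail and sensor-like noise, so stretching them makes differences that are invisible at full dynamic range easy for a downstream classifier to see.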

29. 【2606.16690】PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

Link: https://arxiv.org/abs/2606.16690

Authors: Yanan Zhou, Ranpeng Qiu, Yincong Chen, Jiajie Cui, Weiming Zhi

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning-based manipulation policies, made substantial progress, short-horizon action generation, Learning-based manipulation, policies have made

Comments: none

Abstract:Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot's own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: this https URL.

30. 【2606.16673】MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Link: https://arxiv.org/abs/2606.16673

Authors: Yagmur Akarken, Orest Kupyn, Christian Rupprecht

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remarkable generative capabilities, demonstrated remarkable generative, rich perceptual representations, perceptual representations computed, content is rendered

Comments: none

Abstract:Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.

31. 【2606.16672】Sinkhorn-CPD: Robust point cloud registration via unbalanced entropic optimal transport

Link: https://arxiv.org/abs/2606.16672

Authors: Jin Zhang, Mingyang Zhao, Bing Liu, Xin Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Coherent Point Drift, rigid point cloud, Coherent Point, Point Drift, point cloud registration

Comments: 14 pages, 10 figures; journal version published in Computer-Aided Design

Abstract:Coherent Point Drift (CPD) is widely used for rigid point cloud registration because of its soft correspondences and closed-form parameter updates. However, CPD's target-side marginal constraint forces every observation, including outliers, to receive exactly unit probability mass. This assumption degrades registration accuracy under heavy outliers and partial overlap. Optimal transport (OT) methods can handle missing mass through unbalanced formulations, but require hand-tuned annealing schedules. In this paper, we propose Sinkhorn-CPD, which replaces CPD's target-side marginal constraint with dual Kullback-Leibler penalties, allowing the algorithm to discard outliers on both sides. The resulting formulation is a fully unbalanced entropic optimal transport problem, which can be efficiently solved by generalized Sinkhorn iterations. Moreover, Sinkhorn-CPD preserves the closed-form Procrustes and variance updates of CPD. In our method, the variance sigma^2 plays the role of the entropic regularization parameter, which induces an automatic annealing schedule from diffuse to sharp correspondences without manual temperature tuning. Experiments on synthetic, cross-category, and scan-to-CAD benchmarks show that Sinkhorn-CPD achieves state-of-the-art accuracy, with strong robustness to outliers and partial overlap.
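
The "generalized Sinkhorn iterations" mentioned in this abstract can be sketched with the standard scaling updates for entropic optimal transport with KL penalties on both marginals. This is a generic textbook form under stated assumptions, not the Sinkhorn-CPD algorithm itself: `eps` is held fixed here (the abstract says the variance sigma^2 plays that role and anneals automatically), `rho` is a hypothetical penalty weight, and the Procrustes and variance updates are omitted.

```python
import math


def unbalanced_sinkhorn(cost, a, b, eps=0.5, rho=1.0, iters=200):
    """Scaling iterations for entropic OT with KL penalties on both marginals.

    cost: n x m list of pairwise costs; a, b: source and target masses.
    Returns the n x m transport plan P = diag(u) K diag(v).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    power = rho / (rho + eps)  # power == 1 would recover balanced Sinkhorn
    for _ in range(iters):
        for i in range(n):
            u[i] = (a[i] / sum(K[i][j] * v[j] for j in range(m))) ** power
        for j in range(m):
            v[j] = (b[j] / sum(K[i][j] * u[i] for i in range(n))) ** power
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# 1-D toy example: two source points, three targets, the last one an outlier.
source = [0.0, 1.0]
target = [0.0, 1.0, 10.0]
cost = [[(s - t) ** 2 for t in target] for s in source]
plan = unbalanced_sinkhorn(cost, [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3])
column_mass = [sum(plan[i][j] for i in range(len(source))) for j in range(len(target))]
```

Because the marginals are only penalized rather than enforced, the far-away target ends up with almost no transported mass, which is the behaviour the abstract describes as discarding outliers.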

32. 【2606.16667】Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

Link: https://arxiv.org/abs/2606.16667

Authors: Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, assert visual details, Large vision-language, vision-language models, hallucination rate

Comments: none

Abstract:Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when the claim is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee is bought at a brutal price: to keep the hallucination rate below $5\%$ on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than $80\%$ of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, acquisition that is plugged naively into a calibrated filter breaks the statistical guarantee (realized risk overshoots the target by up to $17$ points) because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores restores the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open VLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.

33. 【2606.16658】Vision-Language Models as Zero-Annotation Oracles in Histopathology

Link: https://arxiv.org/abs/2606.16658

Authors: Vishal Jain, Giorgio Buzzanca, Sarah Cechnicka, Maarten Naesens, Priyanka Koshy, Tri Nguyen, Jesper Kers, Candice Roufosse, Bernhard Kainz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Elastica van Gieson, van Gieson, silver or Elastica, Elastica van, existing methods rely

Comments: 11 pages, 1 figure, 6 tables. Code available at [this https URL](https://github.com/VishalJ99/vlm-wsi-auto-context)

Abstract:Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson. We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 +/- 0.027 on Jones, 0.853 +/- 0.041 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution HE. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset (Dice 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (kappa=0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that are as performant as the teacher model while running for a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.

34. 【2606.16638】MVM-IOD: An Industrial Object-Centric Benchmark Dataset for the Evaluation of 3D Reconstruction Methods

Link: https://arxiv.org/abs/2606.16638

Authors: Robert Langendörfer, Markus Hillemann, Markus Ulrich

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Machine Vision Metrology, industrial, errors are costly, computation time, typical industrial objects

Comments: none

Abstract:3D object reconstruction and camera pose estimation in industrial applications are challenging tasks, as errors are costly while the computation time is often limited. The complexity of typical industrial objects further complicates these tasks. Most of the existing datasets in this context do not depict realistic industrial scenarios. Therefore, we introduce the Machine Vision Metrology Industrial Object Dataset (MVM-IOD). Images of typical industrial objects are captured systematically by moving a camera, mounted at the end effector of an industrial robot arm, on a hemisphere around the objects. MVM-IOD contains reference camera poses, reference 3D point clouds, and the acquired RGB images of 9 objects and 2 background choices, resulting in 18 scenes, which allows evaluation of all image-based methods that compute a 3D reconstruction, camera poses, or novel views of a scene. Based on MVM-IOD, we extensively evaluate current SOTA 3D reconstruction and camera pose estimation methods, such as Structure from Motion, Multi-View Stereo, recent feed-forward methods (Visual Geometry Grounded Transformer, $\pi^3$), and 2D Gaussian Splatting, and report our findings as a baseline for future research. The experiments show that capture setups like ours generate out-of-distribution images for feed-forward methods, leading to suboptimal point clouds and camera poses. However, these out-of-distribution images can be shifted closer to the training distribution by applying simple preprocessing steps. Consequently, in certain industrial applications, feed-forward methods should be used with caution.

35. 【2606.16633】DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

Link: https://arxiv.org/abs/2606.16633

Authors: Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent vision token, effectively preserve model, Recent vision, preserve model performance, methods effectively preserve

Comments: The code will be released at: [this https URL](https://github.com/EMVision-NK/DCP-Prune)

Abstract:Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

36. 【2606.16615】SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Link: https://arxiv.org/abs/2606.16615

Authors: Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Non-invasive brain-computer interfaces, brain-computer interfaces suffer, interfaces suffer severe, suffer severe fidelity, severe fidelity degradation

Comments: none

Abstract:Non-invasive brain-computer interfaces suffer severe fidelity degradation in neural visual decoding when generalizing to natural visual experiences. Conventional multimodal contrastive representation learning solely optimizes geometric distance alignment, neglecting semantic consistency and subject selectivity, causing spurious zero-shot alignment. We propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) Semantic-entity Aware Visual Encoder (SAVE), learning spatial attention to extract semantic content without pre-trained saliency models; (2) Unified EEG Enhancer (UEE), employing multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) Prototype-based Progressive Augmenter (PPA), maintaining an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, surpassing state-of-the-art methods. Code is available at this https URL.

37. 【2606.16601】DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16601

Authors: Dingrong Wang, Xian Tao, Zhen Qu, Hengliang Luo, Xinyi Gong, Fei Shen, Zhengtao Zhang, Guiguang Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localize abnormal regions, detecting unseen defect, unseen defect categories, Large Language Model, anomaly localization aims

Comments: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: this https URL.

38. 【2606.16593】Rotational Symmetry based Object Pose Estimation from Point Clouds in the Absence of Known 3D Models

Link: https://arxiv.org/abs/2606.16593

Authors: Weichen Dai, Ruixun Yu, Yangjie Tang, Yifan Du, Yiyang Zhang, Donglei Sun, Hua Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated spray painting, rotational symmetry, pose estimation, automated spray, spray painting

Comments: none

Abstract:Object pose estimation is crucial to many industrial applications, with one example being automated spray painting using a robot. However, confidentiality concerns often limit access to high-quality 3D models, posing a significant challenge for point-cloud-based pose estimation. In such scenarios, rotational symmetry, a readily accessible characteristic of many industrial objects, can provide valuable prior information to facilitate pose estimation. In this paper, we propose a method that leverages the rotational symmetry commonly found in industrial objects to address the challenge caused by the absence of 3D models. The object pose is jointly estimated with point cloud refinement through an iterative optimization process. This optimization relies on a rotational symmetry constraint loss. To construct this loss, each 3D point is rotated according to the currently estimated pose, and multiple correspondences are identified using nearest-neighbor search by exploiting the rotational symmetry property. These correspondences are then used to compute the rotational symmetry constraint loss, which iteratively refines both the pose and the point cloud. By explicitly incorporating rotational symmetry into the optimization process, the proposed method achieves robust pose estimation and generalizes well across diverse object types. The proposed method is evaluated on a dataset specifically created for point clouds without known 3D models, consisting of four categories of synthetic objects and one real wheel hub collected from a production line. Experimental results demonstrate that the proposed method achieves performance comparable to methods that rely on known 3D models.
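
The rotational symmetry constraint described in this abstract (rotate each point, then match it by nearest-neighbor search) can be illustrated with a small sketch. It assumes the symmetry axis is already aligned with the z-axis and the symmetry order `n_fold` is known; the function names are hypothetical, and the joint pose estimation and point cloud refinement from the paper are not reproduced.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    return (c * x - s * y, s * x + c * y, z)


def symmetry_loss(points, n_fold):
    """Mean squared nearest-neighbor distance after each symmetry rotation.

    For an object with exact n_fold rotational symmetry about z, every
    rotated point lands on another point of the cloud and the loss is ~0.
    """
    if n_fold < 2:
        raise ValueError("n_fold must be at least 2")
    total, count = 0.0, 0
    for k in range(1, n_fold):
        angle = 2.0 * math.pi * k / n_fold
        for p in points:
            q = rotate_z(p, angle)
            total += min(math.dist(q, r) ** 2 for r in points)
            count += 1
    return total / count


# Four points with exact 4-fold symmetry about the z-axis.
square = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
loss_correct_order = symmetry_loss(square, 4)  # close to zero
loss_wrong_order = symmetry_loss(square, 3)    # clearly positive
```

In an optimization loop, a loss of this kind would be evaluated under the current pose estimate, so a misaligned axis raises the loss and gives the optimizer a signal to correct it.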

39. 【2606.16586】LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16586

Authors: Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, high-resolution inputs preserve

Comments: none

Abstract:Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
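
The IoU-based reward mentioned in this abstract can be illustrated with the standard intersection-over-union of two axis-aligned boxes. This is a sketch of the metric only, assuming boxes in `(x1, y1, x2, y2)` form; how LOCUS shapes or thresholds the reward is not stated in the abstract and is not modeled here.

```python
def box_iou(pred, target):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


# Predicted location of a crop versus where the crop really came from.
reward = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Because the crop's true location is known at training time, a score like this is automatically verifiable, which is what makes it usable as a proxy reward.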

40. 【2606.16580】Multi-Modal Spatio-Temporal Graph Neural Network with Mixture of Experts for Soil Organic Carbon Prediction

Link: https://arxiv.org/abs/2606.16580

Authors: Daniele Mos, Felipe Drummond, Anton Bossenbroek, Soufiane el Khinifri

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Top-soil organic carbon, Top-soil organic, organic carbon, prediction is fundamental, agricultural sustainability

Notes: Paper is 27 pages, 14 figures, 12 tables

Abstract:Top-soil organic carbon (SOC) prediction is fundamental to agricultural sustainability, land use policy and fertilization planning. Existing approaches face two limitations: they pair hand-crafted covariates with classical ML or single-modal deep models that miss rich spectral and temporal information, and grid-based architectures ignore the irregular spatial structure of field measurements. We introduce SpTGNN, a multi-modal spatio-temporal graph neural network addressing both. SpTGNN represents soil measurements as nodes in a heterogeneous graph with three edge types (spatial proximity, spectral similarity, elevation), and applies relational graph attention to learn separate patterns per relation. A fine-tuned TerraMind encoder extracts node features from Sentinel-2, Sentinel-1 and DEM signals, combined with per-sample environmental covariates and learned positional and temporal embeddings. A sparse Mixture-of-Experts module fuses the four streams via top-$k$ routing. Uncertainty is captured by pairing heteroscedastic regression (aleatoric) with deep ensembles (epistemic), and a Moran's $I$ penalty regularizes spatial autocorrelation. We evaluate on a global SOC corpus split into three regional instances ($\sim$49k samples globally, Africa $\sim$26k, Europe $\sim$14k). Our 5-member deep ensemble reports $R^2=0.762$, RMSE $=3.51\pm0.48$ g/kg and MAPE $=22.9\%$ on the Africa test split, improving over a tabular XGBoost baseline; the best single checkpoint reaches validation $R^2=0.864$. Ablations confirm the heterogeneous graph, MoE fusion and fine-tuned backbone each contribute substantively, and the ensemble UQ stack achieves post-calibration ECE of $0.031$ (hybrid) and $0.026$ ($\beta$-NLL). To our knowledge, this is the first framework to unify foundation-model feature extraction, heterogeneous graph attention and decomposed uncertainty quantification for SOC estimation.
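For readers unfamiliar with Moran's $I$, the statistic behind the spatial-autocorrelation penalty, here is a minimal pure-Python sketch of the standard global form. How the paper turns it into a training penalty is not stated in the abstract; applying it to prediction residuals, and the name `morans_i`, are assumptions for illustration only.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` under a spatial weight matrix `weights`.

    weights[i][j] is the spatial weight between sites i and j (0 if unrelated).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-product of deviations over all weighted pairs.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    if w_sum == 0 or den == 0:
        return 0.0  # undefined case; returning 0 is a convention chosen here
    return (n / w_sum) * num / den


if __name__ == "__main__":
    # Four sites on a line, each linked to its immediate neighbours.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    residuals = [1.0, 1.0, 5.0, 5.0]  # hypothetical model residuals
    print(morans_i(residuals, chain))
```

Values near zero indicate no spatial pattern in the residuals, while positive values mean that neighbouring sites share similar errors, which is the structure such a penalty would discourage.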

41. 【2606.16573】Transformation-driven generation of comparable projection images from multimodal anatomical scenes

Link: https://arxiv.org/abs/2606.16573

Authors: Dariusz Pojda, Krzysztof Domino, Michał Tarnawski, Agnieszka Anna Tomaka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: undergo independent spatial, independent spatial transformations, Digitally Reconstructed Radiograph, generating reproducible projection-space, work addresses

Notes: 36 pages, 11 figures

Abstract:This work addresses the computational problem of generating reproducible projection-space observations from heterogeneous anatomical scenes whose components may undergo independent spatial transformations. We propose a transformation-driven framework for synthetic projection imaging from multimodal anatomical data and demonstrate it on mandibular-motion scenarios. In contrast to conventional Digitally Reconstructed Radiograph (DRR) approaches primarily designed for registration, projection realism, or rendering efficiency, the proposed formulation treats projection imaging as an observation process operating on an explicitly represented anatomical scene. Independently transformable volumetric and surface-based anatomical objects are embedded within a shared scene representation and propagated directly into projection space through explicit transformations. Projection geometry, acquisition modelling, material interpretation, and image presentation remain explicitly separated, enabling controlled exploration of methodological assumptions while preserving reproducibility and direct comparability between generated projections. Particular emphasis is placed on transformation-driven anatomical scenarios relevant to craniofacial analysis, including mandibular motion and therapeutic repositioning. Using a shared anatomical reference scene composed of CT/CBCT volumes, segmented structures, surface models, and auxiliary anatomical or therapeutic objects, the framework enables generation of directly comparable VirtualRTG projections from multiple anatomical configurations while preserving identical imaging assumptions. Rather than aiming at fully physically faithful radiographic simulation, the proposed approach provides a controllable and reproducible methodological environment for studying anatomy--projection relationships, motion observability, and transformation-aware imaging workflows.

42. 【2606.16569】PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Link: https://arxiv.org/abs/2606.16569

Authors: Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: times underpins persistent, underpins persistent spatial, persistent spatial memory, Registering two captures, form is RGB-only

Notes: Project page: [this https URL](https://rckola.github.io/prose/)

Abstract:Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

43. 【2606.16566】Local-GS: Accelerating 3D Gaussian Splatting via Tile-Local Warp Coherence

Link: https://arxiv.org/abs/2606.16566

Authors: Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly advanced real-time, Gaussian Splatting, collections of anisotropic, Gaussian primitives, significantly advanced

Abstract:3D Gaussian Splatting (3DGS) has significantly advanced real-time novel view synthesis by representing scenes as dense collections of anisotropic 3D Gaussian primitives. However, the irregular spatial distribution of Gaussians often leads to poor GPU utilization, as warp divergence and redundant computation degrade rendering performance. To address this, we present Local-GS, a warp-coherent rendering paradigm that organizes Gaussian primitives with respect to SIMT (Single Instruction, Multiple Threads) execution boundaries rather than scene geometry. Specifically, we propose three warp-coherent stages: a hoisting stage that precomputes shared parameters at tile level, a culling stage that discards warps with no contribution, and a blending stage that replaces per-pixel branching with a uniform instruction stream. Across extensive benchmarks on multiple datasets, Local-GS improves efficiency without compromising quality. As a plug-and-play optimization, it provides additional performance gains to all tested baselines, culminating in a $7.76\times$ speedup on Deep Blending scenes.

44. 【2606.16535】Assessing Reliability of Symbol Detection in Concept Bottleneck Models

Link: https://arxiv.org/abs/2606.16535

Authors: Javier Fumanal-Idocin, Javier Andreu-Perez

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)

Keywords: explainable Artificial Intelligence, Artificial Intelligence, explainable Artificial, relevant tool, tool for explainable

Abstract:Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above $99\%$, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.
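The swap test described in the abstract can be sketched on a toy example. Everything below is hypothetical: the detectors, heads, and samples are invented for illustration, and "swap drop" and "relative retention" are assumed to mean the accuracy difference and the accuracy ratio between the native and swapped pairings, which the abstract does not define formally.

```python
def accuracy(detector, head, samples):
    """Fraction of samples where head(detector(x)) matches the label."""
    correct = sum(1 for x, y in samples if head(detector(x)) == y)
    return correct / len(samples)


def swap_report(detector, native_head, foreign_head, samples):
    native = accuracy(detector, native_head, samples)
    swapped = accuracy(detector, foreign_head, samples)
    drop = native - swapped                                # assumed "swap drop"
    retention = swapped / native if native > 0 else 0.0    # assumed "relative retention"
    return native, swapped, drop, retention


# Toy task: the label is whether the first raw feature is positive.
samples = [((1, 1), True), ((1, -1), True), ((-1, 1), False), ((-1, -1), False)]

# A faithful model stores concept k in slot k and its head reads slot 0.
faithful_detector = lambda x: (x[0] > 0, x[1] > 0)
faithful_head = lambda c: c[0]

# A leaky model stores the task-relevant signal in the wrong slot; its own
# head compensates, so native accuracy stays perfect.
leaky_detector = lambda x: (x[1] > 0, x[0] > 0)
leaky_head = lambda c: c[1]

if __name__ == "__main__":
    print(swap_report(faithful_detector, faithful_head, faithful_head, samples))
    print(swap_report(leaky_detector, leaky_head, faithful_head, samples))
```

The leaky pair is perfect on its own but falls to chance once its detector is paired with a head that expects the shared vocabulary, which is the failure the swap protocol is meant to expose.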

45. 【2606.16533】Kairos: A Native World Model Stack for Physical AI

Link: https://arxiv.org/abs/2606.16533

Authors: Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real deployment constraints, passive visual generators, natively acquire world, acquire world knowledge, native world model

Abstract:World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.

46. 【2606.16519】BadWorld: Adversarial Attacks on World Models

Link: https://arxiv.org/abs/2606.16519

Authors: Linghui Shen, Mingyue Cui, Xingyi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual world models, Visual world, synthesize interactive, single context image, single context

Notes: Project Page: [this https URL](https://linghuiishen.github.io/BadWorld/)

Abstract:Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.

47. 【2606.16502】Active Reference Acquisition in Few-Shot Font Generation

Link: https://arxiv.org/abs/2606.16502

Authors: Shinnosuke Matsuo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-shot font generation, preserving stylistic consistency, font generation aims, reference, Few-shot font

Notes: Accepted at ICDAR2026

Abstract:Few-shot font generation aims to synthesize the remaining glyphs of a font given one or a few reference glyphs while preserving stylistic consistency, thereby supporting font designers in efficiently completing a typeface. Existing methods primarily focus on improving generation quality given a fixed reference set. However, when the current reference glyphs are insufficient to represent the target style, few-shot font generation may fail to produce satisfactory results. In practical scenarios, additional reference glyphs can often be obtained from the designer when necessary. Accordingly, we propose a new framework, Active Reference Acquisition in Few-Shot Font Generation, in which the model sequentially decides which character to acquire next as an additional reference. Furthermore, we propose a reference part-coverage-based acquisition function to efficiently query the designer. Motivated by the observation that font styles are well characterized by local structural parts, we represent each glyph using a histogram of local features and select query characters that maximize the expected part coverage of the reference set. By prioritizing characters that contain parts not yet covered by the current references, the proposed method progressively expands the diversity of visual parts in the reference set. As a result, generation quality is improved with fewer queries. Experiments on the Google Fonts dataset demonstrate that the proposed method achieves higher generation quality than random querying and reference-agnostic baselines. The code is available at this https URL.
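The part-coverage idea can be sketched as a greedy selection loop. This is a simplification under stated assumptions: each character is reduced to a set of part ids (the paper uses histograms of local features and an expected coverage), and the part ids, the names `next_query` and `acquire`, and the stopping rule are all hypothetical.

```python
def next_query(candidate_parts, covered):
    """Pick the character whose expected parts add the most uncovered parts."""
    best_char, best_gain = None, -1
    for ch in sorted(candidate_parts):  # sorted for deterministic tie-breaking
        gain = len(candidate_parts[ch] - covered)
        if gain > best_gain:
            best_char, best_gain = ch, gain
    return best_char, best_gain


def acquire(parts, initial_refs, budget):
    """Greedily extend the reference set for up to `budget` queries."""
    refs = list(initial_refs)
    covered = set()
    for ch in refs:
        covered |= parts[ch]
    for _ in range(budget):
        remaining = {c: p for c, p in parts.items() if c not in refs}
        if not remaining:
            break
        ch, gain = next_query(remaining, covered)
        if gain <= 0:
            break  # no candidate adds a new part
        refs.append(ch)
        covered |= parts[ch]
    return refs, covered


if __name__ == "__main__":
    # Hypothetical part ids (e.g. stems, bowls, crossbars) per character.
    parts = {"A": {1, 2}, "B": {2, 3}, "H": {1, 4, 5}, "O": {6}, "Q": {6, 7}}
    print(acquire(parts, initial_refs=["A"], budget=2))
```

A greedy rule like this prefers characters that bring in parts the current references lack, which matches the abstract's stated goal of expanding part diversity with few queries.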

48. 【2606.16494】Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

Link: https://arxiv.org/abs/2606.16494

Authors: Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Wikipedia-scale knowledge base, Knowledge-based visual question, vision-language systems answer, Wikipedia-scale knowledge, visual question answering

Notes: 15 pages, 9 figures. Under review at EMNLP 2026

Abstract:Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
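A gold-position protocol of the kind described can be sketched as follows. This covers only the text side of the prompt and is not the authors' released protocol: the prompt template and the names `gold_position_prompts` and `primacy_gap` are assumptions for illustration.

```python
def gold_position_prompts(question, gold, distractors):
    """Build one prompt per slot, moving only the gold passage.

    The distractors keep their relative order, so the gold passage's slot is
    the only thing that changes across prompts for a given question.
    """
    prompts = {}
    for slot in range(len(distractors) + 1):
        passages = distractors[:slot] + [gold] + distractors[slot:]
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
        prompts[slot] = f"{context}\n\nQuestion: {question}\nAnswer:"
    return prompts


def primacy_gap(accuracy_by_slot):
    """Accuracy with gold in the first slot minus accuracy with gold in the last."""
    first, last = min(accuracy_by_slot), max(accuracy_by_slot)
    return accuracy_by_slot[first] - accuracy_by_slot[last]


if __name__ == "__main__":
    prompts = gold_position_prompts(
        "Which river runs through the city shown?",      # hypothetical question
        "GOLD: the passage containing the answer",
        ["distractor one", "distractor two", "distractor three"],
    )
    print(prompts[0])
```

Reader accuracy measured per slot then gives the position curve; a large positive `primacy_gap` corresponds to the "Lost at the End" pattern the abstract reports.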

49. 【2606.16484】Unified Multimodal Model for Brain MRI Imputation and Understanding

Link: https://arxiv.org/abs/2606.16484

Authors: Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: knowledge from LLM, hold great potential, Multimodal large language, large language models, hold great

Notes: Early accepted to MICCAI 2026

Abstract:Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

50. 【2606.16479】Uncertainty Quality of VGGT: An Analysis on the DTU Benchmark Dataset

Link: https://arxiv.org/abs/2606.16479

Authors: Markus Hillemann, Robert Langendörfer, Steven Landgraf, Markus Ulrich

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Geometry Grounded, Geometry Grounded Transformer, Visual Geometry, Grounded Transformer, Geometry Grounded

Notes: Accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

Abstract:Visual Geometry Grounded Transformer (VGGT) has already attracted a great deal of attention in a short period of time, not least due to the Best Paper Award at CVPR-2025. Similar to DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by replacing established methods like bundle adjustment and feature matching with a simple, unified, feed-forward neural network that predicts camera poses, depth maps, and dense 3D structure directly from multiple images of a scene in a few seconds. A key aspect is its ability to process an arbitrary number of views consistently in a single forward pass without any post-processing or iterative optimization. For photogrammetry, this opens new possibilities for real-time, scalable, and accessible 3D reconstruction. In this context, not only high reconstruction accuracy but also high-quality uncertainty estimates are crucial, as they foster trust and enable robust quality assurance. This paper therefore investigates the quality of VGGT's uncertainty predictions. The analysis identifies an effective confidence threshold for filtering VGGT's raw output and demonstrates that enhancing uncertainty quality holds strong potential for improving the accuracy of its 3D reconstructions.
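Filtering by a confidence threshold, as the abstract describes, amounts to a simple sweep over candidate thresholds. The sketch below uses invented per-point errors and confidences; the threshold values and the name `threshold_sweep` are assumptions, and the paper's actual threshold is not given in the abstract.

```python
def threshold_sweep(errors, confidences, thresholds):
    """For each threshold, report the kept fraction and mean error of kept points."""
    rows = []
    for t in thresholds:
        kept = [e for e, c in zip(errors, confidences) if c >= t]
        kept_fraction = len(kept) / len(errors)
        mean_error = sum(kept) / len(kept) if kept else None
        rows.append((t, kept_fraction, mean_error))
    return rows


if __name__ == "__main__":
    # Hypothetical per-point reconstruction errors and predicted confidences.
    errors = [0.1, 0.2, 1.5, 0.3]
    confidences = [5.0, 4.0, 1.2, 3.0]
    for row in threshold_sweep(errors, confidences, thresholds=[1.0, 2.0]):
        print(row)
```

If the confidences are informative, raising the threshold lowers the mean error of the retained points at the cost of completeness, and that trade-off is what selecting an effective threshold has to balance.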

51. 【2606.16477】AURA: Active-Response Attribution under Treatment Ambiguity in Bacterial Cytological Profiling

Link: https://arxiv.org/abs/2606.16477

Authors: Kartik Jhawar, Mrunmayee Deshpande, Wilfried Moreira, Guillermo C. Bazan, Lipo Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: drug necessarily acts, applied drug necessarily, drug necessarily, drug leaves, necessarily acts

Abstract:When a bacterial sample is exposed to several antibiotics, not every applied drug necessarily acts: if the organism is resistant to one of them, that drug leaves no morphological trace. The clinically meaningful quantity is therefore not which antibiotics were applied, but which ones were active. We show that these two are sharply decoupled in real E. coli microscopy - naively assuming the applied combination equals the active one is correct only about 37% of the time - yet existing computational tools are ill-suited to recovering the active set. Forward perturbation models such as scGen, CPA, and IMPA are designed to predict appearance from treatment, not the reverse, and inverting them degrades sharply; discriminative image classifiers tend to memorise strain- and batch-specific texture and fail to transfer across experimental replicates. We introduce AURA, which reframes the task as constrained, energy-based inverse attribution. Its central inductive bias is that the active set must be a subset of the applied set; this collapses the candidate space and lets AURA infer the active subset of applied antibiotics by decomposing residual morphology into antibiotic response atoms and selecting the subset with the lowest reconstruction energy, using no strain label at test time. AURA-E adds evidence-aware abstention, withholding a prediction when candidate explanations remain near-equally plausible. On cross-replicate transfer in an E. coli cytological profiling dataset, AURA recovers the active antibiotic combination with 95.47% exact-match accuracy.
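The subset constraint and lowest-energy selection can be illustrated with a small enumeration. This is a sketch under strong assumptions rather than the AURA model: response atoms are taken to combine additively, energy is the squared reconstruction error, abstention uses a fixed margin between the two best candidates, and the drug names and vectors are invented.

```python
from itertools import combinations


def energy(residual, atoms, subset):
    """Squared error between the residual and the sum of the subset's atoms."""
    recon = [0.0] * len(residual)
    for drug in subset:
        for i, v in enumerate(atoms[drug]):
            recon[i] += v
    return sum((r - x) ** 2 for r, x in zip(residual, recon))


def infer_active(residual, atoms, applied, margin=0.0):
    """Return the lowest-energy subset of `applied`, or None to abstain."""
    applied = sorted(applied)
    # Only subsets of the applied drugs are candidates, including the empty
    # set (no applied drug was active).
    candidates = [c for r in range(len(applied) + 1) for c in combinations(applied, r)]
    scored = sorted((energy(residual, atoms, c), list(c)) for c in candidates)
    best_energy, best_subset = scored[0]
    if len(scored) > 1 and scored[1][0] - best_energy < margin:
        return None  # explanations are near-equally plausible
    return best_subset


if __name__ == "__main__":
    atoms = {"amp": [1.0, 0.0, 0.0], "cip": [0.0, 1.0, 0.0], "tet": [0.0, 0.0, 1.0]}
    residual = [0.1, 0.9, 0.0]  # hypothetical residual morphology features
    print(infer_active(residual, atoms, applied={"amp", "cip"}, margin=0.1))
```

With d applied drugs the candidate space is only the 2^d subsets of that set, not every combination in the drug library, which is the collapse the abstract refers to.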

52. 【2606.16474】MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

Link: https://arxiv.org/abs/2606.16474

Authors: Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Monocular visual odometry, robotic localization, foundational to autonomous, autonomous navigation, navigation and robotic

Notes: 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Abstract:Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.

53. 【2606.16470】Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands

Link: https://arxiv.org/abs/2606.16470

Authors: Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Translating video demonstrations, executable robot commands, Translating video, demonstrations into executable, executable robot

Abstract:Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.

54. 【2606.16457】ResEdit: Residual embeddings for precise generative image editing

Link: https://arxiv.org/abs/2606.16457

Authors: Ahmet Canberk Baykal, Valentin Deschaintre, Yannick Hold-Geoffroy, Michael Fischer, Anna Frühstück, Cengiz Öztireli, Iliyan Georgiev

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Conditional diffusion image, paired fine-tuning data, large-scale paired fine-tuning, Conditional diffusion, diffusion image generators

Notes: Accepted to the EGSR 2026 journal track

Abstract:Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning enables both improved identity preservation and better editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, thereby reducing the reliance on inversion and susceptibility to its aforementioned pitfalls. To ensure this residual does not interfere with desired edits, we incorporate a gradient reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method's ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.

55. 【2606.16449】PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Link: https://arxiv.org/abs/2606.16449

Authors: Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: operations requires persistence, editing operations requires, Consistent video generation, modify scene appearance, edits modify scene

Notes: Project page: [this https URL](https://ys-imtech.github.io/projects/PermaVid/)

Abstract:Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

56. 【2606.16448】Hierarchical Fine-Grained Aerial Object Detection

Link: https://arxiv.org/abs/2606.16448

Authors: Yan Zhang, Fang Xu, Wen Yang, Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced scene understanding, object detection, aerial object detection, remote sensing, real-world object categories

Notes: 15 pages

Abstract:Fine-grained aerial object detection, driven by the intrinsic granularity of real-world object categories, is crucial for advanced scene understanding in remote sensing. Existing methods largely inherit the paradigm of coarse-grained object detection, relying solely on single-label supervision and thus struggling to distinguish model-level categories with subtle structural differences. However, for each specific model (e.g., Boeing 787), structured prior knowledge such as attributes and hierarchies offers discriminative semantics across multiple granularities. Motivated by this, we present ExpertDet, a scheme that incorporates expert-informed cues to enhance fine-grained aerial object detection. Specifically, we design Vision-aware Masked Attribute Modeling (VMAM), which aligns attribute semantics with visual structures by reconstructing randomly masked attributes from visual cues, enabling the detector to capture subtle structural distinctions. We further propose Hierarchical Visual Instance Promotion (HierVIP), which builds a visual prototype tree based on hierarchical relations and imposes taxonomy-aware constraints to preserve cross-level semantic continuity while enhancing category discrimination. Moreover, we curate a new fine-grained object detection benchmark for Precise recognition of model-specific Ships and Planes from aerial imagery, PSP, covering 106 ship classes and 30 airplane models, respectively, featuring the most extensive collection of model-specific categories among existing aerial object detection datasets to date. We benchmark state-of-the-art object detection algorithms on the PSP benchmark. Extensive evaluation demonstrates that ExpertDet consistently outperforms other fine-grained competitors across hierarchy levels. The dataset, benchmark, and code are available at this https URL.

57. 【2606.16436】V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

链接https://arxiv.org/abs/2606.16436

作者:Kaihan Chen,Yanming Shao,Haifeng Ji,Xiaokang Yang,Yao Mu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Achieving autonomous robotic, human-like action sequences, Achieving autonomous, manipulation requires precise, autonomous robotic dexterous

备注

点击查看摘要

Abstract:Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

58. 【2606.16421】Beer-Lambert Guided Representation Learning for Unsupervised Anomaly Detection in Sub-THz Food Inspection Images

链接https://arxiv.org/abs/2606.16421

作者:Gyutae Hwang,Sang Jun Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:manufacturing requires reliable, detect foreign material, foreign material contamination, maintain product safety, Food manufacturing requires

备注: 6 pages, 3 figures

点击查看摘要

Abstract:Food manufacturing requires reliable inspection systems to detect foreign material contamination and maintain product safety. Sub-THz transmission imaging provides material-dependent attenuation characteristics that are useful for detecting low-density contaminants in food products. However, existing unsupervised anomaly detection methods mainly rely on RGB-pretrained visual representations, which may not adequately capture the transmission behavior of Sub-THz images. This paper proposes a Beer-Lambert guided representation learning framework for unsupervised anomaly detection in Sub-THz food inspection images. The proposed method introduces an attenuation decomposition module as an auxiliary regularization module that constrains student representations through attenuation reconstruction during training. In addition to the conventional one-class setting, we introduce a Leave-One-Food-Out protocol to evaluate generalization capability under unseen food categories. Experimental results on the Inline-Food-Inspection-THz dataset show that the proposed method improves overall anomaly detection performance over the baseline method.

59. 【2606.16414】Instance-Aware Knowledge Distillation for Semi-Supervised Learning of an On-Board Multi-Task Dense Prediction Model for Collision Avoidance System

链接https://arxiv.org/abs/2606.16414

作者:Gyutae Hwang,Sang Jun Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:driving scene understanding, camera-based deep learning, deep learning approaches, scene understanding, evolved toward camera-based

备注: 13 pages, 7 figures

点击查看摘要

Abstract:Collision avoidance systems have evolved toward camera-based deep learning approaches for driving scene understanding. However, deployment in edge environments such as country clubs is constrained by limited computational resources and unreliable communication infrastructure. Moreover, constructing large-scale datasets for the target domain involves substantial annotation cost. To address these limitations, we propose an instance-aware knowledge distillation framework for semi-supervised learning. Specifically, we generate pseudo labels that mitigate teacher bias by leveraging domain priors from the teacher and instance-centric knowledge from foundation models. The trained lightweight student is deployed in the proposed collision avoidance system and performs multiple dense prediction tasks in real-time. The system detects frontal obstacles and encodes their spatial information into controller area network messages for automated guided vehicle operation. To achieve this, we construct a large-scale country club dataset and perform field validation of the proposed system. Experimental results demonstrate that the student outperforms the large teacher in instance segmentation while mitigating performance degradation in monocular depth estimation. Compared with the teacher, the student reduces FLOPs by 22.68$\times$ and parameters by 14.33$\times$, achieving 6.46 FPS on a low-cost edge device.

60. 【2606.16401】RGFVR: Reference-Guided Face Video Restoration with Flow Matching

链接https://arxiv.org/abs/2606.16401

作者:Cem Eteke,Batuhan Tosun,Eckehard Steinbach

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requires simultaneously recovering, simultaneously recovering visual, recovering visual fidelity, Face video restoration, degraded observations

备注

点击查看摘要

Abstract:Face video restoration from degraded observations is challenging, as it requires simultaneously recovering visual fidelity, temporal consistency, and subject identity. Existing approaches are often either reference-free, which can lead to identity loss when person-specific facial details are lost, or subject-specific, which limits generalization to unseen identities. We propose a subject-agnostic, reference-guided framework for identity-preserving face video restoration. Our method introduces bimodal perceptual-descriptive identity conditioning into a pretrained flow-based text-to-video generator and employs a two-stage training strategy to strengthen identity guidance during restoration. Experiments show that our approach improves restoration fidelity, temporal consistency, and identity preservation, achieving superior performance under challenging video degradations, including downsampling, blur, noise, and compression artifacts. The code is available under: this https URL.

61. 【2606.16396】SP$^3$: Spherical Priors for Plug-and-Play Restoration

链接https://arxiv.org/abs/2606.16396

作者:Sean Man,Ron Raphaeli,Matan Kleiner,Or Ronai

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Spherical Encoders, denoisers with Spherical, algorithm that accelerates, accelerates maximum, maximum a posteriori

备注

点击查看摘要

Abstract:In this paper, we introduce SP$^3$, a novel Plug-and-Play algorithm that accelerates maximum a posteriori image restoration by replacing denoisers with Spherical Encoders (SE) as generative priors. SP$^3$ approximates the intractable proximal prior step by utilizing the SE tightly structured latent space as a robust projection onto the natural image manifold. Alternating this projection with a closed-form data-consistency step, via Half-Quadratic Splitting, achieves stable convergence without requiring gradient computation during inference. This unique formulation unlocks "anytime" restoration capabilities, producing sharp, plausible images from the first iteration. Evaluations across a variety of image restoration tasks demonstrate that SP$^3$ achieves perceptual quality comparable to state-of-the-art zero-shot diffusion and flow methods while being $3$-$630\times$ faster.
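As a rough illustration of the alternation described in this abstract, the sketch below runs Half-Quadratic Splitting on a toy denoising problem. It assumes an identity forward operator, so the data-consistency step has the closed form x = (y + mu * z) / (1 + mu). The function `toy_projection` is a hypothetical stand-in (clipping to [0, 1]) for the Spherical Encoder projection, which the abstract does not specify; the fixed `mu` and the iteration count are also assumptions.

```python
def data_step(y, z, mu):
    """Closed-form minimizer of 0.5*||y - x||^2 + 0.5*mu*||x - z||^2 (identity operator assumed)."""
    return [(yi + mu * zi) / (1.0 + mu) for yi, zi in zip(y, z)]


def toy_projection(x):
    """Hypothetical prior step: clip to [0, 1]. Stands in for a learned projection."""
    return [min(1.0, max(0.0, xi)) for xi in x]


def hqs_restore(y, mu=1.0, iterations=20):
    """Alternate the prior projection and the data step; no gradients are computed."""
    z = toy_projection(y)
    for _ in range(iterations):
        x = data_step(y, z, mu)
        z = toy_projection(x)
    return z


observation = [-0.4, 0.3, 1.6]
restored = hqs_restore(observation)
print(restored)
```

Because every iterate ends with the projection, the intermediate `z` is always a valid output, which is one way to read the "anytime" property mentioned above.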

62. 【2606.16392】Towards UAV Image Dehazing: A UAV Atmospheric Scattering Model, Benchmark, and Geometry-Aware Deep Unfolding Network

链接https://arxiv.org/abs/2606.16392

作者:Wenxuan Fang,Jiangwei Weng,Yu Zheng,Junkai Fan,Guangfa Wang,Xiang Chen,Jian Yang,Jun Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:obscures distant details, weaken structural information, significantly obscures distant, atmospheric scattering model, haze significantly obscures

备注

点击查看摘要

Abstract:In UAV applications, haze significantly obscures distant details and weakens structural information, hindering the recovery of details. Current UAV scenarios still face two key challenges: (i) paired hazy/clean images from the real world are unobtainable, while the classical atmospheric scattering model is inadequate for modeling the spatially non-uniform haze in UAV imagery; (ii) existing dehazing methods struggle to remove the heavy haze accumulated in the upper regions of UAV images. To address these issues, we first propose a UAV Atmospheric Scattering Model (UASM), which explicitly incorporates flight altitude, viewing pitch, and extinction to characterize the non-uniform haze distribution in UAV imaging. Based on UASM, we develop a physics-driven dehazing framework, termed Geometry-aware Proximal Deep Unfolding Network (GP-DUN). Specifically, GP-DUN consists of three key modules: a Latent Geometry Estimator (LGE) that infers transmittance consistent with UAV imaging geometry, a Geometry-aware Gradient Descent Module (GeoGDM) that embeds UASM into the data-fidelity term and performs physics-consistent closed-form updates, and a Pooling-Expert Proximal Mapping Module (PE-PMM) that learns an implicit prior to restore textures and structures beyond the capability of explicit physical modeling. In addition, we construct UASM-HazeSet, which provides controllable paired synthetic data together with 2,285 real UAV haze images for testing. Extensive experiments show that GP-DUN consistently outperforms existing methods on both UASM-HazeSet and real UAV haze benchmarks.

63. 【2606.16354】GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

链接https://arxiv.org/abs/2606.16354

作者:Ziying Song,Caiyan Jia,Lin Liu,Shaoqing Xu,Lei Yang,Yadan Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:camera sensors, overlooked challenge, calibration uncertainties, uncertainties between LiDAR, LiDAR and camera

备注: 30 pages, 7 figures

点击查看摘要

Abstract:Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.

64. 【2606.16353】What Should a Streaming Video Model Remember?

链接https://arxiv.org/abs/2606.16353

作者:Haonan Ge,Yiwei Wang,Hang Wu,Yujun Cai

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:video understanding models, computation budgets, fixed memory, memory, understanding models

备注

点击查看摘要

Abstract:Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose SelectStream, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67% on StreamingBench, 67.03% on OVO-Bench, and 74.4% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.

65. 【2606.16342】When the Past Matters: FlashBack Memory for Precipitation Nowcasting

链接https://arxiv.org/abs/2606.16342

作者:Yuhao Du,Boxiao Huang,Chengrong Wu,Jiankai Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:long range dependency, range dependency modeling, high spatiotemporal resolution, Accurate precipitation nowcasting, socio-economic planning

备注

点击查看摘要

Abstract:Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent-based models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent-based precipitation nowcasting models.

66. 【2606.16334】Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

链接https://arxiv.org/abs/2606.16334

作者:Parthaw Goswami,Jaynto Goswami Deep

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:scenes is inherently, inherently temporal, visual scenes, Human, CHRONOREVERSE

备注

点击查看摘要

Abstract:Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.

67. 【2606.16333】Differentiable Packing of Irregular 3D Objects with Adaptive Container Estimation

链接https://arxiv.org/abs/2606.16333

作者:Palak Gupta,Shanmuganathan Raman

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

关键词:manual tuning problem, outer search loop, single container dimension, remaining dimensions, single gradient-based loop

备注: 20 pages, 8 figures, 5 tables. Under review at Computers & Graphics (Elsevier)

点击查看摘要

Abstract:Most existing approaches either fix the container in advance or optimize only a single container dimension through an outer search loop, leaving the remaining dimensions as a manual tuning problem. We present a differentiable packing framework that jointly optimizes all 6N object pose parameters and all three container side lengths inside a single gradient-based loop. The formulation combines six physics-inspired, differentiable loss terms computed directly on triangle meshes through axis-aligned bounding-box proxies. An adaptive squeezing mechanism periodically tightens the container whenever the overlap loss falls below a pair-count-scaled threshold, producing a large initial drop in container volume, followed by small refinements. All pairwise computations are written in tensor-broadcasting form, giving a 3.4 to 54 times speedup over a reference loop-based implementation. The pipeline is implemented in Python and PyTorch, with no physics engine, FFT library, or convex decomposition. On multiple object categories, the method produces containers that are 11 to 32 percent smaller than time-matched DBLF and simulated-annealing baselines at N = 100, while running in under 4 minutes per instance on a single consumer GPU.
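To make the idea of an overlap term on axis-aligned bounding-box proxies concrete, here is a minimal loop-based sketch. It is not the paper's loss: the box representation `(min_xyz, max_xyz)` and the function names are assumptions for illustration. It is written with plain loops, that is, the kind of reference implementation that a tensor-broadcasting version would replace.

```python
from itertools import combinations


def aabb_overlap_volume(box_a, box_b):
    """Overlap volume of two axis-aligned boxes, each given as (min_xyz, max_xyz)."""
    volume = 1.0
    for axis in range(3):
        lo = max(box_a[0][axis], box_b[0][axis])
        hi = min(box_a[1][axis], box_b[1][axis])
        # The per-axis overlap length is clamped at zero when the boxes are separated.
        volume *= max(0.0, hi - lo)
    return volume


def total_overlap(boxes):
    """Sum of pairwise overlap volumes over all N*(N-1)/2 box pairs."""
    return sum(aabb_overlap_volume(a, b) for a, b in combinations(boxes, 2))


boxes = [
    ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
    ((1.0, 1.0, 1.0), (3.0, 3.0, 3.0)),  # overlaps the first box in a unit cube
    ((5.0, 5.0, 5.0), (6.0, 6.0, 6.0)),  # disjoint from both
]
print(total_overlap(boxes))
```

The clamped product is piecewise smooth in the box coordinates, which is why a term of this shape can be driven toward zero by gradient descent once it is expressed in an autodiff framework.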

68. 【2606.16325】Attention-Based Prototype Calibration for Multi-Rater Few-Shot Medical Image Segmentation

链接https://arxiv.org/abs/2606.16325

作者:Truong Vu,Minh Khoi Ho,Yutong Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:methods typically assume, overlooking systematic variability, expert raters commonly, raters commonly observed, single ground-truth annotation

备注: MICCAI 2026 main track

点击查看摘要

Abstract:Few-shot medical image segmentation methods typically assume a single ground-truth annotation, overlooking systematic variability across expert raters commonly observed in clinical datasets. We propose an attention-based prototype calibration framework for few-shot multi-rater segmentation that models rater-specific deviations from a consensus representation in prototype space. A lightweight yet principled attention operator directly refines rater prototypes without modifying the backbone feature extractor, making the approach fully compatible with existing prototype-based few-shot segmentation methods. This design preserves semantic consistency while enabling personalized segmentation outputs with minimal computational overhead. Experiments on multi-rater medical imaging datasets demonstrate consistent improvements over baseline prototype approaches, highlighting the effectiveness of structured prototype calibration for modeling annotation variability. Our code is available at this https URL.

69. 【2606.16323】HAFMat: Hybrid Priors Guided Adaptive Fusion for Single-Image Human Material Estimation

链接https://arxiv.org/abs/2606.16323

作者:Yu Jiang,Jiahao Xia,Jiongming Qin,Jianchi Sun,Chunxia Xiao

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:virtual content creation, digital human rendering, Physically based rendering, fundamental appearance decomposition, appearance decomposition task

备注

点击查看摘要

Abstract:Physically based rendering (PBR) material estimation is a fundamental appearance decomposition task with broad applications in virtual content creation, relighting, and digital human rendering. However, estimating PBR materials from a single human image remains highly ill-posed, since illumination, geometry, and reflectance are heavily entangled in the observed appearance. To mitigate this ambiguity, we propose HAFMat, a hybrid-prior-guided framework for single-image human material estimation. Our method introduces guidance maps that encode complementary cues, including appearance, body geometry, structure, and prior material predictions from pre-trained models. A key observation is that these guidance cues are heterogeneous: some cues mainly provide texture-level constraints, while others convey higher-level semantic information. To exploit this property, we design a Multi-layer Adaptive Feature Fusion Mechanism, which adaptively fuses guidance features with decoder features at different stages. This design enables texture-dominant and semantic-dominant cues to guide material decoding at appropriate levels, leading to more accurate and physically plausible material estimation. Extensive experiments on both synthetic and real data demonstrate that our method achieves state-of-the-art performance in material estimation and downstream relighting.

70. 【2606.16317】Training-free sparse attention based on cumulative energy filtering

链接https://arxiv.org/abs/2606.16317

作者:Chunlu Li,Yixuan Pan,Bai Du,Zhenyuan Chen,Yanzhao Li,Hui Dong,Hui Wang,Zhiqiang Zou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:accelerates Diffusion Transformers, Diffusion Transformers, Sparse attention accelerates, attention accelerates Diffusion, accelerates Diffusion

备注

点击查看摘要

Abstract:Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This results in an approximate 15% in attention computation and a $1.61\times$ increase in computational efficiency, which is $1.18\times$ higher than that of BLASST.
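The contrast the abstract draws between Top-k and Top-p can be seen on a toy attention row. The sketch below implements only these two standard baselines, not the paper's dynamic thresholding or its Flash Attention integration; the weights and the values `k=2` and `p=0.8` are made up for illustration.

```python
def _descending_order(weights):
    """Token indices sorted by weight, largest first (ties keep original order)."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)


def top_k_indices(weights, k):
    """Fixed budget: always keep the k highest-weight tokens."""
    return sorted(_descending_order(weights)[:k])


def top_p_indices(weights, p):
    """Fixed recall: keep the fewest highest-weight tokens whose total mass reaches p."""
    kept, mass = [], 0.0
    for i in _descending_order(weights):
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)


peaked = [0.70, 0.20, 0.05, 0.03, 0.02]  # attention concentrated on few tokens
flat = [0.25, 0.25, 0.20, 0.15, 0.15]    # attention spread across tokens

print(top_k_indices(peaked, 2), top_k_indices(flat, 2))
print(top_p_indices(peaked, 0.8), top_p_indices(flat, 0.8))
```

On the peaked row both rules keep the same two tokens. On the flat row Top-k still keeps two tokens and so retains only half of the attention mass, while Top-p grows to four tokens to hold its recall target, which is the budget-versus-accuracy tension described above.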

71. 【2606.16302】Explainable Flood Segmentation on Sentinel-1 SAR Imagery: A Comparative Study of CNN and Transformer Architectures

链接https://arxiv.org/abs/2606.16302

作者:Arundhuti Banerjee,David Daou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Rapid and accurate, Synthetic Aperture Radar, mitigation planning, essential for disaster, disaster response

备注

点击查看摘要

Abstract:Rapid and accurate flood prediction is essential for disaster response and mitigation planning. Synthetic Aperture Radar (SAR) sensors in satellites are well-suited for this purpose because they operate independently of weather and daylight conditions. Although SAR-based data enable all-weather flood monitoring, distinguishing flooded land from permanent water remains a significant challenge, particularly when flooding is defined strictly as inundated land. This study provides a comprehensive comparison of convolutional neural network (CNN) and vision transformer architectures for multi-class flood segmentation using Sentinel-1 SAR imagery, specifically trained to separate flooded land from permanent water bodies and land. Three state-of-the-art (SOTA) CNN-based models, U-Net, U-Net++, and DeepLabV3 with ResNet-34 backbone, and three SegFormer variants (b0, b1, b2) were evaluated on two benchmark datasets, the ETCI NASA dataset and Sen1Floods11, using scene-based data splits to ensure a realistic assessment of spatial generalization. The results demonstrate that SegFormer-b2 significantly outperforms the U-Net baseline on the ETCI dataset (higher flood IoU across all 7 test scenes in the Wilcoxon signed-rank test), while after fine-tuning on Sen1Floods11, the advantage narrows to within the range of scene variability and is concentrated in spatially fragmented flood events. The study includes both qualitative and quantitative explainability techniques to visually comprehend model decisions and systematically assess prediction reliability. Qualitative analysis reveals that SegFormer-b2 produces more spatially coherent Grad-CAM activations focused on flood-relevant features, while U-Net generates more informative uncertainty estimates along flood boundaries.

72. 【2606.16298】DDTNet: Degradation Disentanglement and Transfer Network for Test-Time All-in-One De-weathering Adaptation

链接https://arxiv.org/abs/2606.16298

作者:Kuan-Hung Lin,Fu-Jen Tsai,Yan-Tsung Peng,Min-Hung Chen,Chia-Wen Lin,Yen-Yu Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remove multiple degradations, single unified model, aims to remove, remove multiple, single unified

备注

点击查看摘要

Abstract:All-in-one adverse weather image restoration aims to remove multiple degradations, such as rain, haze, and snow, using a single unified model. Despite their broad applicability, existing methods typically compromise performance, delivering balanced but suboptimal results for individual degradation types. This issue becomes more pronounced when a domain gap exists between training and testing data. Motivated by the observation that modeling degradation patterns is more feasible than recovering clean content, we propose the Degradation Disentanglement and Transfer Network (DDTNet), which focuses specifically on degradation transfer. By disentangling degradation patterns from target-domain degraded images and transferring them to source domain clean images, DDTNet generates domain-adaptive paired training data. These pairs are then used to fine-tune restoration models, significantly enhancing their adaptability across diverse weather conditions and domains. The core of DDTNet is the Degradation Disentanglement Module (DDM), which comprises Degradation Coupled Attention (DCA) to capture both general and weather-specific features, thereby enabling effective disentanglement and transfer of degradation patterns. Experimental results demonstrate that DDTNet significantly and consistently improves existing all-in-one models across real-world deraining, desnowing, and dehazing datasets.

73. 【2606.16295】VisualClaw: A Real-Time, Personalized Agent for the Physical World

链接https://arxiv.org/abs/2606.16295

作者:Haoqin Tu,Jianwen Chen,Zijun Wang,Siwei Han,Juncheng Wu,Hardy Chen,Haonian Ji,Kaiwen Xiong,Jiaqi Liu,Peng Xia,Jieru Mei,Hongliang Fei,Jason Eshraghian,Zeyu Zheng,Yuyin Zhou,Huaxiu Yao,Cihang Xie

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Vision language models, Vision language, complex multimodal tasks, serving as general-purpose, general-purpose interfaces

备注: H. T. and J. C. contribute to this project equally

点击查看摘要

Abstract:Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average -98% versus full-frame upload and by -25.9% over the offline uniform 8 frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a -9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

74. 【2606.16294】Sex-based Network-Specific Differences in Connectomes: A Krakencoder-Based Analysis

链接https://arxiv.org/abs/2606.16294

作者:Vibhashree S H,Debanjali Bhattacharya,Vamshi Krishna Kancharla,Neelam Sinha

类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

关键词:brain connectome modality, connectome modality propagate, Human Connectome Project, simulation framework, study examines

备注

点击查看摘要

Abstract:This study examines how deficiencies in one brain connectome modality propagate to the other, using the Krakencoder as a simulation framework. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, with the impact of each of the Yeo-7 functional networks assessed separately. Seven scenarios were considered, each involving the removal of a single network while the remaining networks were preserved. The resulting perturbations in cross-modal predictions were quantified using three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. In addition, the persistence of sex-specific information within the predicted connectomes was evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations, whereas the Somatomotor network yielded the smallest. Sex differences in network-level perturbation signatures were subtle, with the best result being an accuracy of 66.09% from connectomes predicted under network-removal conditions. In contrast, connectomes predicted from intact inputs achieved substantially higher sex classification accuracy, reaching up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-derived signatures alone.
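The three perturbation metrics named in this abstract can be sketched on a toy 2x2 "connectome". Everything below is an illustrative assumption rather than the study's pipeline: the matrices are invented, the eigenvalues are hard-coded (1.5 and 0.5 for the intact matrix, 1 and 1 for the perturbed one), spectra are assumed positive and normalized to sum to one before the KL divergence, and the Wasserstein distance is taken over flattened matrix entries.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def wasserstein_1d(u, v):
    """1-Wasserstein distance between two equal-size samples: mean gap of sorted values."""
    assert len(u) == len(v)
    return sum(abs(x - y) for x, y in zip(sorted(u), sorted(v))) / len(u)


def spectrum_kl(eig_p, eig_q):
    """KL divergence between eigenvalue spectra normalized to sum to 1 (positive values assumed)."""
    p = [x / sum(eig_p) for x in eig_p]
    q = [x / sum(eig_q) for x in eig_q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


intact = [[1.0, 0.5], [0.5, 1.0]]     # toy connectome with one cross-region link
perturbed = [[1.0, 0.0], [0.0, 1.0]]  # the same matrix with that link removed

frob = frobenius_distance(intact, perturbed)
wass = wasserstein_1d([x for row in intact for x in row],
                      [x for row in perturbed for x in row])
kl = spectrum_kl([1.5, 0.5], [1.0, 1.0])
print(frob, wass, kl)
```

The three numbers respond to different aspects of the change: entry-wise magnitude (Frobenius), the distribution of edge weights (Wasserstein), and the shape of the eigenvalue spectrum (KL), which is presumably why the abstract calls the metrics complementary.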

75. 【2606.16278】RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

Link: https://arxiv.org/abs/2606.16278

Authors: Zhenhua Wu,Yun Pang,Mingkun Chang,Yuwei Ning,Liangzhi Wang,Yi Xiao,Guanbin Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Long-tail hazardous scenarios, Long-tail hazardous, safety-oriented autonomous driving, reproduce at scale, hazardous scenarios

Comments: none

Abstract:Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

76. 【2606.16274】GraphWorld: Long-Horizon Planning with World Models for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2606.16274

Authors: Ziying Song,Caiyan Jia,Lin Liu,Lei Yang,Shengkai Zhang,Feiyang Jia,Fengda Zhao,Peiliang Wu,Shaoqing Xu,Chen Lv,Yadan Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, short-horizon decision making, single learning framework, achieving strong performance, unifying perception

Comments: 16 pages, 5 figures

Abstract:End-to-end autonomous driving has made significant progress by unifying perception, prediction, and planning within a single learning framework, achieving strong performance in short-horizon decision making. However, most existing E2E-AD methods remain confined to short-horizon planning and lack the ability to model long-term temporal dependencies, which severely limits their generalization and security in complex and highly interactive driving scenarios. In this work, we propose GraphWorld, an E2E-AD framework that explicitly enhances long-horizon planning through latent world modeling. We introduce an Ego-Centric Interaction Graph, which adaptively models critical neighboring agents based on spatial proximity, and propagates relational context to planning queries via cross-node cross-attention. We present a World-State-Conditioned Planning that learns ego-centric latent world representations by modeling interactions between an ego vehicle and surrounding agents. This latent world state captures key interaction dynamics and safety-relevant semantics, and serves as a conditioning signal to guide long-horizon, safety-aware trajectory planning. Extensive experiments on Bench2Drive, NAVSIMv1/2, and nuScenes demonstrate that GraphWorld significantly reduces collision rates and improves long-horizon planning performance, validating its effectiveness in complex driving environments.

77. 【2606.16271】Contrastive Learning for Seismic Horizon Tracking with Domain-Specific Priors

Link: https://arxiv.org/abs/2606.16271

Authors: Alexandre Thouvenot,Lionel Boillot,Vincent Gripon

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: reduced trace-level precision, provide accurate trace-level, accurate trace-level alignment, signal-based propagators provide, propagators provide accurate

Comments: 5 pages, 5 figures. Submitted to the IEEE GRSL for possible publication

Abstract:Unsupervised 3D seismic horizon tracking faces a key limitation: signal-based propagators provide accurate trace-level alignment but often fail near faults, whereas texture-driven deep models are more robust to discontinuities, typically at the cost of labeled data requirements and reduced trace-level precision. We propose a self-supervised fusion of both paradigms in which signal-derived local horizon correspondences act as domain-specific priors to train a texture-based deep learning model. Specifically, we estimate reliable trace-to-trace flows from reflector slopes and use them to form positive pairs in a contrastive objective, while restricting training to high-confidence neighborhoods, optionally augmented with a fault mask. The objective is not to infer ambiguous correspondences close to discontinuities, but to preserve horizon identity across them. As a result, the network learns voxel-wise embeddings that preserve local signal continuity while enabling horizon propagation beyond discontinuities through similarity search. Experiments on the public F3 dataset and a faulted synthetic dataset achieve lower mean absolute error (MAE) than unsupervised baselines and competitive performance against a semi-supervised method using a single labeled slice.

78. 【2606.16256】KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

Link: https://arxiv.org/abs/2606.16256

Authors: Mao-Lin Luo,Yi-Lin Zhang,Zi-Hao Zhou,Yankun Hong,Xialiang Tong,Mingxuan Yuan,Tong Wei,Min-Ling Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Continual learning, vision-language models requires, models requires balancing, retaining pre-trained knowledge, maintaining the plasticity

Comments: none

Abstract:Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, balancing these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze knowledge distribution of Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retention is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general transferable knowledge is mainly encoded in the shallow layers and the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace, combined with a shallow-to-deep layer scaling, to prevent interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, while simultaneously assigning smaller update magnitudes to shallow layers and larger ones to deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ successfully balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.
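
The projection-plus-scaling step described in the abstract above can be illustrated with plain vectors. This is a rough sketch and not the KeepLoRA++ implementation: it assumes the protected directions are given as an orthonormal basis, and the linear shallow-to-deep schedule in `layer_scale` is a hypothetical choice made only for illustration.

```python
def project_to_residual(grad, basis):
    """Remove from grad its components along each vector of an orthonormal basis."""
    residual = list(grad)
    for u in basis:
        # With an orthonormal basis, each coefficient can be taken from the original grad.
        coeff = sum(g * ui for g, ui in zip(grad, u))
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return residual


def layer_scale(layer_idx, num_layers, min_scale=0.1):
    """Hypothetical linear schedule: min_scale at the shallowest layer, 1.0 at the deepest."""
    if num_layers == 1:
        return 1.0
    return min_scale + (1.0 - min_scale) * layer_idx / (num_layers - 1)


def scaled_residual_update(grad, basis, layer_idx, num_layers):
    """Project the gradient onto the residual subspace, then apply the layer scale."""
    scale = layer_scale(layer_idx, num_layers)
    return [scale * r for r in project_to_residual(grad, basis)]
```

Here `basis` stands in for both the principal subspace of the pre-trained weights and the dominant directions of earlier task features, merged into one orthonormal set.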

79. 【2606.16255】UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Link: https://arxiv.org/abs/2606.16255

Authors: Shuai Wang,Liang Li,Yang Chen,Ruopeng Gao,Yao Teng,Limin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unified Multimodal Models, general-purpose multimodal intelligence, generation, Multimodal Models, understanding

Comments: This work was completed in November 2025

Abstract:Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.

80. 【2606.16253】Learned Image Compression for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.16253

Authors: Hyeonjun Kim,Jegwang Ryu,Sangbeom Ha,Junhyeok Lee,Jun-Hyuk Kim,Hyemin Ahn,Jaeho Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: models increasingly rely, making visual communication, high-frequency multi-camera observations, models increasingly, increasingly rely

Comments: none

Abstract:Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

81. 【2606.16241】Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis

Link: https://arxiv.org/abs/2606.16241

Authors: Xiang Gao,Yunpeng Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single image presents, flipping or rotation, intriguing form, presents different conceptual, conceptual interpretations

Comments: none

Abstract:Visual anagram is an intriguing form of art creation wherein a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet still suffers from several key limitations including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from pixel-based T2I model to the adversarially distilled latent-based one, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. As the core of our approach, S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; (iii) attention-guided noise fusion. Building upon these components, our method dubbed S2CO-Anagram is able to generate higher-resolution anagram images with noticeably superior visual harmony and semantic faithfulness than related SOTA approaches, all while achieving substantially faster inference speed. Code will be publicly available.

82. 【2606.16234】Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

Link: https://arxiv.org/abs/2606.16234

Authors: Tengfei Ma,Ruiqi Wu,Chenran Zhang,Ye Geng,Na Su,Xiangyuan Duanmu,Tao Zhou,Yi Zhou,Wen Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Fundus fluorescein angiography, fluorescein angiography, retinal vascular abnormalities, critical for assessing, acquisition is invasive

Comments: Accepted to MICCAI 2026 (Early Accept)

Abstract:Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes--the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at this https URL.

83. 【2606.16212】LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

Link: https://arxiv.org/abs/2606.16212

Authors: Jigang Duan,Jiayi Wang,Heran Wang,Ping Yang,Genwei Ma,Xing Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: causing streak artifacts, acquiring fewer projection, angular undersampling makes, reconstruction severely ill-posed, fewer projection views

Comments: none

Abstract:Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.

84. 【2606.16203】DynFS-MoE: Dynamic Functional-Structural Mixture-of-Experts for Post-Traumatic Epilepsy Diagnosis

Link: https://arxiv.org/abs/2606.16203

Authors: Jun-En Ding,Spencer Chen,Henry Noren,Daniel Valdivia,Christine Yohn,Suhina Patel,Taylor Zink,Hai Sun,Feng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: early identification remains, identification remains challenging, remains challenging due, Post-traumatic epilepsy, traumatic brain injury

Comments: none

Abstract:Post-traumatic epilepsy (PTE) is a severe complication of traumatic brain injury (TBI), yet early identification remains challenging due to the complex structural and functional alterations it induces in the brain. To address this, we propose a dynamic multimodal Mixture-of-Experts (MoE) framework that integrates functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. Within this framework, modality-specific and cross-modal experts learn complementary representations, while a Modality-Class MoE (MCoE) module dynamically dispatches expert weights according to each classification objective. Experimental results across three binary classification tasks demonstrate that the framework consistently outperforms static fusion baselines, and high-interpretability analyses further reveal meaningful region-of-interest (ROI) interactions. This dynamic multimodal expert framework effectively captures class-dependent brain interaction patterns and provides an interpretable approach for PTE diagnosis and risk stratification.

85. 【2606.16202】EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Link: https://arxiv.org/abs/2606.16202

Authors: Hyunjin Kim,Ri-Zhao Qiu,Guangqi Jiang,Xiaolong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: faithfully predicting complex, complex deformable dynamics, Humans naturally understand, predicting complex deformable, naturally understand object

Comments: Project Page: [this https URL](https://hjhyunjinkim.github.io/EgoPhys)

Abstract:Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

86. 【2606.16198】GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

Link: https://arxiv.org/abs/2606.16198

Authors: Ruoxuan Yang,Tieyuan Chen,Xiaofeng Huang,Haibing Yin,Jun Wang,Xiping Chen,Jun Yin,Xuesong Gao,Weiyao Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Viewer sentiment prediction, latent affective response, affective response evoked, video advertisements aims, Large Language Models

Comments: 13 pages, 5 figures

Abstract:Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs' clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify "what happens", while grounded visual entity crops anchor "who or what participates in each event" to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.

87. 【2606.16196】When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

Link: https://arxiv.org/abs/2606.16196

Authors: Anju Chhetri,Pratik Shrestha,Ramesh Rana,Prashnna Gyawali,Binod Bhattarai

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep neural networks, safe clinical deployment, achieved remarkable performance, distributional shifts poses, Deep neural

Comments: none

Abstract:Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the class logits stability. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high stakes medical applications.

88. 【2606.16193】Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

Link: https://arxiv.org/abs/2606.16193

Authors: Yusong Zhao,Hengyi Wang,Tanuja Ganu,Akshay Nambi,Hao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated strong performance

Comments: none

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding drawbacks from the shared-prefix coupling of nesting, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability in terms of hierarchical concept coherence over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.

89. 【2606.16188】TEASR: training-efficient any-step diffusion transformer for real-world image super-resolution

Link: https://arxiv.org/abs/2606.16188

Authors: Xiang Gao,Chenxin Zhu,Yushun Fang,Qiang Hu,Xiaoyun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-World Image Super-Resolution, powerful generative priors, Image Super-Resolution, Real-World Image, slow iterative sampling

Comments: none

Abstract:Diffusion models excel in Real-World Image Super-Resolution (Real-ISR) due to their powerful generative priors but suffer from slow iterative sampling. Although existing one-step distillation methods accelerate inference, they typically require auxiliary teacher models that inflate training memory and restrict scalability to large-scale architectures. Furthermore, these fixed-step models lack the flexibility to trade off speed for quality. In this paper, we propose TEASR, a training-efficient any-step diffusion framework for Real-ISR that enables both one-step and multi-step restoration within a unified model. Our key idea is to perform self-adversarial distillation within a single diffusion model, eliminating the need for auxiliary teachers or discriminators. Specifically, we propose a timestep-aware rectification strategy that stabilizes one-step generation across noise levels. These two designs further enable the distillation of 20B-parameter diffusion models on a single GPU, significantly improving training efficiency. Moreover, we introduce a dual-branch diffusion transformer with decoupled timestep condition to separate the current noise state and the denoising target to enhance sampling quality. Extensive experiments demonstrate that TEASR supports seamless any-step sampling and consistently outperforms state-of-the-art methods across multiple datasets.

90. 【2606.16185】Learned JPEG Compression for DNN Vision

Link: https://arxiv.org/abs/2606.16185

Authors: Kaixiang Zheng,Ahmed H. Salamah,Siyu Chen,En-Hui Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: JPEG, compression technique designed, DNN inference performance, JPEG codec, JPEG encoding parameters

Comments: none

Abstract:JPEG, a lossy image compression technique designed for human viewers, has maintained its dominance for decades. However, in the era of artificial intelligence (AI), a substantial portion of image data, often compressed by JPEG, is and will continue to be consumed by deep neural networks (DNNs) instead of humans, thus creating a need to optimize JPEG for DNN inference performance. To this end, we propose learned JPEG compression for DNN vision (J4D), a novel training framework for determining JPEG encoding parameters to minimize compression rate while maximizing DNN inference performance. The major challenge of solving this optimization problem lies in representing the JPEG codec and compression rate in closed form. By incorporating a differentiable soft quantizer based on a probabilistic quantization scheme, we not only obtain a differentiable proxy for the JPEG codec, but are also able to compute the entropy of the coded source analytically, which is a close estimate of the actual compression rate. Equipped with both the differentiable JPEG codec and the information-theoretic rate estimator, we are then able to solve the aforementioned optimization problem with backpropagation. After training, the learned encoding parameters will be subsequently used in actual JPEG encoding based on probabilistic quantization. Extensive experimental results across multiple datasets and DNN architectures demonstrate that J4D consistently and significantly outperforms the default JPEG and other competitive JPEG codecs optimized for DNNs. Notably, compared to the default JPEG, J4D achieves an increase in accuracy by as much as 11.60% at the same rate, or a reduction of compression rate up to 80.05% at the same accuracy. Additionally, with the help of J4D, we show the potential to design universal JPEG encoding parameters for various DNN architectures for the first time.
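
To make the idea of a differentiable quantizer with an analytic rate estimate more concrete, here is a generic softmax-based sketch. It is not the J4D formulation, which the abstract does not spell out: the distance-based softmax, the sharpness parameter `alpha`, and all function names are assumptions chosen for illustration.

```python
import math


def level_probabilities(x, levels, alpha=10.0):
    """Softmax over the negative squared distance from x to each reconstruction level."""
    scores = [-alpha * (x - level) ** 2 for level in levels]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def soft_quantize(x, levels, alpha=10.0):
    """Expected reconstruction level: a smooth stand-in for hard rounding."""
    probs = level_probabilities(x, levels, alpha)
    return sum(p * level for p, level in zip(probs, levels))


def rate_estimate_bits(samples, levels, alpha=10.0):
    """Entropy in bits per sample of the level distribution averaged over all samples."""
    marginal = [0.0] * len(levels)
    for x in samples:
        for i, p in enumerate(level_probabilities(x, levels, alpha)):
            marginal[i] += p / len(samples)
    return -sum(p * math.log2(p) for p in marginal if p > 0)
```

As `alpha` grows, `soft_quantize` approaches nearest-level rounding, while smaller values keep the output smooth in `x` so that gradients can flow through it.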

91. 【2606.16184】Closed-Loop Triplet Synergistic Generation for Long-Form Video

Link: https://arxiv.org/abs/2606.16184

Authors: Xinlei Yin,Xiulian Peng,Xiao Li,Zhiwei Xiong,Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: remains challenging due, generation remains challenging, Multi-shot long-form video, long-form video generation, video generation remains

Comments: none

Abstract:Multi-shot long-form video generation remains challenging due to identity drift and compounding inconsistencies across shots. While storyboard-driven pipelines improve controllability, they are often executed in a feed-forward manner, with limited mechanisms to incorporate generated visual evidence back into subsequent conditioning. We propose CoTriSyGen, an agentic framework that formulates multi-shot long video generation as a closed-loop visual-text-memory synergy process, where planned intent, persistent memory, and generated visuals are jointly leveraged for iterative correction and long-range coherence. A vision-language-model-based analyzer reasons over this triplet and produces updates to both prompts and memory along two pathways: (i) intra-shot refinement, which triggers targeted regeneration when semantic or compositional violations are detected and refines image-to-video prompt for coherent motions; and (ii) inter-shot refinement, which rewrites subsequent-shot prompts to propagate newly manifested entities or attributes and improve prompt quality (e.g., compositional grounding and cinematic fluency) based on generated evidence. The loop is grounded in an entity-centric memory modeled as a mutable visual state that evolves as the story progresses, which is continuously updated by both the generator and the analyzer by adding new and evolved entities to reflect appearance changes, accumulated multi-view evidence, and multi-entity compositions. Experiments on our curated StoryBench benchmark demonstrate substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.

92. 【2606.16180】To forget is to preserve: Machine Unlearning for 3D medical image segmentation

Link: https://arxiv.org/abs/2606.16180

Authors: Nitesh Kumar Singh, Akhilesh Singh, Arjun Arora

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Data Protection Regulation, General Data Protection, data privacy laws, Protection Regulation, trained machine learning

Comments: 9 pages, 5 figures

Abstract:With new data privacy laws such as the General Data Protection Regulation (GDPR) [1] that allow individuals to ask that any of their personal information be erased from trained machine learning models, there has been a push to investigate the unlearning of data from models as a way to comply with these laws. In this regard, based on four mechanics, we consider several approximate unlearning strategies applied to the MRBrainS18 dataset [2]. We use a 3D ResNet-50 [3] as a backbone architecture for segmentation that has been pre-trained with the Med3D framework [4]. Considering the pre-trained model as a baseline, we evaluate respective retention accuracy on 2 types of subjects, i.e., retain and forget. We assess these approaches through their Dice similarity coefficient and mean absolute error (MAE) values using two separate training horizons 20 and 50 epochs. The results show that the Noisy Label strategy had the best overall trade-off with a decrease of 93% in the forget set while maintaining 84% accuracy for the retained set after 50 epochs. All other strategies showed extreme levels of forgetting at higher epoch numbers while also demonstrating catastrophic degradation of their retain set performance. The results of this study provide a strict baseline of performance metrics for unlearning on a subject-specific level and provide practitioners with clear criteria for selecting the proper strategies.
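
The two metrics named in this abstract can be sketched as follows. This is a minimal illustration on flat binary masks, not the authors' evaluation code; the function names and the flat-list input format are assumptions made for the example.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly, so define Dice as 1.0 in that case.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_absolute_error(pred, target):
    """Average absolute per-voxel difference between prediction and target."""
    if len(pred) != len(target):
        raise ValueError("inputs must contain the same number of voxels")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

In an unlearning study, a useful outcome would be a low Dice score on the forget subjects together with a high Dice score on the retain subjects.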

93. 【2606.16168】Fi-Gaussian: Frequency-Aware Implicit Gaussian Splatting for Single Image Dehazing

Link: https://arxiv.org/abs/2606.16168

Authors: Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi, Kunyang Huang, Wenbo Chu, Keqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: implicit Gaussian splatting, Single image dehazing, implicit Gaussian, frequency-aware implicit Gaussian, image dehazing continues

Abstract:Single image dehazing continues to be hindered by the loss of high-frequency details and the difficulty of accurate physical scattering modeling. To address these issues, we propose Fi-Gaussian, a frequency-aware implicit Gaussian splatting network for single image dehazing. Unlike explicit rendering methods that rely on 3D point clouds, our method employs implicit Gaussian splatting to adaptively model the underlying distribution of clear images as a continuous representation in 2D feature space. The core of the network is a frequency-aware implicit Gaussian splatting module, which decouples low-frequency structural information and high-frequency texture information in the frequency domain and then performs adaptive Gaussian aggregation with complex-valued weights to recover fine details. In addition, a physics-driven scattering renormalization mechanism is introduced to estimate the transmission map and atmospheric light under the guidance of implicit Gaussian priors. Extensive experiments on multiple benchmark datasets demonstrate that Fi-Gaussian achieves state-of-the-art quantitative performance and produces visually superior dehazed results, validating the effectiveness of implicit Gaussian splatting for low-level vision tasks.

94. 【2606.16163】Dehaze-GaussianImage: Zero-Shot Dehazing via Efficient 2D Gaussian Splatting Representation

Link: https://arxiv.org/abs/2606.16163

Authors: Yuhan Chen, Wenxuan Yu, Guofa Li, Kunyang Huang, Ying Fang, Yicui Shi, Wenbo Chu, Keqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing single image, Existing single, implicit neural networks, single image dehazing, constrained by computational

Abstract:Existing single image dehazing methods are often constrained by computational redundancy in pixel-level optimization and the lack of physical interpretability in implicit neural networks. These limitations hinder the balance between representation efficiency and reconstruction fidelity. To address these issues, we propose Dehaze-GaussianImage, the first zero-shot framework that introduces 2D Gaussian Splatting (2DGS) into the image dehazing domain to break the traditional pixel-grid processing paradigm. Distinct from static convolutional neural networks (CNNs) or Transformers, our approach models hazy images as continuous and dynamically evolvable anisotropic Gaussian fields. Specifically, we propose a novel reconstruction-decoupling zero-shot learning strategy that embeds the atmospheric scattering model into the Gaussian parameter space. This strategy drives Gaussian primitives to adaptively split, clone, and prune during optimization, achieving geometric-level decoupling of the transmission medium and clear textures. Furthermore, explicit structure-preserving constraints are introduced to suppress artifacts commonly caused by traditional physical priors. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in a fully unsupervised manner with minimal parameters, highlighting the potential of explicit Gaussian representation for low-level vision tasks.
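
For readers unfamiliar with the atmospheric scattering model mentioned in this abstract, its standard form is I = J * t + A * (1 - t), where I is the hazy pixel, J the clear pixel, t the transmission and A the atmospheric light. The sketch below only illustrates that formula and its inversion on flat intensity lists; it is not the paper's Gaussian-based method, and `t_min` is an assumed clamp commonly used to avoid dividing by near-zero transmission.

```python
def add_haze(clear, transmission, airlight):
    """Forward model: I = J * t + A * (1 - t), applied per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Inverse model: J = (I - A) / max(t, t_min) + A, applied per pixel."""
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(hazy, transmission)]
```

When the transmission and atmospheric light are known exactly and t stays above `t_min`, `remove_haze` recovers the clear pixels; the hard part of dehazing is estimating those two quantities.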

95. 【2606.16161】Multimodal LLM-Empowered Re-Ranking for Generalizable Person Re-Identification

Link: https://arxiv.org/abs/2606.16161

Authors: Jiachen Li, Xiaojin Gong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted growing research, growing research interest, research interest due, Domain Generalizable, person re-identification

Abstract:Domain Generalizable (DG) person re-identification (Re-ID) has attracted growing research interest due to its potential for deployment in unseen real-world scenarios. Most existing approaches address DG Re-ID by focusing on training domain-generalizable encoders but ignore the possible refinements in inference stage. In contrast, this work explores an alternative direction which improves inference re-ranking to enhance DG Re-ID. Conventional re-ranking methods typically rely on neighborhood-based distances to refine the initial ranking list, inherently depending on features produced by the Re-ID encoder. However, they deteriorate on target domains since the encoder lacks sufficient generalizability to produce reliable feature distances on unseen scenarios. Inspired by the remarkable generalization capabilities of recent Multimodal Large Language Models (MLLMs), we propose an MLLM-empowered distance metric to improve re-ranking in DG Re-ID. Specifically, we first adapt an MLLM to Re-ID data through supervised fine-tuning, which incorporates a domain-agnostic prompt and a query-candidate hard mining scheme. Then, the adapted MLLM is employed to compute a $\mu$-distance during inference, which is robust to domain gap and significantly enhances subsequent re-ranking performance. Our approach is model-agnostic and can be seamlessly integrated into previous re-ranking frameworks. Extensive experiments demonstrate that our approach consistently yields substantial performance improvements across multiple DG Re-ID benchmarks. The code of this work will be released at this https URL soon.

96. 【2606.16159】Continuous Splatting meets Retinex: Continuous Gaussian Splatting and Implicit Reflectance Modeling for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2606.16159

Authors: Yuhan Chen, Yicui Shi, Guofa Li, Wenxuan Yu, Ying Fang, Guangrui Bai, Wenbo Chu, Keqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: downstream vision tasks, high-level downstream vision, Low-light image enhancement, recover clear images, image enhancement aims

Abstract:Low-light image enhancement aims to recover clear images from low-illumination observations and is crucial for high-level downstream vision tasks. However, existing methods frequently encounter color distortion and structural artifacts when balancing global smooth illumination adjustment and local high-frequency detail recovery. To address these issues, we propose CGS-Retinex as the first low-light image enhancement framework based on explicit-implicit joint modeling. Our framework deeply integrates continuous Gaussian splatting with Retinex theory. Specifically, we represent the image grid as a continuous parameter field and propose a continuous Gaussian renderer to estimate the spatially continuous global illumination distribution. This approach fundamentally eliminates grid artifacts caused by discrete Gaussian sampling. Furthermore, we introduce an implicit neural representation to model reflectance independently. We leverage shallow high-frequency features to guide the network in accurately reconstructing degraded texture details. Within the Retinex framework, we incorporate physics-inspired brightness consistency constraints and illumination smoothness regularization to enable explicit illumination and implicit reflectance to maintain proper exposure and achieve high-fidelity recovery of high-frequency structures and colors. Extensive experiments demonstrate that CGS-Retinex significantly suppresses dark-region noise and overexposure while achieving exceptional high-frequency structural fidelity and color restoration by precisely decoupling illumination and texture. This work establishes a novel continuous physical representation paradigm for low-light image enhancement.

97. 【2606.16158】Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

Link: https://arxiv.org/abs/2606.16158

Authors: Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, perceive fine-grained details

Abstract:While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at this https URL.
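
One generic way to choose a routing threshold with a recall guarantee is split conformal calibration, sketched below. This is an assumption about how such a router could be calibrated, not LazyMCoT's actual procedure: `hard_scores` is a hypothetical list of uncertainty scores collected on held-out samples that the model answered incorrectly without grounding, and the guarantee relies on new hard samples being exchangeable with them.

```python
import math


def conformal_routing_threshold(hard_scores, alpha):
    """Pick tau so that routing samples with uncertainty >= tau keeps, in
    expectation, at least a (1 - alpha) share of hard samples."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(hard_scores)
    # Rank of the calibration score used as the threshold (1-based).
    k = math.floor(alpha * (n + 1))
    if k == 0:
        # Too few calibration points for this alpha: route every sample.
        return float("-inf")
    return sorted(hard_scores)[k - 1]


def needs_grounding(uncertainty, tau):
    """Route a sample to the expensive grounding stage when it looks uncertain."""
    return uncertainty >= tau
```

A new hard sample is missed only if its score ranks among the k smallest of the n + 1 scores, which happens with probability at most k / (n + 1), and that is at most alpha.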

98. 【2606.16153】A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

Link: https://arxiv.org/abs/2606.16153

Authors: Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: neurological disorder identification, treatment planning, disease monitoring, Medical image segmentation, disorder identification

Comments: 12 pages, 3 figures, 1 table. All related resources are available at [this https URL](https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main)

Abstract:Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: this https URL.

99. 【2606.16131】Shift-and-Sum Quantization for Visual Autoregressive Models

Link: https://arxiv.org/abs/2606.16131

Authors: Jaehyeon Moon, Bumsub Ham

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: enables efficient deployment, Post-training quantization, enables efficient, efficient deployment, deployment of deep

Comments: ICLR 2026

Abstract:Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.

100. 【2606.16124】Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Link: https://arxiv.org/abs/2606.16124

Authors: Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing visual, Remote sensing, remote sensing image, natural language expression, aims to localize

Abstract:Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.

101. 【2606.16119】EdgeZSAD: Practical Zero-Shot Anomaly Detection on Edge Devices

Link: https://arxiv.org/abs/2606.16119

Authors: Taewan Cho, Andrew Jaeyong Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: zero-shot anomaly detection, edge deployment constraints, anomaly detection, inspection needs zero-shot, zero-shot anomaly

Abstract:Industrial inspection needs zero-shot anomaly detection (ZSAD) that remains useful under edge deployment constraints. Recent methods often rely on ViT-L foundation backbones (~300M parameters), which exceed the memory and operator budget of typical embedded hardware. We study this regime through EdgeZSAD, a compact reference system built around a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). We train a single checkpoint in a source-trained, target-unseen protocol and evaluate it across six industrial benchmarks. Across three independent runs, the resulting model reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA, while remaining directly deployable on Jetson Orin Nano Super (TensorRT FP16) and RB5 Gen2 (QNN GPU FP16). Across the six device-rescored benchmarks, image-AUROC drift stays below 0.2 points, indicating that the exported graph preserves host-side ranking behavior in the evaluated deployment setting.
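
The image AUROC figures quoted here (reported on a 0 to 100 scale) measure how well anomaly scores rank defective images above normal ones. A minimal pairwise sketch of the metric follows; it is quadratic in the number of samples and meant only to make the definition concrete, not to reproduce the paper's evaluation pipeline.

```python
def auroc(labels, scores):
    """Probability that a random anomalous sample (label 1) receives a higher
    score than a random normal sample (label 0); ties count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Under this definition, the reported drift of less than 0.2 points would correspond to `abs(auroc(y, host_scores) - auroc(y, device_scores)) * 100 < 0.2`, where `host_scores` and `device_scores` are hypothetical score lists from the two runtimes.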

102. 【2606.16103】SceneCraft: Interactive System for Image Editing via Scene Graph

Link: https://arxiv.org/abs/2606.16103

Authors: Duc-Manh Phan, Ngoc-Dai Tran, Duy-Khang Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabled natural language-driven, Recent advances, natural language-driven image, multiple interacting objects, enabled natural

Abstract:Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.

103. 【2606.16101】Effective and Low-cost Lane-based Map Localization for Vehicle-Centric Route Generation

Link: https://arxiv.org/abs/2606.16101

Authors: Hong-Shiang Lin, Jung-Hsin Chen, Yu-Luen Tzeng, Wei-Hao Chen, Yi-Chen Lee, Li-Jhe Chen, Peng-Yuan Chen

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: driving guidance systems, Driver-centric route representation, intuitive driving guidance, route representation plays, guidance systems

Comments: 14 pages, 18 figures. Under Review

Abstract:Driver-centric route representation plays a vital role in intuitive driving guidance systems. This paper presents OLRA, a low-cost, map-localization-based framework that derives driver-view-aligned routes by matching map-based navigation routes with camera-detected lane markings. This alignment process mutually enhances vehicle localization accuracy and visual route consistency. To bridge the evaluation gap across different paradigms, we introduce practical route evaluation metrics and benchmark OLRA against OpenPilot, a representative direct-generation approach. Experimental results on the nuScenes dataset demonstrate that OLRA outperforms OpenPilot in complex road segments and in route estimation at distances beyond 20 meters, achieving lower overall Euclidean error. This study is expected to promote future research in low-cost, map-localization-based route generation methods.

104. 【2606.16092】VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Link: https://arxiv.org/abs/2606.16092

Authors: Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, Stanley Jungkyu Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Real-world documents combine, produces text-only responses, predominantly produces text-only, visual elements, Real-world documents

Comments: Accepted to CVPR 2026. Main paper: 5 figures, 4 tables; includes supplementary material

Abstract:Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

105. 【2606.16082】Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Link: https://arxiv.org/abs/2606.16082

Authors: Guanyi Qin, Junjie Zhang, Chunming He, Yibing Fu, Jie Liang, Tianhe Wu, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Image Quality, Image Quality Assessment, assess image quality, increasingly adopted, Image

Abstract:Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may overwhelm the visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to uncover visibility and hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for a calibrated quality score. Furthermore, to ensure efficient and purposeful tool calls, we introduce a batch-aware training strategy to reward tool interactions that can yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.
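
The two view tools described here correspond to familiar image operations. The sketch below shows one plausible form of each on plain Python lists: standard power-law gamma correction, and a crop followed by nearest-neighbour upscaling. The function names, arguments and upscaling choice are assumptions for illustration and are not taken from the Tool-IQA code.

```python
def gamma_correct(pixels, gamma):
    """Power-law mapping out = 255 * (in / 255) ** gamma on 8-bit intensities.
    gamma < 1 brightens dark regions; gamma > 1 darkens bright ones."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [round(255.0 * (p / 255.0) ** gamma) for p in pixels]


def magnify(image, top, left, height, width, factor=2):
    """Crop a region from a 2D list of pixels and enlarge it by repeating
    each pixel `factor` times along both axes (nearest-neighbour upscaling)."""
    crop = [row[left:left + width] for row in image[top:top + height]]
    return [
        [value for value in row for _ in range(factor)]
        for row in crop
        for _ in range(factor)
    ]
```

For example, `gamma_correct([0, 64, 255], 0.5)` lifts the dark mid-tone while leaving black and white unchanged, which is the kind of adjustment that can expose artifacts hidden in shadows.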

106. 【2606.16075】AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

Link: https://arxiv.org/abs/2606.16075

Authors: Yang Shi, Songwen Pei, Yang Gao, Bingxue Zhang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including training data, base models, fine-tuning behaviors, including training, enables value creation

Abstract:Generative AI enables value creation through multi-stage collaboration among heterogeneous contributors, including training data, base models, fine-tuning behaviors, and prompts. However, how to fairly allocate the data value remains largely unexplored. This paper formulates multi-stage generative AI value allocation as a new research problem and identifies three core challenges: heterogeneous data contribution valuation, data rights mapping, and trustworthy execution. We propose the AME (Attribution-Mapping-Execution) framework, which integrates data contribution valuation, data rights mapping, and trustworthy execution into a single workflow. Experimental results demonstrate that the AME framework achieves data value allocation outcomes more consistent with human reference judgments while maintaining low-cost trustworthy execution. Our work provides an initial foundation for value assessment and revenue allocation in generative AI data markets.

107. 【2606.16067】Stepwise Token Selection for Efficient Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16067

Authors: Landi He, Shawn Young, Lijian Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: making token reduction, large language models, multimodal large language, visual token prefix, inference cost

Abstract:In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.
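
To put the closing numbers in perspective, the arithmetic can be checked with a few lines. LLaVA-v1.5 encodes a 336x336 image into 24 x 24 = 576 visual tokens; the budget of 64 retained tokens used below is an assumption chosen because it reproduces the quoted 88.9% figure, not a value stated in the abstract, and the 100 ms baseline is purely illustrative.

```python
def removed_fraction(total_tokens, kept_tokens):
    """Share of visual tokens that pruning discards."""
    if not 0 <= kept_tokens <= total_tokens or total_tokens == 0:
        raise ValueError("need 0 <= kept_tokens <= total_tokens and total_tokens > 0")
    return 1.0 - kept_tokens / total_tokens


def latency_after_speedup(baseline_ms, speedup):
    """Latency remaining after an N-times speed-up of the same stage."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_ms / speedup
```

Keeping 64 of 576 tokens removes about 88.9% of them, and a 1.88x prefill speed-up turns a 100 ms prefill into roughly 53 ms. The speed-up is far smaller than the 9x token reduction, which is expected because prefill also processes text tokens and carries fixed overheads.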

108. 【2606.16048】PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

Link: https://arxiv.org/abs/2606.16048

Authors: Chidera Agbasiere, Mikhail Sannikov, Faith Ogunwoye, Erik Shaikhiev, Alex Kozinov, Ilya Mikhalchuk, Iana Zhura, Dzmitry Tsetserukou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing dense, sparse LiDAR point, LiDAR point clouds, autonomous driving, promising solution

Abstract:Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, where latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale and suffer from ground truth data corrupted by odometry drift that systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.
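
The headline metric, squared Chamfer distance, can be sketched in a few lines. Definitions vary across papers (sum versus mean of the two directions, squared versus unsquared distances), so the variant below, the sum of the two directional means of squared nearest-neighbour distances, is an assumption rather than the exact formula used in this work. The brute-force search is only suitable for small point sets.

```python
def squared_chamfer_distance(points_a, points_b):
    """Symmetric squared Chamfer distance between two non-empty point sets,
    each given as a list of (x, y, z) tuples."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def directional_mean(source, destination):
        # Mean squared distance from each source point to its nearest neighbour.
        return sum(
            min(squared_distance(p, q) for q in destination) for p in source
        ) / len(source)

    return directional_mean(points_a, points_b) + directional_mean(points_b, points_a)
```

The quoted improvement from 0.396 m^2 to 0.024 m^2 is a ratio of 0.396 / 0.024 = 16.5, consistent with the "approximately 16x" in the abstract.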

109. 【2606.16036】Trusting Right Predictions for Wrong Reasons: A LIME Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

Link: https://arxiv.org/abs/2606.16036

Authors: Samarpan Poudel, Vladislav D Veksler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: million deaths annually, making reliable diagnosis, Convolutional Neural Network, million deaths, NCCD lung cancer

Abstract:Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.
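
The contrast between prediction agreement and explanation agreement can be made concrete with a correlation coefficient. The sketch below uses Pearson correlation, which is an assumption: the abstract does not say which correlation measure the dual-correlation framework relies on, and the toy vectors in the note that follows are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)
```

Two models can output nearly identical class probabilities, say `[0.9, 0.1, 0.8, 0.2]` and `[0.95, 0.05, 0.85, 0.15]`, while their per-region LIME weights for the same image, say `[0.5, 0.0, 0.1, 0.4]` and `[0.1, 0.4, 0.5, 0.0]`, barely agree or even oppose each other. The first pair correlates above 0.99 and the second is strongly negative, which is the kind of gap the study reports.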

110. 【2606.16031】The Third Challenge on Image Denoising at NTIRE 2026: Methods and Results

Link: https://arxiv.org/abs/2606.16031

Authors: Lei Sun, Hang Guo, Bin Ren, Shaolin Su, Xian Wang, Danda Pani Paudel, Luc Van Gool, Radu Timofte, Yawei Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Denoising, specifically focusing, high-noise regime, white Gaussian noise, additive white Gaussian

Comments: Accepted by CVPRW 2026

Abstract:This paper reports on the NTIRE 2026 Challenge on Image Denoising, specifically focusing on the high-noise regime ($\sigma = 50$). The competition investigates advanced neural architectures designed to restore high-fidelity details from images corrupted by additive white Gaussian noise (AWGN). Unlike constrained benchmarks, this track emphasizes peak quantitative performance, measured by Peak Signal-to-Noise Ratio (PSNR), without limitations on parameter count or computational overhead. By synthesizing contributions from 20 finalist teams out of 116 registrants, this report benchmarks the latest technical innovations and provides a comprehensive snapshot of the current state-of-the-art in unconstrained image restoration.
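For readers unfamiliar with the setting, the following minimal sketch adds AWGN with $\sigma = 50$ to an 8-bit-range image and computes PSNR. It assumes a peak value of 255 and no clipping of the noisy image; the challenge's exact evaluation protocol is not given in the abstract, and `add_awgn` and `psnr` are illustrative names.

```python
import numpy as np

def add_awgn(image, sigma=50.0, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `sigma` (no clipping)."""
    rng = np.random.default_rng(seed)
    return image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)

def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB for images on a [0, peak] scale."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

clean = np.full((64, 64), 128.0)   # flat grey test image (toy input)
noisy = add_awgn(clean, sigma=50.0)
noisy_psnr = psnr(clean, noisy)
```

Under these assumptions the unprocessed noisy input sits near $20\log_{10}(255/50) \approx 14.15$ dB, which gives a sense of how much a denoiser has to recover in this regime.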

111. 【2606.16015】Stringalign: Moving beyond summary statistics with a transparent Unicode-aware tool for evaluating automatic transcription models

Link: https://arxiv.org/abs/2606.16015

Authors: Yngve Mardal Moe, Marie Roald

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Comparing text strings, Comparing text, text processing tasks, handwritten text recognition, crucial when evaluating

Abstract: Comparing text strings is crucial when evaluating and understanding the performance of various text processing tasks such as document recognition and audio transcription. With an increasingly complex landscape of AI-based handwritten text recognition (HTR), optical character recognition (OCR) and automatic speech recognition (ASR) models, there is a need for tools that facilitate evaluation in a flexible and reproducible way. This paper presents Stringalign, a Python library designed to simplify the evaluation process for automatic transcription projects and facilitate transparent evaluation. Stringalign's tools for examining and visualising both the rate and the types of errors a model makes give insight into possible improvements and help inform model selection for a particular task. Widely used string comparison metrics, such as the character and word error rates (CER and WER), although useful, can be ambiguous due to varying definitions of what constitutes a character and a word. Stringalign addresses this challenge by ensuring all preprocessing (i.e. normalisation and tokenisation) is transparent and easily replicable, and by providing tools to move beyond summary statistics and analyse common model errors. Moreover, Stringalign adheres to FAIR (Findable, Accessible, Interoperable, and Reusable) principles for research software while staying lightweight and easy to adapt to researchers' existing workflows. In this paper, we discuss challenges with character and word level string comparisons and show through examples that where existing tools can yield opaque and sometimes confusing results, Stringalign provides an easy-to-use and unambiguous alternative.
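The ambiguity in "what constitutes a character" can be seen with the standard library alone. The sketch below does not use Stringalign's API (which is not described in the abstract); it treats a character as a Unicode code point, which is itself an assumption, and the helper names `levenshtein` and `cer` are hypothetical.

```python
import unicodedata

def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, prediction, normalisation=None):
    """Character error rate over code points, with optional Unicode normalisation."""
    if normalisation is not None:
        reference = unicodedata.normalize(normalisation, reference)
        prediction = unicodedata.normalize(normalisation, prediction)
    return levenshtein(reference, prediction) / len(reference)

reference = "caf\u00e9"     # 'é' as a single precomposed code point
prediction = "cafe\u0301"   # 'e' followed by a combining acute accent

raw_cer = cer(reference, prediction)                       # 0.5
nfc_cer = cer(reference, prediction, normalisation="NFC")  # 0.0
```

The two strings render identically, yet the reported error rate is 50% or 0% depending on a preprocessing step that is often left unstated, which is the kind of opacity the paper argues against.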

112. 【2606.15993】Classifying by Proxy: Explainable and Reproducible Ensemble of Proxy Tasks for Child Sexual Abuse Imagery Classification

Link: https://arxiv.org/abs/2606.15993

Authors: Clara Ernesto, Carlos Caetano, Sandra Avila, João Macedo, Camila Laranjeira, Leo S. F. Ribeiro

Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sexual Abuse Imagery, Child Sexual Abuse, Child Sexual, Abuse Imagery, Sexual Abuse

Comments: 12 pages, 7 figures, 7 tables. Accepted at ACM FAccT 2026

Abstract:Child Sexual Abuse Imagery (CSAI) classification systems are needed solutions for lessening the psychological impacts often felt by law enforcement agents responsible for evaluating these materials and for efficient removal of these materials from the web. However, due to the nature of the task, researching and developing such systems is not a trivial endeavor. The images are highly sensitive, and the related datasets are under restrictive access regimes, which means most studies in the area are not reproducible or distributable and are therefore hard to compare and validate. More concerning still, most models for this task today lack an aspect often desired by law enforcement agents: explainability. In this paper, we apply an ensemble of Proxy Tasks -- tasks that correlate to CSAI classification -- yielding improvements in reproducibility, explainability, and security for distribution. This concept is applied for the first time to real CSAI, with a novel selection of relevant Proxy Tasks (selected from the CSAI literature) and training adaptations to the original framework. Our final model achieves competitive results, yielding 91.9% balanced accuracy on the RCPD dataset with the best Proxy Task combination. We furthermore contrast these results with the best-in-class representation learning model, DINO, and show that our ensemble improves accuracy and provides explanations for its classification results, a feature that a single deep learning model can seldom provide.

113. 【2606.15992】Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

Link: https://arxiv.org/abs/2606.15992

Authors: Jigyashman Hazarika

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: plain RGB video, plain RGB, tennis stroke biomechanics, RGB video, built a multi-task

Comments: 14 pages, 9 figures

Abstract: We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, $s(t) = 0.5\,v_{\text{wrist}} + 0.3\,m_{\text{elbow}} + 0.2\,m_{\text{shoulder}}$, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, $d=128$) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges: 82.9%, a 0.8-point drop. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.
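A rough sketch of how such a weighted score could drive automatic stroke detection follows. The abstract gives only the weights, so several things here are assumptions: that the wrist, elbow, and shoulder terms are per-frame speed magnitudes, that strokes are taken as thresholded local maxima of the score, and the `threshold` and `min_gap` values themselves.

```python
import numpy as np

def joint_speed(track, fps=30.0):
    """Per-frame speed of a (T, 3) joint track; the first frame gets speed 0."""
    step = np.linalg.norm(np.diff(track, axis=0), axis=1) * fps
    return np.concatenate([[0.0], step])

def stroke_score(wrist, elbow, shoulder, fps=30.0):
    """Weighted score with the 0.5 / 0.3 / 0.2 weights quoted in the abstract."""
    return (0.5 * joint_speed(wrist, fps)
            + 0.3 * joint_speed(elbow, fps)
            + 0.2 * joint_speed(shoulder, fps))

def find_strokes(score, threshold, min_gap=15):
    """Hypothetical rule: local maxima above `threshold`, at least `min_gap` frames apart."""
    peaks = []
    for t in range(1, len(score) - 1):
        is_peak = score[t] > score[t - 1] and score[t] >= score[t + 1]
        if is_peak and score[t] >= threshold:
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks

# Toy example: a single fast wrist swing centred on frame 45, other joints still.
fps = 30.0
frames = np.arange(90)
bump = np.exp(-0.5 * ((frames - 45) / 3.0) ** 2)   # wrist speed profile
wrist = np.zeros((90, 3))
wrist[:, 0] = np.cumsum(bump) / fps                # integrate speed into position
elbow = np.zeros((90, 3))
shoulder = np.zeros((90, 3))

score = stroke_score(wrist, elbow, shoulder, fps)
strokes = find_strokes(score, threshold=0.25)
```

Each detected frame index could then serve as the centre of the 30-frame window that the abstract describes feeding to the transformer, though how the paper aligns windows to peaks is not stated.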

114. 【2606.15987】A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts

Link: https://arxiv.org/abs/2606.15987

Authors: Fabio Quattrini, Carmine Zaccagnino, Costanza Bianchi, Silvia Cascianelli, Rita Cucchiara

Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Handwritten Text Recognition, target Handwritten Text, Text Recognition, Handwritten Text, target Handwritten

Comments: Accepted at ICDAR 2026

Abstract:In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.

115. 【2606.15982】Mind the Gap: Diagnosing Constraint Discovery Failures in Text-in-Image Editing

Link: https://arxiv.org/abs/2606.15982

Authors: Rui Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recognizing visible content, visible content, key challenge, challenge in multimodal, multimodal reasoning

Abstract:A key challenge in multimodal reasoning is determining which visual dependencies become relevant under a specific task, rather than merely recognizing visible content. We study this through edit-induced constraint discovery in text-in-image editing, a controlled diagnostic setting where a local text change can activate secondary consistency constraints: given a valid editing instruction and an image, can a model identify the secondary regions that must also change? Across 461 diagnostic cases, four MLLMs, and 19 constraint subtypes, models recover only 46% case-level macro recall under unguided prompting versus 94% when constraints are explicitly provided, suggesting that a substantial portion of the failure arises when models must decide which unstated dependencies to surface. Oracle-field decomposition shows that case-specific causal explanations are the most effective partial guidance (0.782 recall), above region names (0.610) or type labels (0.646), suggesting that edit-specific causal cues account for much of the oracle gain. A downstream experiment further shows that higher self-discovery recall does not necessarily improve task performance: unverified self-discovery introduces false positives that offset recall gains, motivating precision-aware constraint elicitation.
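The case-level macro recall and the precision trade-off mentioned above can be sketched as follows. Matching predicted and gold regions by exact label is an assumption (the paper's matching rule is not given in the abstract), and the example cases are invented for illustration.

```python
def case_recall(gold, predicted):
    """Fraction of gold secondary regions that the model surfaced for one case."""
    gold, predicted = set(gold), set(predicted)
    return len(gold & predicted) / len(gold) if gold else None

def case_precision(gold, predicted):
    """Fraction of surfaced regions that are actually required for one case."""
    gold, predicted = set(gold), set(predicted)
    return len(gold & predicted) / len(predicted) if predicted else None

def macro_average(values):
    """Average per-case scores, skipping cases where the score is undefined."""
    values = [v for v in values if v is not None]
    return sum(values) / len(values)

# Hypothetical cases: (gold secondary regions, regions the model listed).
cases = [
    ({"price tag", "receipt total"}, {"price tag", "logo"}),
    ({"date stamp"}, {"date stamp"}),
]

macro_recall = macro_average([case_recall(g, p) for g, p in cases])
macro_precision = macro_average([case_precision(g, p) for g, p in cases])
```

Tracking precision alongside recall makes the abstract's last point concrete: a model that lists more regions can raise recall while the added false positives (such as the spurious "logo" above) lower precision.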

116. 【2606.15976】HadBalance: A Plug-and-Play Unified Global Geometric Prior Framework for Generalizable Biomedical Segmentation

Link: https://arxiv.org/abs/2606.15976

Authors: Zhuangzhi Gao, Feixiang Zhou, He Zhao, Wenhan Chen, Ruiyu Luo, Xin Wang, Hongyi Qin, Zhongli Wu, Yanda Meng, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Precise biomedical image, Precise biomedical, biomedical image segmentation, clinical diagnosis, biomedical image

Comments: Provisionally accepted by the 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026). 11 pages, 3 figures, 2 tables

Abstract:Precise biomedical image segmentation is crucial for clinical diagnosis. Geometric cues (e.g., boundary, shape, and topology) can improve structural consistency, yet most are task-specific and lack a unified geometric foundation that generalizes across organs and modalities. We are motivated by the observation that several medical segmentation targets can be approximated as globally near-convex shapes. A convex region is one in which any two interior points can be connected by a line segment entirely contained within the region. In practice, medical targets may exhibit small local concavities or boundary irregularities; we refer to such globally convex-like shapes as near-convex. Motivated by this, we derive Hadwiger Shape Priors from Hadwiger's theorem as an interpretable global regularizer using three 2D measures: area A, perimeter P, and Euler characteristic chi, enabling transfer across organs and modalities. However, because medical datasets are shape-heterogeneous, enforcing near-convex priors uniformly can over-regularize non-convex anatomy with significant concavities, washing out concavities and fine details and degrading segmentation accuracy. To address this challenge, we propose Conflict-Aware Objective Balancing (CAOB), which integrates shape priors with segmentation in a gradient-aware manner. For each prior, CAOB removes only the gradient component that conflicts with segmentation while preserving the remaining aligned component, and adaptively regulates objective influences to prevent prior dominance. This enables stable use of shape priors on shape-heterogeneous data without erasing genuine concavities or fine structural details. We call this plug-and-play framework HadBalance.
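One common way to remove only the conflicting part of a gradient is a projection in the style of PCGrad. The abstract does not give CAOB's formula, so the sketch below is an assumed stand-in for the "remove the conflicting component, keep the aligned one" step, not the authors' implementation; `remove_conflict` is a hypothetical name, and the adaptive regulation of objective influence is not modelled.

```python
import numpy as np

def remove_conflict(g_prior, g_seg, eps=1e-12):
    """Drop the component of `g_prior` that points against `g_seg`.

    If the two gradients already agree (non-negative dot product), the prior
    gradient is returned unchanged; otherwise it is projected onto the plane
    orthogonal to the segmentation gradient.
    """
    dot = float(np.dot(g_prior, g_seg))
    if dot >= 0.0:
        return g_prior.copy()
    return g_prior - dot / (float(np.dot(g_seg, g_seg)) + eps) * g_seg

g_seg = np.array([1.0, 0.0])      # segmentation-loss gradient (toy values)
g_prior = np.array([-1.0, 1.0])   # shape-prior gradient, partly opposed to g_seg
g_fixed = remove_conflict(g_prior, g_seg)

# The combined update keeps the prior's aligned part without fighting segmentation.
g_total = g_seg + g_fixed
```

After projection the prior gradient no longer has a negative component along the segmentation gradient, so adding the two cannot undo progress on the segmentation objective to first order.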

117. 【2606.15967】CRIS: Cross-Plane Self-Supervised Isotropic Restoration for Anisotropic Volumetric Imaging Across Modalities

Link: https://arxiv.org/abs/2606.15967

Authors: Adi Ahituv, Anat Ilivitzki, Moti Freiman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Anisotropic volumetric acquisitions, sparse through-plane sampling, through-plane sampling creates, sampling creates thick, creates thick slices

Comments: 22 pages, 8 figures, supplementary material included. Submitted to Medical Image Analysis

Abstract:Anisotropic volumetric acquisitions are common in clinical MRI and volume electron microscopy (vEM), where sparse through-plane sampling creates thick slices or sections that degrade orthogonal reformats and downstream analysis. We present CRIS, a cross-plane self-supervised framework for isotropic restoration without paired isotropic ground truth. CRIS casts 3D restoration as 2D stripe completion on orthogonal reformats of an isotropic grid: high-resolution in-plane slices are synthetically degraded and periodically masked for training, while at inference blank slices define the isotropic grid, two orthogonal reformats are restored, and predictions are fused by multi-view averaging. We evaluate CRIS on two MRI cohorts and two microscopy benchmarks up to 8x anisotropy. On brain MRI, CRIS achieves 32.921 +/- 0.436 dB PSNR and 0.9631 +/- 0.0027 SSIM, outperforming interpolation, SMORE4, SIMPLE, SA-INR, and ATME, and gives the best segmentation consistency (Dice 0.940 +/- 0.004, ASSD 0.245 +/- 0.014 mm, HD99 1.275 +/- 0.061 mm). On reference-free abdominal MRI, CRIS reduces FID/KID to 48.714/0.023. On vEM, CRIS outperforms interpolation, NIIV, and vEMINR, reaching 29.133 dB/0.834 3D PSNR/SSIM at 4x, 27.123 dB/0.734 on EPFL at 8x, and 21.915 dB/0.699 on noisy hemibrain data. In a robustness experiment, one variable-gap CRIS model evaluated across gap factors 3--7 and coronal, axial, and sagittal degradations maintained higher PSNR/SSIM than interpolation (36.36--31.14 dB and 0.977--0.932 vs. 33.07--27.85 dB and 0.951--0.853). These results support CRIS as a modality-flexible route to isotropic restoration without paired isotropic targets or configuration-specific retraining. Code is available at this https URL.
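The inference flow described above (blank slices define the isotropic grid, two orthogonal reformats are restored, predictions are averaged) can be sketched in a few lines. Linear interpolation stands in for the learned 2D stripe-completion network, which is an assumption made only so the example runs; all function names are hypothetical.

```python
import numpy as np

def to_isotropic_grid(volume, gap):
    """Place acquired slices on a denser grid along axis 0, with blank slices between."""
    d, h, w = volume.shape
    grid = np.zeros(((d - 1) * gap + 1, h, w), dtype=float)
    known = np.zeros(grid.shape[0], dtype=bool)
    grid[::gap] = volume
    known[::gap] = True
    return grid, known

def fill_rows(image, known):
    """Stand-in for the 2D restoration model: interpolate missing rows column by column."""
    rows = np.arange(image.shape[0])
    out = np.empty(image.shape, dtype=float)
    for col in range(image.shape[1]):
        out[:, col] = np.interp(rows, rows[known], image[known, col])
    return out

def restore(volume, gap):
    """Restore two orthogonal reformats of the striped grid and average them."""
    grid, known = to_isotropic_grid(volume, gap)
    view_a = np.stack([fill_rows(grid[:, y, :], known)
                       for y in range(grid.shape[1])], axis=1)
    view_b = np.stack([fill_rows(grid[:, :, x], known)
                       for x in range(grid.shape[2])], axis=2)
    return 0.5 * (view_a + view_b)   # multi-view averaging

volume = np.arange(12, dtype=float).reshape(3, 2, 2)   # 3 thick slices (toy data)
restored = restore(volume, gap=4)
```

With a learned model in place of `fill_rows`, the two views would generally disagree in the blank stripes, and the averaging step is what reconciles them.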

118. 【2606.15966】VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

Link: https://arxiv.org/abs/2606.15966

Authors: Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: digital human creation, practical multi-view systems, balance rich photometry, human creation, remains challenging

Abstract: Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at this https URL.

119. 【2606.15956】You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Link: https://arxiv.org/abs/2606.15956

Authors: Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Weakly Supervised Learning, Supervised Learning, inductive biases, Learning, Self-Supervised Learning

Abstract:Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale -- and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.

120. 【2606.15938】Learning Directional Semantic Transitions for Longitudinal Chest X-ray Analysis

Link: https://arxiv.org/abs/2606.15938

Authors: Zhangfeng Hu, Zefan Yang, Ge Wang, Tanveer Syeda-Mahmood, Anushree Burade, Mannudeep Kalra, Pingkun Yan

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Chest X-ray, requires longitudinal comparison, interpretation often requires, comparison to assess, Chest

Comments: MICCAI 2026

Abstract:Chest X-ray (CXR) interpretation often requires longitudinal comparison to assess disease progression. Existing approaches typically rely on temporal feature fusion or inter-study discrepancy modeling, yet remain limited in capturing subtle progression semantics and overlook the inherently directional nature of disease trajectories. In this paper, we propose ProTrans, a novel vision-language pretraining framework that formulates disease progression as a directional semantic transition between paired CXR studies. ProTrans leverages radiology reports to anchor individual CXR representations within interpretable disease states, and introduces a learnable progression feature map to explicitly encode semantic shifts between states, aligned with report-derived progression descriptions. To enforce direction-aware perception, ProTrans incorporates a reversed temporal modeling process and imposes bidirectional reconstruction consistency across states and transitions, thereby disentangling directional semantics and promoting coherent trajectory modeling. Extensive experiments on longitudinal downstream tasks, including disease progression classification and progression captioning, demonstrate that ProTrans consistently outperforms existing methods, establishing a unified pretraining framework for longitudinal CXR understanding. this https URL

121. 【2606.15937】GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

Link: https://arxiv.org/abs/2606.15937

Authors: Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fine-Grained Semantic Segmentation, Challenge at ICRA, Semantic Segmentation, Fine-Grained Semantic, Feature Refinement Module

Comments: This solution placed 3rd in the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026

Abstract: We present GOOSE-M2F, a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine-grained classes across unstructured outdoor terrain with a severely long-tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin-Large Mask2Former baseline with three targeted contributions: (1) 200 Object Queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention; and (3) an Auxiliary Supervision Head that delivers direct per-pixel gradients for rare classes. A multi-stage training strategy pairs Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale TTA adds +10.57%. GOOSE-M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available on GitHub (this https URL) and Hugging Face (this https URL).
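Sliding-window inference with Gaussian blending can be sketched as below. The window size, stride, and Gaussian width are assumed values, `predict_fn` is a hypothetical stand-in for the segmentation network, and the 4-scale TTA step is left out.

```python
import numpy as np

def gaussian_window(size, sigma_scale=0.25):
    """2D Gaussian weight map that favours the centre of each window."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / (sigma_scale * size)) ** 2)
    return np.outer(g, g)

def sliding_window_logits(image, predict_fn, num_classes, window=64, stride=32):
    """Blend overlapping window predictions using Gaussian weights.

    `predict_fn(patch)` must return logits of shape (num_classes, window, window).
    """
    h, w = image.shape[:2]
    if h < window or w < window:
        raise ValueError("image must be at least as large as the window")
    # Regular grid of window origins, plus a final window flush with each border.
    ys = sorted(set(range(0, h - window + 1, stride)) | {h - window})
    xs = sorted(set(range(0, w - window + 1, stride)) | {w - window})
    kernel = gaussian_window(window)
    logits = np.zeros((num_classes, h, w))
    weight = np.zeros((h, w))
    for y in ys:
        for x in xs:
            patch = image[y:y + window, x:x + window]
            logits[:, y:y + window, x:x + window] += kernel * predict_fn(patch)
            weight[y:y + window, x:x + window] += kernel
    return logits / weight   # every pixel is covered, so weight > 0 everywhere

def constant_model(patch):
    """Toy model that always predicts class 0 with logit 1."""
    h, w = patch.shape[:2]
    return np.stack([np.ones((h, w)), np.zeros((h, w))])

image = np.zeros((100, 80))
blended = sliding_window_logits(image, constant_model, num_classes=2)
labels = blended.argmax(axis=0)
```

Weighting each window by a Gaussian reduces seams, because predictions near a window's edge, where context is missing, contribute less than predictions near its centre.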

122. 【2606.15924】TurboGS: Accelerating 3D Gaussian Splatting via Error-Guided Sparse Pixel Sampling and Optimization

Link: https://arxiv.org/abs/2606.15924

Authors: Zheng Dong, Daifei Qiu, Pinxuan Dai, Ke Xu, Jiamin Xu, Lili He, Rynson W.H. Lau, Weiwei Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Consumer-level applications require, applications require fast, Consumer-level applications, Gaussian Splatting, require fast optimization

Comments: Accepted by ICML 2026. Project page: [this https URL](https://zhengdong.site/projects/TurboGS/)

Abstract:Consumer-level applications require fast optimization of 3D Gaussian Splatting (3DGS) with high-fidelity novel view rendering. However, existing 3DGS acceleration approaches still incur substantial computation on redundant pixels while sacrificing fine details. In this paper, we present TurboGS, an error-guided training framework that accelerates 3DGS by concentrating optimization on perceptually informative pixels. TurboGS is built upon four core components: (1) a tile-wise sparse pixel sampling, which, driven by multi-view reconstruction errors during training, prioritizes challenging regions and skips well-reconstructed ones to avoid redundant gradient computation; (2) a tile-wise structure-aware loss with sparse Normalized Cross-Correlation, which provides sparse yet effective supervision to preserve fine details and stabilize training; (3) an error-driven Gaussian density control strategy, which dynamically allocates model capacity and removes redundant primitives; and (4) a tailored hybrid optimizer that couples Hessian-informed updates with Adam moment damping to stabilize and improve convergence under sparse supervision. Experiments on standard benchmarks demonstrate that TurboGS can deliver on par or superior rendering quality within 100 seconds on a single RTX 5090 GPU card (up to 10x training speedup over vanilla 3DGS).
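The first component, error-guided tile sampling, might look roughly like the following. Sampling tiles with probability proportional to their mean reconstruction error is an assumption, since the abstract does not specify the sampling rule, and the tile size and keep ratio are illustrative values.

```python
import numpy as np

def tile_errors(error_map, tile=16):
    """Mean per-pixel error inside each non-overlapping tile."""
    h, w = error_map.shape
    th, tw = h // tile, w // tile
    cropped = error_map[:th * tile, :tw * tile]
    return cropped.reshape(th, tile, tw, tile).mean(axis=(1, 3))

def sample_tiles(error_map, tile=16, keep_ratio=0.25, seed=0):
    """Pick a subset of tiles, favouring those with high reconstruction error."""
    errs = tile_errors(error_map, tile)
    probs = errs.ravel() + 1e-8          # small floor so every tile can be drawn
    probs = probs / probs.sum()
    n_keep = max(1, int(round(keep_ratio * probs.size)))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(probs.size, size=n_keep, replace=False, p=probs)
    mask = np.zeros(probs.size, dtype=bool)
    mask[chosen] = True
    return mask.reshape(errs.shape)      # True = tile receives gradient this step

# Toy error map: only the top-left tile is badly reconstructed.
error_map = np.zeros((64, 64))
error_map[:16, :16] = 1.0
tile_mask = sample_tiles(error_map)
```

Tiles left out of the mask would skip the backward pass for that iteration, which is where the saving on well-reconstructed regions would come from under this reading.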

123. 【2606.15920】OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Link: https://arxiv.org/abs/2606.15920

Authors: Zebang Cheng, Shuimu Chen, Boxue Yang, Yuanshen Guan, Jingyi Chen, Zheng Lian, Xiaojiang Peng, Fei Ma, LaiZhong Cui, Qi Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex reasoning tasks, severe reward sparsity, multimodal large language, large language models, Reinforcement learning

Abstract:Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human--AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of $84.19$, and ablations further support the value of rationale-privileged teacher guidance.

124. 【2606.15908】High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Link: https://arxiv.org/abs/2606.15908

Authors: Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hand-object interaction, pre-scanned object templates, demand for high-fidelity, data in embodied, growing demand

Comments: Project page: [this https URL](https://zyshen021.github.io/HOSTPG/)

Abstract: The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at this https URL.

125. 【2606.15889】SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

Link: https://arxiv.org/abs/2606.15889

Authors: Adi Rosenthal, Tomer Koren, Nadav Shaked, Doron Friedman, Ariel Shamir

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: impressive rhythmic synchronization, achieved impressive rhythmic, unique non-verbal style, non-verbal style remains, speaker unique non-verbal

Abstract: While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific "active joints" conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the "Frankenstein" artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.

126. 【2606.15886】Text region detection in historical astronomical diagrams

Link: https://arxiv.org/abs/2606.15886

Authors: Zeynep Sonat Baltacı, Raphaël Baena, Fei Meng, Somkéo Norindr, Florence Somer, Matthieu Husson, Mathieu Aubry

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text detection, crucial task, Text, historical documents, text regions

Abstract:Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, the study of text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from symbols to multi-line paragraphs. Each text instance is annotated with ordered polygons that precisely delineate text regions and encode the reading direction. In addition, we annotated the 2,293 regions in Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we design to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. Code and dataset available online.

127. 【2606.15880】Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.15880

Authors: Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Yue Zhou, Caiyong Piao, Bin Li, Taiping Yao, Bo Wang, Youchang Xiao, Shouhong Ding

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, Multimodal large, large language models, large language, increasingly adopted

Comments: Accepted at ICML 2026

Abstract:Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become more realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge. We further perform a layer-wise analysis of forensic signal perception in MLLMs, showing that semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM) to preserve early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through subsequent trainable layers. This enables later layers to jointly model semantic reasoning and signal-level forensic cues, and surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art performance on most benchmarks. The code and data are available at this https URL.

128. 【2606.15869】Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

Link: https://arxiv.org/abs/2606.15869

Authors: Jingyu Li, Zhe Liu, Dongnan Hu, Junjie Wu, Zipei Ma, Wenxiao Wu, Chao Han, Zhihui Hao, Zhikang Liu, Kun Zhan, Jiankang Deng, Xiatian Zhu, Li Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: World action models, shown great promise, World action, video generation, shown great

Abstract:World action models (WAMs) have shown great promise for autonomous driving and urban navigation. Built upon Vision-Language-Action models or video generation models, existing approaches suffer from two key limitations: (1) high inference latency due to future observation prediction at test time, and (2) tightly coupled video and action modeling leading to representational mismatch and degraded generalization. To address both issues, we propose Metis, an end-to-end WAM framework that decouples video generation and action prediction. Specifically, Metis employs a Mixture-of-Transformers architecture with dedicated experts for video generation and action prediction, preserving the intrinsic distributional properties of each task. To enhance efficiency, we introduce an asymmetric attention mask that enables joint training of both experts while allowing the action model to bypass explicit video generation during inference. This design ensures training-inference consistency and significantly reduces computational costs without compromising planning performance. Extensive experiments demonstrate state-of-the-art performance on the NAVSIM navhard and navtest benchmarks and the CityWalker navigation benchmark, validating both the generalizability and efficiency across diverse tasks. Real-robot deployments further confirm the practical feasibility of our approach.
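
One way to picture an asymmetric attention mask of the kind the abstract describes is sketched below. The three token types (`obs`, `video`, `action`) and the visibility rules in `ALLOWED` are assumptions made for illustration; the abstract does not specify the actual layout or rules used in Metis.

```python
# Hypothetical visibility rules: which key types each query type may attend to.
# Video tokens can see action tokens, but action tokens never see video tokens.
ALLOWED = {
    "obs": {"obs"},
    "video": {"obs", "video", "action"},
    "action": {"obs", "action"},
}


def build_mask(token_types):
    """Return mask[q][k] == True when query token q may attend to key token k."""
    return [[k in ALLOWED[q] for k in token_types] for q in token_types]


def visible_types(token_types, query_index):
    """List the types of the tokens that the given query token can attend to."""
    row = build_mask(token_types)[query_index]
    return [t for t, ok in zip(token_types, row) if ok]


# Training layout includes future-video tokens; the inference layout drops them.
train_layout = ["obs", "obs", "video", "video", "action"]
infer_layout = ["obs", "obs", "action"]

train_mask = build_mask(train_layout)
action_view_train = visible_types(train_layout, 4)
action_view_infer = visible_types(infer_layout, 2)
```

Under these assumed rules the action token attends to the same set of tokens whether or not video tokens are present, which is the property that would let an action branch skip video generation at inference without a train/test mismatch.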

129. 【2606.15867】CogCanvas: A Benchmark for Evaluating Multi-Subject Reference-Based Image Generation

Link: https://arxiv.org/abs/2606.15867

Authors: Long-Bao Nguyen, Quang-Khai Tran, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving multiple human, requires jointly preserving, jointly preserving multiple, multiple human identities, Multi-subject reference-based image

Abstract:Multi-subject reference-based image generation requires jointly preserving multiple human identities, binding per-person objects and fashion items, and respecting a specified background scene, a regime where current diffusion models remain brittle. Existing benchmarks evaluate only one axis at a time and none jointly captures multi-identity composition with human-object interaction, background grounding, and spatial plausibility. We introduce CogCanvas, a benchmark of 1,952 curated reference images spanning 100 celebrity identities, 115 distinctive objects and fashion items, and 29 real-world background scenes including landmarks, from which we construct 1,361 compositional prompts covering 2-5 person group sizes. The curation pipeline combines DINOv2-based deduplication, two-stage aesthetic filtering, and automated derivation of structured interaction and position graphs that serve as ground-truth supervision. CogCanvas supports three tasks, reference-based multi-human-object generation (primary), text-to-image compositional generation, and reference retrieval, under a unified six-axis evaluation protocol. We introduce two metrics tailored to the multi-reference setting: BG-Sim, which scores background fidelity on SAM 3-masked regions via DINOv3 feature similarity, and Attr-VQA, which uses a multimodal LLM to verify per-subject attribute binding and inter-person interactions against the structured graphs. Benchmarking five SOTA methods reveals that every model degrades substantially as group size grows from 2 to 5, with near-complete failure on object/fashion binding beyond three subjects.

130. 【2606.15861】Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

Link: https://arxiv.org/abs/2606.15861

Authors: Yiping Li, Ronald de Jong, Romy van Jaarsveld, Franco Badaloni, Gino Kuiper, Jelle Ruurda, Josien Pluim, Marcel Breeuwer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Question Answering, support surgical training, requires high-level understanding, Question Answering, robotic surgery

Abstract:Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are further projected to the SAM-based decoder to produce segmentation masks. By optimizing the object token embeddings through both segmentation and question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results demonstrate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.

131. 【2606.15857】A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

Link: https://arxiv.org/abs/2606.15857

Authors: Liyuan Cao, Zheng Liu, Guanghao Liao, Yonghui Yang, Qi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: wavelength dependent light, dependent light absorption, Due to wavelength, underwater image enhancement, limits underwater object

Abstract:Due to wavelength-dependent light absorption and scattering, underwater images usually suffer from color distortion and blurred details, which limits underwater object detection performance. Existing underwater image enhancement methods mainly focus on visual quality improvement, and it remains difficult to balance enhancement quality, processing efficiency, and downstream detection performance. Therefore, this paper proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail enhancement branch improves brightness and local contrast to recover texture details in dark regions. The color restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the proposed framework provides clearer and more informative images for object detection. On the UIEB and EUVP datasets, the proposed method achieves UIQM scores of 2.249 and 2.576, respectively. When applied to the YOLOv8 detection task on the URPC dataset, the proposed method improves mAP50 by 2.1% compared with the baseline. Extensive experiments show that our method improves object detection in complex underwater scenes, while balancing enhancement quality and processing efficiency.

132. 【2606.15848】EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

Link: https://arxiv.org/abs/2606.15848

Authors: Tingting Chen, Shaojun Wang, Huaye Zhang, Diqiong Jiang, Chenglizhao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: talking head synthesis, shown strong potential, high-fidelity talking head, Gaussian Splatting, head synthesis

Abstract:3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

133. 【2606.15837】Learning a Sampling-Free Variational DNN Plugin from Tiny Training Sets to Refine OOD Segmentation With Uncertainty Estimation

Link: https://arxiv.org/abs/2606.15837

Authors: Jimut B. Pal, Suyash P. Awate

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Keywords: Deep neural networks, Deep neural, neural networks, frequently fail, acquisition protocols

Comments: Accepted at the Journal of Machine Learning for Biomedical Imaging

Abstract:Deep neural networks (DNNs) frequently fail to generalize to out-of-distribution (OOD) medical images because of variations in scanners and acquisition protocols. Retraining DNN models to address these distribution shifts is often impractical due to the high cost of acquiring and annotating new medical datasets. To address this, we introduce VarDeepPCA, a novel lightweight variational DNN framework designed to restore/refine degraded segmentation maps by leveraging intrinsic geometric priors. Unlike existing approaches that require target-domain data or extensive pre-training, our VarDeepPCA explicitly learns a distribution of valid anatomical geometries using only small in-distribution (ID) datasets. Theoretically, our novel variational learning framework leverages a reinterpretation of the softmax mapping to implicitly perform exact distribution modeling, thereby enabling computationally efficient, sampling-free learning and inference. This also enables VarDeepPCA to provide uncertainty estimates associated with its restored segmentation maps. We empirically validate our framework across 4 distinct clinical applications, using 14 publicly available datasets, involving segmentation of the myocardium, neuroretinal rim, prostate, and fetal head. Comparisons against 15 existing methods demonstrate that VarDeepPCA consistently restores segmentation maps produced by the existing methods on OOD data to (i) significantly improve anatomical plausibility of geometries and clinical utility of the segmentations, and (ii) significantly reduce errors, without needing any more training data than that used by existing methods.

134. 【2606.15819】SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

Link: https://arxiv.org/abs/2606.15819

Authors: Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: frontier for high-fidelity, generated content, rapid progress, unlocked a transformative, transformative frontier

Abstract:The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective to prevent high-entropy sampling degeneration, alongside a restorative preservation loss to safely anchor the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, resolving the critical safety vulnerabilities inherent in emerging VAR architectures in a timely and elegant manner. Code is available at: this https URL.

135. 【2606.15802】CPS4: Class Prompt driven Semi-Supervised Spine Segmentation with Class-specific Consistency Constraint

Link: https://arxiv.org/abs/2606.15802

Authors: Qingtao Pan, Hongzan Sun, Bing Ji, Shuo Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Model, Vision Language, Language Model, class prompts, semi-supervised spine segmentation

Abstract:Vision Language Models (VLMs) have great potential to enhance the quality of pseudo labels in semi-supervised spine segmentation by leveraging textual class prompts to generate segmentation maps, but this direction has not yet been studied. Although promising, the approach lacks explicit constraints to ensure consistency between spine class prompts and spine unit regions, resulting in unsatisfactory performance in multi-class segmentation map generation. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network using class prompts to enhance the quality of spine pseudo labels. Specifically, CPS4 is implemented through two training stages. (i) Class-specific consistency constrained VLM pretraining stage: we propose token- and pixel-level attention losses to optimize the consistency between class prompts and spine units, forcing the textual class prompt to be closely coupled with the target spine unit in the semantic space. (ii) Class Prompt driven semi-supervised spine segmentation stage: using the pretrained vision-text encoder, we derive each class-specific binary segmentation map for the unlabeled spine image and integrate them into a unified multi-class segmentation map, improving the quality of the spine pseudo label generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves superior spine segmentation performance, with a Dice score of 80.44% using only 5% labeled data on the public spine segmentation dataset, surpassing popular semi-supervised learning and VLM methods. Our code will be made available.

136. 【2606.15796】DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

Link: https://arxiv.org/abs/2606.15796

Authors: Artyom Mazur, Nina Konovalova, Aibek Alanov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Mechanistic interpretability seeks, explain neural network, Mechanistic interpretability, neural network behavior, interpretability seeks

Abstract:Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at this https URL

137. 【2606.15786】Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

Link: https://arxiv.org/abs/2606.15786

Authors: Aniq Ahmad, Heather Bedle, Ahmad Mustafa

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)

Keywords: large pretrained foundation, visual data interpretation, advent of large, large pretrained, computer vision

Abstract:The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero-shot segmentation capabilities through prompt-based interaction, making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine-tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model's generalization capability. In this study, we introduce a principled framework for zero-shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user-defined point prompts with dense mask prompts derived from SAM's internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic-target-aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point-based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero-shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.

138. 【2606.15782】Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

Link: https://arxiv.org/abs/2606.15782

Authors: Pratheswaran Hariharan, Haiping Xu, Donghui Yan

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong capabilities, demonstrated strong, strong capabilities, capabilities in vision-language, natural-language response generation

Comments: 28 pages, 9 figures

Abstract:Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, while providing limited mechanisms to quantify instance-level prediction reliability or identify incorrect visual outputs. This work proposes a retrieval-augmented reliability-aware inference framework for trustworthy multimodal visual understanding. The proposed framework constructs an external visual evidence database using pretrained visual embeddings and nearest-neighbor retrieval over normalized feature representations. Retrieved evidence is used to estimate prediction trustworthiness through multiple reliability indicators, including similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain/fallback when evidence is insufficient. A multimodal response-generation layer then produces a final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 demonstrate that the proposed reliability-aware framework improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage. The hallucination-like accepted wrong-answer rate is reduced from 14.16% to 11.12%. These results show that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.
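
The accept / caution / abstain gate and the coverage and accepted-accuracy figures can be illustrated with a minimal sketch. The thresholds, function names, and toy scores below are hypothetical and are not taken from the paper. Note that the reported wrong-answer rates are simply the complements of the accepted accuracies (100 - 85.84 = 14.16 and 100 - 88.88 = 11.12).

```python
def gate(score, accept_at=0.7, caution_at=0.4):
    """Three-way decision on an aggregate reliability score (thresholds are made up)."""
    if score >= accept_at:
        return "accept"
    if score >= caution_at:
        return "caution"
    return "abstain"


def selective_metrics(scores, correct, accept_at=0.7):
    """Return (coverage, accuracy over accepted predictions only)."""
    accepted = [c for s, c in zip(scores, correct) if s >= accept_at]
    coverage = len(accepted) / len(scores)
    accuracy = sum(accepted) / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Toy reliability scores and correctness flags, for illustration only.
scores = [0.9, 0.8, 0.75, 0.5, 0.2]
correct = [True, True, False, True, False]

decisions = [gate(s) for s in scores]
coverage, accepted_accuracy = selective_metrics(scores, correct)
accepted_wrong_rate = 1.0 - accepted_accuracy
```

Raising `accept_at` trades coverage for accepted accuracy, which is the trade-off behind reporting accuracy at a stated coverage level.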

139. 【2606.15779】Faithful Action-unit Causal Reasoning for Counterfactually Faithful Emotion Explanations

Link: https://arxiv.org/abs/2606.15779

Authors: Van Thong Huynh, Hong Hai Nguyen, Thuy Pham, Trong Nghia Nguyen, Soo-Hyung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multimodal models, structural AU-emotion causal, AU-emotion causal graph, typically plausible, structure marks causal

Abstract:Multimodal models can name the action units (AUs) behind a facial emotion, but their AU-emotion rationales are typically plausible rather than faithful: nothing forces the AUs a model invokes to be the AUs that actually drive its prediction. We cast AU-emotion reasoning as a counterfactual-consistency problem between the rationale, the label, and a structural AU-emotion causal graph G, and propose FACR, which grounds the reasoner in an independently induced, polarity-aware G and trains a counterfactual-faithfulness objective: a do-intervention on an AU that G marks causal for a class must move the prediction, while one it marks irrelevant must leave it unchanged. Faithfulness is thereby both trainable and measurable through a matching interventional metric, which we evaluate against a known causal structure, the PSPI pain-AU composition, as no existing affective-reasoning benchmark allows. We are explicit that this metric tests fidelity to the supplied structure rather than its rediscovery: it asks whether the trained reasoner invokes the AUs the structure marks causal, on held-out subjects and a second dataset. Under subject-independent evaluation on UNBC-PAIN, the objective raises the agreement between the invoked AUs and the PSPI composition from a no-objective baseline of 0.08 to 0.57, at a small detection cost; an unfaithfulness control attributes the gain to the objective. On a cross-dataset emotion transfer, the objective likewise raises fidelity to G on a seven-class task (0.50 to 0.84). Finally, we attach a language verbalizer and extend the audit to the generated text: biasing each action unit's emission by its latent activation makes the rationale faithful by construction, so that ablating an AU removes it from the explanation, a property that transfers to a second language-model backbone, whereas a freely generated rationale is unfaithful.

140. 【2606.15772】Ellipse Meets Bit-Planes: A Novel Approach to RNFL based Glaucoma Detection Using Advanced Image Processing and Deep Learning

Link: https://arxiv.org/abs/2606.15772

Authors: Snigdha Paul, Sambit Mallick, Anindya Sen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Nerve Fiber Layer, Retinal Nerve Fiber, Fiber Layer, colour fundus images, Retinal Nerve

Abstract:This work proposes an integrated pipeline for automatic glaucoma detection from readily available colour fundus images, based on an adaptive algorithm for ellipse-based polar transformation that enhances the analysis of the Retinal Nerve Fiber Layer (RNFL) as the primary biomarker for observing glaucomatous changes, regardless of optic disc and macula position. Utilizing this transformation, we introduce two distinct frameworks tailored to different operational needs. The first framework, a deep learning-inspired feature fusion approach, achieves a 99.3% detection rate, ideal for settings where high precision is essential, despite higher computational demands. The second framework employs a novel image-processing algorithm based on bit-plane slicing, offering 92.31% accuracy and optimized for environments requiring rapid inference with minimal resource consumption. Both frameworks provide scalable and cost-effective solutions for early glaucoma detection. This study highlights the potential of RNFL-based diagnostic tools in addressing the global challenge of glaucoma, particularly in underserved regions.
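
For readers unfamiliar with bit-plane slicing, the sketch below shows the generic operation on an 8-bit grayscale image stored as nested lists. It illustrates only the standard decomposition, not the authors' detection algorithm; the function names and the tiny image are made up.

```python
def bit_plane(image, k):
    """Return bit-plane k (0 = least significant bit) of an 8-bit grayscale image."""
    if not 0 <= k <= 7:
        raise ValueError("k must be in 0..7 for 8-bit pixels")
    return [[(pixel >> k) & 1 for pixel in row] for row in image]


def reconstruct(planes):
    """Rebuild the image from a list of planes ordered from bit 0 upward."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [
        [sum(planes[k][r][c] << k for k in range(len(planes))) for c in range(cols)]
        for r in range(rows)
    ]


# Tiny made-up image; real fundus images would be loaded with an imaging library.
image = [[0, 255], [128, 37]]
planes = [bit_plane(image, k) for k in range(8)]
msb = planes[7]
lsb = planes[0]
```

The high-order planes carry most of the coarse structure while the low-order planes mostly hold fine detail and noise, which is why methods built on this decomposition tend to be cheap: they need only shifts and masks.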

141. 【2606.15765】Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning

Link: https://arxiv.org/abs/2606.15765

Authors: Donghyun Han, Yuseok Bae, Jung Uk Kim, Hyung-Il Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong robustness, robustness and transferability, wide range, Vision foundation models, demonstrated strong

Comments: 17 pages, 6 figures

Abstract:Vision foundation models (VFMs) have demonstrated strong robustness and transferability across a wide range of visual tasks. However, each model typically encodes strong inductive biases shaped by its pre-training objective and data domain, resulting in fragmented yet complementary visual knowledge. As a result, a single model often struggles to capture the diverse visual representations required across multiple dense prediction tasks. To address this limitation, we propose TIGER (Task-Instruction-Guided Expert Routing), a framework that coordinates multiple heterogeneous VFMs for multi-task dense prediction. Instead of naively aggregating expert features, TIGER leverages natural-language task instructions to guide a routing network that assigns token-level expert weights conditioned on task semantics, enabling adaptive integration of complementary expert features. TIGER further introduces a counterfactual loss that aligns routing decisions with each expert's causal contribution by measuring prediction changes when experts are excluded, encouraging more reliable and interpretable routing. We evaluate TIGER on two multi-task dense prediction benchmarks, NYUD-v2 and Pascal Context, where it consistently outperforms recent multi-task learning baselines while keeping all VFMs frozen. These results demonstrate that combining instruction-guided expert routing with counterfactual causal alignment enables effective coordination of heterogeneous vision foundation models.
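
The idea of token-level expert weights plus an "exclude one expert and measure the change" signal can be sketched as follows. This is an illustration under strong assumptions: the routing logits are given directly rather than produced from a task instruction, the mixed feature vector stands in for the model's prediction, and all names are hypothetical rather than TIGER's actual interface.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(expert_features, routing_logits, exclude=None):
    """Mix per-expert feature vectors for one token, optionally dropping one expert."""
    kept = [i for i in range(len(expert_features)) if i != exclude]
    weights = softmax([routing_logits[i] for i in kept])
    dim = len(expert_features[0])
    return [
        sum(w * expert_features[i][d] for w, i in zip(weights, kept))
        for d in range(dim)
    ]


def counterfactual_effect(expert_features, routing_logits, expert):
    """Distance between the full mixture and the mixture with one expert excluded."""
    full = route_token(expert_features, routing_logits)
    ablated = route_token(expert_features, routing_logits, exclude=expert)
    return math.dist(full, ablated)


# Three toy experts; experts 0 and 2 are redundant, expert 1 is unique.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
logits = [0.0, 0.0, 0.0]
mixed = route_token(features, logits)
effects = [counterfactual_effect(features, logits, i) for i in range(3)]
```

In this toy case the unique expert has the largest exclusion effect even though all three receive equal routing weight. A counterfactual objective of the kind the abstract describes would push the routing weights toward agreeing with such effects.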

142. 【2606.15763】The Circumplex Degeneracy Behind the Rare-Class Limit in Affect Recognition

Link: https://arxiv.org/abs/2606.15763

Authors: Van Thong Huynh, Hong Hai Nguyen, Soo-Hyung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recognition persistently fails, expression recognition persistently, class imbalance, recognition persistently, persistently fails

Abstract:In-the-wild expression recognition persistently fails on a few rare emotions, and the standard explanation is class imbalance. Through a controlled multi-task study on two benchmarks, we show the failure is instead a property of affect geometry: the rare classes are degenerate on Russell's circumplex, and that degeneracy bounds what any loss or cost can achieve. Our instrument is a circumplex-cost optimal-transport term that prices expression confusions by their valence-arousal distance. The term improves the official score and expression macro-F1, but a control most studies omit shows the gain is not geometric: a uniform cost, equivalent to a generic confidence penalty, matches it on Aff-Wild2 (p=0.625) and significantly exceeds it on AffectNet (+0.057 over base, larger than the circumplex). What the geometry reshapes is the structure of the errors, making them affectively nearer the truth on Aff-Wild2 (p=0.031 against the uniform control), an effect that does not survive on AffectNet, where a visual confound at the far corner of the circumplex overwhelms it. The rare-class failure, by contrast, is stable across both datasets we examine: the degenerate pairs (anger-fear on Aff-Wild2, anger-contempt on AffectNet) resist frequency-based interventions, the transport term, and an action-unit-augmented cost built specifically to separate them. We conclude that progress on rare expressions requires representations that distinguish the classes, not supervision that reprices their confusions, and we provide the controls and metrics needed to tell the two apart.
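
A small sketch can make the contrast between a circumplex cost and the uniform-cost control concrete. The valence-arousal coordinates below are invented for illustration, as are the function names; they are not the values or code used in the paper. With a one-hot target, the optimal-transport cost reduces to the expected cost of the predicted distribution, which is what `transport_cost` computes.

```python
import math

# Hypothetical (valence, arousal) coordinates, chosen only to illustrate
# two classes that sit almost on top of each other on the circumplex.
VA = {
    "happy": (0.8, 0.4),
    "sad": (-0.7, -0.4),
    "anger": (-0.6, 0.7),
    "fear": (-0.6, 0.8),
}


def circumplex_cost(a, b):
    """Price a confusion by the valence-arousal distance between the two classes."""
    return math.dist(VA[a], VA[b])


def uniform_cost(a, b):
    """Control: every confusion costs the same, like a generic confidence penalty."""
    return 0.0 if a == b else 1.0


def transport_cost(probs, target, cost):
    """Transport cost from a predicted distribution to a one-hot target."""
    return sum(p * cost(label, target) for label, p in probs.items())


# A prediction that mistakes anger for fear most of the time.
probs = {"happy": 0.1, "sad": 0.1, "anger": 0.2, "fear": 0.6}
geometric = transport_cost(probs, "anger", circumplex_cost)
uniform = transport_cost(probs, "anger", uniform_cost)
```

Because the two assumed coordinates nearly coincide, the circumplex cost charges almost nothing for the anger-fear confusion, while the uniform cost charges it in full. This is one way to read the abstract's claim that a degenerate pair cannot be separated by repricing confusions alone.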

143. 【2606.15749】OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

Link: https://arxiv.org/abs/2606.15749

Authors: Maonan Wang, Zhengyan Huang, Kemou Jiang, Yuhang Fu, Jiayue Zhu, Yuxin Cai, Xingchen Zou, Qiaosheng Zhang, Yi Yu, Ding Wang, Xi Chen, Ben M. Chen, Yuxuan Liang, Zhiyong Cui, Man On Pun, Yirong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Keywords: including lane topology, scene understanding requires, understanding requires models, reason beyond object, Traffic

Comments: 34 pages, 28 figures

Abstract:Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view--BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human--model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.

144. 【2606.15694】MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

Link: https://arxiv.org/abs/2606.15694

Authors: Hangling Xie

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: large language models, complex multimodal content, demonstrated remarkable capabilities, Multimodal large language, understanding complex multimodal

Abstract:Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.
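
The retrieval and voting steps described above can be sketched roughly as follows. The modality names, the dictionary layout, and the function names are assumptions made for illustration. In the paper the fusion weights come from a trained coefficient network; here they are simply passed in as a plain dictionary.

```python
from collections import Counter


def fuse_scores(sims, weights):
    """Weighted sum of per-modality similarity scores for one candidate."""
    return sum(weights[m] * sims[m] for m in weights)


def retrieve_top_k(candidates, weights, k):
    """Rank candidate demonstrations by fused similarity and keep the top k."""
    ranked = sorted(
        candidates,
        key=lambda c: fuse_scores(c["sims"], weights),
        reverse=True,
    )
    return ranked[:k]


def majority_vote(labels):
    """Most frequent label among several sampled model outputs."""
    return Counter(labels).most_common(1)[0][0]
```

Because the weights are an argument rather than a constant, the same candidate pool can be ranked differently for each query. That is the difference from fixed-weight fusion that the abstract emphasizes.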

145. 【2606.15685】Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

Link: https://arxiv.org/abs/2606.15685

Authors: Shuaike Zhang, Shaokun Wang, Haoyu Tang, Jianlong Wu, Liqiang Nie

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied Continual Learning, Embodied Continual, Continual Learning, retaining previously learned, previously learned behaviors

Comments: 13 pages, 5 figures

Abstract:Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: this https URL.

146. 【2606.15681】3D Consistency Optimization for Self-Supervised Monocular Video Depth Estimation

Link: https://arxiv.org/abs/2606.15681

Authors: Yuanye Liu, Ke Zhang, Junzhe Jiang, Li Zhang, Vishal Patel, Xiahai Zhuang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reliable monocular video, Reliable monocular, monocular video depth, crucial for downstream, reasoning and embodied

Abstract:Reliable monocular video depth estimation is crucial for downstream 3D reasoning and embodied AI in endoscopic navigation. However, existing self-supervised approaches typically treat video frames independently or rely on weak temporal regularization. These methods, lacking a holistic perception of the underlying 3D scene, inevitably suffer from geometrically inconsistent predictions and severe cross-frame drift. To address these limitations, we introduce a new paradigm that recasts sequential video depth estimation as an unconstrained multi-view 3D reconstruction problem, enabling full exploitation of the powerful geometric priors embedded in recent 3D foundation models. The core of our approach is a 3D consistency optimization framework driven by three constraints: image-level photometric rendering, explicit world-coordinate geometric alignment, and multi-scale temporal gradient consistency. Such unified optimization elegantly anchors isolated frames to a globally coherent 3D structure. Our method has been validated in both the self-supervised training scenarios and challenging zero-shot clinical environments. Results show that the proposed approach achieves state-of-the-art spatial accuracy, outperforming the frame-based, video-based depth estimators and the multi-view 3D reconstruction baselines.

147. 【2606.15667】CEVAR: Centerline Embedding Extraction for Endovascular Aneurysm Repair

Link: https://arxiv.org/abs/2606.15667

Authors: Roman Naeem, Timo Niiniskorpi, Charlotte Sandström, Naman Desai, Anders Jeppsson, Ida Häggström, Fredrik Kahl, Håkan Roos, Jennifer Alvén

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Long-term mortality rates, endovascular aneurysm repair, remain elevated due, post-EVAR rupture caused, Long-term mortality

Comments: Submitted Version. Accepted at MICCAI 2026

Abstract:Long-term mortality rates after endovascular aneurysm repair (EVAR) remain elevated due to post-EVAR rupture caused by loss of seal in stent graft sealing zones. Structured CT review using centerline measurements improves detection, but current workflows require manual centerline editing and expert operators. We propose a transformer framework for automated, protocol-driven sealing zone assessment that combines 3D centerline tracking with embedding-based geometric prediction. Two state-of-the-art image-to-graph models are evaluated for aorto-iliac centerline extraction from follow-up CT and for measurement of stent position, vessel diameters, and seal lengths according to EVAR4C protocol. Across the full test set and a challenging no-contrast subset, the proposed fully automatic method outperforms the commercial semi-automatic workflow.

148. 【2606.15663】OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

Link: https://arxiv.org/abs/2606.15663

Authors: Jiali Wen, Hongxia Gao, Litao Li, Yixin Chen, Kaijie Zhang, Qianyun Liu, Xiaoqin Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: conventional detectors struggle, emerging contraband types, X-ray contraband detection, lack fundamental visual, logistics and transportation

Comments: 17 pages, 10 figures

Abstract:X-ray contraband detection is critical for security in large-scale logistics and transportation, yet conventional detectors struggle to adapt to emerging contraband types and lack fundamental visual understanding. Vision-language models (VLMs) offer strong generalization but are hindered by the scarcity of high-quality X-ray image-caption data. To bridge this critical gap, we present MMXray, a meticulously curated benchmark of 52,124 image-caption pairs spanning 28 fine-grained classes of X-ray contraband. To enrich MMXray with realistic occlusion patterns, we further introduce CleanDET, a dedicated synthesis dataset containing clean foreground contraband images from 28 categories and background images with diverse density levels, together with AnyContraSyn, a controllable synthesis method designed to operate on CleanDET. We also develop OnePipe, an extensible pipeline for systematic data curation. Built on MMXray, we propose OneFocus, a unified VLM that supports four core tasks: visual question answering, contraband localization, classification, and image understanding. OneFocus achieves state-of-the-art performance in X-ray contraband understanding and demonstrates robust cross-domain generalization, establishing a strong vision-language baseline for security screening.

149. 【2606.15659】SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

Link: https://arxiv.org/abs/2606.15659

Authors: Yiran Wang, Zeyu Zhang, Yuanming Li, Ziming Wang, Yang Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: head avatars, central to telepresence, digital-human interaction, portraits are central, Gaussian Splatting

Abstract:High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K--600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR at up to 60x shorter per-subject schedule than common SOTA baselines. Website: this https URL.

150. 【2606.15651】Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Link: https://arxiv.org/abs/2606.15651

Authors: Saraswathy Amjith

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: require chaining multiple, chaining multiple steps, Relative Policy Optimization, Group Relative Policy, images and text

Abstract:Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions, it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% vs. 46.8%). We introduce the first self-questioning VLM by rewarding not only the final answer like standard RL but additionally for generating intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
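
A minimal sketch of the reward and the group-relative advantage follows. The reward weights, the `"Sub-question"` marker used to detect decomposition, and the function names are all hypothetical, since the abstract does not specify them. The advantage follows the usual GRPO recipe of standardizing rewards within a group of rollouts for the same prompt; implementations differ on details such as population versus sample standard deviation.

```python
import statistics


def reward(output, final_answer, gold, w_subq=0.5, w_correct=1.0):
    """Hypothetical reward: a bonus for sub-questions plus a bonus for correctness."""
    has_subq = "Sub-question" in output  # assumed output marker, not from the paper
    correct = final_answer.strip().lower() == gold.strip().lower()
    return w_subq * has_subq + w_correct * correct


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that both decompose the question and answer correctly receive the largest advantage in their group. If every rollout earns the same reward, all advantages are zero and that group contributes no learning signal.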

151. 【2606.15648】Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

Link: https://arxiv.org/abs/2606.15648

Authors: Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse water-medium conditions, including color bias, low contrast, water-medium conditions, leading to complex

Abstract:The underwater images are captured within diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur effect. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most of the previous work focus on the training strategy or network design to make the enhanced result aligned well with the labels in datasets, ignoring that the labels are selected from the enhanced results of previous UIE methods and these pseudo-labels are noisy. Consequently, the performance of their models is not satisfactory to a certain extent. However, collecting the true labels of the underwater images is challenging. In this work, we propose a transfer learning-based UIE that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression following the underwater physics. Then multiple types of prior from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE is available via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal based on physics and priors fusion achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: this https URL.

152. 【2606.15647】Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

Link: https://arxiv.org/abs/2606.15647

Authors: Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: demonstrated impressive performance, Foundation models, enhancing healthcare efficiency, models have demonstrated, demonstrated impressive

Comments: 19 pages, 9 figures

Abstract:Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at this https URL.

153. 【2606.15632】Open-World Video Segmentation

Link: https://arxiv.org/abs/2606.15632

Authors: Qing Su, Kaiyang Li, Yuan Zhuang, Fei Miao, Shihao Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, segmentation remains largely, video segmentation, video segmentation remains, open-world video segmentation

Abstract:While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos of dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and structural diagnostics: identity persistence (IP), and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks: ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.

154. 【2606.15629】XPASS-Vis: A Dataset for Cross-Domain Personalized Image Aesthetic Assessment

Link: https://arxiv.org/abs/2606.15629

Authors: Takato Hayashi, Hiroaki Takahara, Candy Olivia Mawalim, Hiromi Narimatsu, Akisato Kimura, Shiro Kumano, Shogo Okada

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image aesthetic assessment, cross-domain PIAA, individual level, artworks and photographs, Personalized image aesthetic

Abstract:Personalized image aesthetic assessment (PIAA) seeks to model, at the individual level, the subjective nature of aesthetic judgments toward artworks and photographs. Aesthetic preference is known to be both deeply personal and partially consistent across visual domains. Yet existing PIAA datasets and methods are largely confined to a single domain, or provide too few samples per annotator within each domain to enable personalization across domains. Consequently, the cross-domain generalization of personalized aesthetic preferences remains largely unexplored. To address this gap, we introduce XPASS-Vis, the first dataset explicitly designed for cross-domain PIAA. XPASS-Vis comprises 6,526 stimuli from three visual domains (art, fashion, and landscape) rated by 129 annotators, yielding 87,836 user-stimulus interactions, each annotated with an overall aesthetic score and nine aesthetic-emotion ratings. Notably, each annotator rated more than 200 stimuli per domain, providing sufficient per-domain coverage to support personalization both within and across domains. Moreover, we establish baseline models for cross-domain PIAA under unsupervised domain adaptation (UDA), where a model trained on a labeled source domain is transferred to an unlabeled target domain. A systematic evaluation of representative UDA approaches shows that the best-performing method recovers approximately 60% (Spearman's $\rho$ = .28) of the supervised upper bound under a fully unsupervised setting. This provides encouraging evidence that personalized aesthetic preferences are, to a meaningful extent, transferable across visual domains. At the same time, a substantial gap remains, highlighting the need for PIAA-specific adaptation strategies. XPASS-Vis and the accompanying baselines provide a foundation for future research on cross-domain PIAA. All datasets and code will be made publicly available upon acceptance.

155. 【2606.15617】NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Link: https://arxiv.org/abs/2606.15617

Authors: Hongxi Yang, Yiwen Jiang, Siyuan Yan, Jamie Chow, Eunis Li, Charlotte Poon, Stephanie Fong, Xiangyu Zhao, Deval Mehta, Yasmeen George, Zongyuan Ge

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: trustworthy medical image, medical image diagnosis, Concept Bottleneck Models, essential for trustworthy, trustworthy medical

Abstract:Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts at inference time and for manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

156. 【2606.15615】MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers

Link: https://arxiv.org/abs/2606.15615

Authors: Maoliang Li, Haojing Chen, Jiayu Chen, Zihao Zheng, Xinhao Sun, Hailong Zou, Xiang Chen

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, improve model capacity, sparse activation, capacity under sparse, bottlenecked by redundant

Comments: under review

Abstract:Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to 2.83$\times$ inference speedup and minimal quality degradation.

157. 【2606.15614】Variational Test-time Optimization for Diffusion Synchronization

Link: https://arxiv.org/abs/2606.15614

Authors: Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extend the capabilities, powerful paradigm, coordinates multiple diffusion, Collaborative generation, diffusion

Comments: Preprint. Project website: [this https URL](https://hleephilip.github.io/ScaleEdit/)

Abstract:Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for extending the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test-time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

158. 【2606.15611】Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation

Link: https://arxiv.org/abs/2606.15611

Authors: Fuyou Mao, Beining Wu, Yanfeng Jiang, Bohan Xu, Lixin Lin, Naye Ji, Hao Zhang, Yan Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: planning in oncology, Organ segmentation, critical for quantitative, quantitative analysis, analysis and radiotherapy

Comments: MICCAI 2026

Abstract:Organ segmentation from PET/CT is critical for quantitative analysis and radiotherapy planning in oncology. To ease the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) provides a practical and effective solution for developing deep models with limited labeled data. Recent developments in visual foundation models have demonstrated remarkable adaptability with improved efficiency. In this work, we propose a mutual distillation framework that seamlessly exploits both structural and functional foundation models, which act as modality-specific generalists for distilling knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of student models and the segmentation priors of generalist foundation models, we propose MuDuo, a mutual distillation framework that synergistically leverages SAM-Med3D for CT and SegAnyPET for PET to distill their knowledge into a lightweight student network. Our approach eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at this https URL.

159. 【2606.15608】On the Adversarial Robustness of Multimodal LLM Judges

Link: https://arxiv.org/abs/2606.15608

Authors: Zihan Wang, Guansong Pang, Zelin Liu, Wenjun Miao, Jin Zheng, Xiao Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Abstract:Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, threatening the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when functioning as judges. It covers diverse attacks against popular judge approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods face a critical challenge due to unique constraints in the evaluation protocols of MLLM judges. We further propose MGSIA, namely Manifold-Guided Semantic Induction Attack, a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges yield affirmative responses (e.g., "Yes") to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at this https URL.

160. 【2606.15604】Parameter-Efficient Adaptation of SAM 3 for Automated ITV Generation from 4DCT Images

Link: https://arxiv.org/abs/2606.15604

Authors: Changwoo Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: current Internal Target, Internal Target Volume, Four-dimensional computed tomography, Target Volume contouring, contouring workflows process

Comments: 10 pages, 4 figures, 2 tables

Abstract:Four-dimensional computed tomography (4DCT) captures the full respiratory cycle of thoracic anatomy, yet current Internal Target Volume contouring workflows process each phase in isolation, discarding temporal coherence and leaving contours vulnerable to phase-specific artifacts. We present a lightweight framework that applies parameter-efficient fine-tuning to the Segment Anything Model 3 (SAM 3) via low-rank adaptation (LoRA) to align its text-prompted segmentation with the medical domain using only seven annotated 3D CT volumes. Furthermore, the framework incorporates a hard negative mining strategy to improve boundary discrimination in low-contrast thoracic regions. At inference, phase-wise predictions are refined through phase-coherent temporal filtering and spatial connectivity analysis. Since respiratory motion is continuous and periodic, genuine anatomy appears in contiguous blocks of phases, whereas transient artifacts appear sporadically and are thus effectively suppressed. Experiments on pulmonary and cardiac structures yield median Dice scores of 0.968 and 0.910 with 95th-percentile Hausdorff distances of 0.998 mm and 2.931 mm, respectively. The proposed framework effectively eliminates the severe false-positive predictions inherent in the zero-shot inference of the unadapted SAM 3. With only seven annotated volumes, the framework retains over 95% of full-data accuracy, and the entire pipeline is trainable on a single consumer-grade GPU, demonstrating a scalable, data-efficient solution for adaptive radiotherapy.
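The phase-coherent temporal filtering described in this abstract can be illustrated with a minimal sketch. It assumes binary per-phase masks stacked along the first axis, treats the respiratory cycle as circular, and keeps a detection only if it belongs to a run of at least `min_run` consecutive phases. The function names, the `min_run` parameter, and the union-based ITV are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def phase_coherent_filter(masks: np.ndarray, min_run: int = 2) -> np.ndarray:
    """Keep detections that persist for >= min_run consecutive phases.

    masks: boolean array of shape (T, ...), one binary mask per respiratory phase.
    The phase axis is treated as circular (phase T-1 is adjacent to phase 0).
    Assumes 1 <= min_run <= T.
    """
    masks = masks.astype(bool)
    # window_full[a] is True where the voxel is present in all of the
    # phases a, a+1, ..., a+min_run-1 (indices taken modulo T).
    window_full = np.ones_like(masks)
    for j in range(min_run):
        window_full &= np.roll(masks, -j, axis=0)
    # A detection at phase t survives if at least one full window covers t.
    kept = np.zeros_like(masks)
    for s in range(min_run):
        kept |= np.roll(window_full, s, axis=0)
    return kept

def internal_target_volume(masks: np.ndarray) -> np.ndarray:
    """Hypothetical ITV: union of the filtered masks over all phases."""
    return masks.any(axis=0)
```

A voxel that appears in a single isolated phase is dropped, while one that spans neighboring phases (including the wrap-around from the last phase to the first) is retained.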

161. 【2606.15597】Fusion-E2Pulse: A Multimodal Event-RGB Fusion Network for Non-contact Pulse Wave Reconstruction

Link: https://arxiv.org/abs/2606.15597

Authors: Qian Feng, Hao Guo, Yan Niu, Zhenhuan Xu, Yidi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Non-contact pulse wave, pulse wave reconstruction, wave reconstruction hinges, Non-contact pulse, including the dicrotic

Comments: Accepted by MICCAI 2026. The final version will appear in the official MICCAI proceedings published by Springer

Abstract:Non-contact pulse wave reconstruction hinges on the precise recovery of waveform morphology, including the dicrotic notch. Conventional Red-Green-Blue (RGB)-based methods, which extract physiological signals from recorded facial videos, are constrained by the integral imaging mechanism of standard cameras, where the exposure process induces a smoothing effect that attenuates subtle vascular pulsation details. Conversely, neuromorphic event cameras, while offering exceptional sensitivity to intensity fluctuations, are inherently susceptible to noise and artifacts induced by minor motion. To exploit the synergy between frame-based integration and event-based differential sensing, we propose a novel multimodal network named Fusion-E2Pulse. This framework utilizes filtered RGB signals as structural priors to suppress motion artifacts, while leveraging the high-sensitivity of event streams to recover fine-grained morphological details. Experimental results demonstrate that Fusion-E2Pulse achieves state-of-the-art performance, effectively balancing noise suppression and morphological fidelity, achieving a mean absolute error of 0.78 bpm for heart rate estimation, a waveform correlation of 0.89, and a systolic phase duration error of 16.74 ms, validating its efficacy in reconstructing fine-grained pathological features.

162. 【2606.15594】Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

Link: https://arxiv.org/abs/2606.15594

Authors: Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)

Keywords: safe feedback motion, feedback motion planning, robust model predictive, model predictive control, framework for safe

Abstract:We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.

163. 【2606.15592】DenseControl: Instance-Level Controllable Synthesis of Dense Crowd Image

Link: https://arxiv.org/abs/2606.15592

Authors: Juncheng Wang, Lei Shang, Wang Lu, Baigui Sun, Shujun Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pipeline for generating, Isolated Object Embedding, generating dense crowd, IOE map, Implicit Scale Embedding

Comments: Accepted to IEEE TMM

Abstract:In this paper, we introduce DenseControl, a novel pipeline for generating dense crowd images. Specifically, DenseControl meticulously positions and sizes each generated instance to align precisely with the predefined coordinates and scales. Based on this, we further allow for control over the background, style, and attributes of instances. The motivation behind DenseControl stems from the observation of two main challenges in synthesizing crowd images: controlling signal embedding and maintaining topological integrity when imparting instance scale guidance. To address these, we first introduce the Isolated Object Embedding (IOE) map, a novel representation that facilitates spatial location control while mitigating the difficulties associated with learning projections for the model. Second, we propose an Implicit Scale Embedding (ISE) strategy that seamlessly integrates with the IOE map to encode precise scale information. To further enhance the efficacy of combining ISE with the IOE map, we incorporate a Position Shortcut mechanism that enhances cross-attention to alleviate projection challenges. We evaluate DenseControl through two lenses: synthesis quality and applicability in latent applications. Experiments across different control conditions demonstrate that DenseControl achieves state-of-the-art results in dense crowd image synthesis. Furthermore, we showcase applications in augmenting crowd analysis under data scarcity, transfer learning, and weather generalization scenes, to highlight the practical utility of DenseControl. The codebase will be released.

164. 【2606.15590】Unlocking Diffusion Hierarchies: Adaptive Timestep Selection for Zero-Shot Segmentation

Link: https://arxiv.org/abs/2606.15590

Authors: Ramin Nakhli, Mahesh Ramachandran, Luca Ballan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently shown notable, shown notable improvement, rich visual priors, Stable Diffusion, priors in large-scale

Abstract:Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.

165. 【2606.15574】Toward the Whole Picture: Accumulative Fingerprint Mapping and Reconstruction for Small-Area Mobile Sensors

Link: https://arxiv.org/abs/2606.15574

Authors: Xiongjun Guan, Jianjiang Feng, Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complete fingerprint representation, matching ultimately requires, mobile devices creates, sufficiently complete fingerprint, pose-varying local patch

Abstract:Small-area fingerprint sensing on mobile devices creates a fundamental mismatch between acquisition and recognition: each touch captures only a tiny, pose-varying local patch, while reliable biometric matching ultimately requires a stable and sufficiently complete fingerprint representation. Existing pipelines largely cope with this mismatch by treating repeated touches as independent partial templates, which leads to repeated registration, repeated matching, and no guarantee of adequate global coverage. In this paper, we advocate a different formulation, namely accumulative fingerprint mapping and reconstruction for small-area mobile sensing. Rather than matching every partial patch separately, the proposed perspective converts a sequence of local observations into a unified fingerprint state that is progressively refined as new touches arrive and can be matched only once after consolidation. As a concrete baseline, we present a classical pipeline that performs patch-wise structural feature extraction, feature-level registration and fusion, fingerprint map construction, and phase-based ridge reconstruction. More importantly, we position this baseline within a broader mobile fingerprint framework that integrates structured token learning, two-stage pose reasoning, and diffusion-based generative reconstruction. This viewpoint reframes mobile fingerprint recognition from multi-capture multi-match processing to accumulative map building, state refinement, and one-shot matching, offering a principled route toward efficient, pose-robust, and deployment-friendly biometrics for small-area mobile platforms. The baseline implementation has been publicly released at this https URL.

166. 【2606.15570】An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing

Link: https://arxiv.org/abs/2606.15570

Authors: Yiwei Ma, Ke Ye, Weihuang Lin, Jiayi Ji, Xiaoshuai Sun, Tat-Seng Chua, Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: IIE models, IIE, recent years, notable advancements, automatic alteration

Comments: Accepted by International Journal of Computer Vision (IJCV), 2026

Abstract:In recent years, there have been notable advancements in the area of instruction-based image editing (IIE), which focuses on the automatic alteration of input images using a model. Nevertheless, assessing the effectiveness of these editing models poses a considerable challenge due to the intricate nature of instructions and the wide variety of edits. To tackle this problem, one urgent task in this domain is the development of a robust evaluation framework that can precisely gauge the quality of editing outcomes and offer valuable benchmarks to guide future improvements. To address this challenge, we present a comprehensive evaluation benchmark named I2EBench2.0, designed for single-round and multi-round assessment of IIE models. I2EBench2.0 has four key features: 1) Evaluation Across Single and Multi-rounds: I2EBench2.0 simultaneously evaluates both single-round and multi-round instruction-based edits, assessing the precision and consistency of the edits. 2) Extensive Evaluation Criteria: I2EBench2.0 encompasses a broad range of criteria, evaluating both high-level and low-level aspects of each IIE model. Specifically, it incorporates 16 dimensions for single-round evaluations and 7 for multi-round evaluations. 3) Alignment with Human Judgment: To ensure our benchmark aligns with human evaluation, we conducted a comprehensive user study for each criterion. 4) Research-driven Insights: By analyzing the strengths and weaknesses of current IIE models across all 16 single-round and 7 multi-round dimensions, we provide critical insights aimed at directing future research in this area. We tested eight recently developed IIE models using I2EBench2.0 and derived academic insights through meticulous comparison and analysis. The related code, dataset, and images generated by all IIE models are available on GitHub: this https URL.

167. 【2606.15554】RaLMPH: Reliability-aware Learning for Multi-Pathologist Harmonization in Whole-Slide Image Classification

Link: https://arxiv.org/abs/2606.15554

Authors: Sungrae Hong, Jiwon Jeong, Soeun Cheon, Donghee Han, Sol Lee, Jisu Shin, Kyungeun Kim, Mun Yong Yi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Instance Learning, achieved strong results, Whole-Slide Image, Instance Learning, Multiple Instance

Comments: Accepted by MICCAI 2026

Abstract:Multiple Instance Learning (MIL) is a standard paradigm for Whole-Slide Image (WSI) analysis and has achieved strong results in computational pathology. However, most MIL pipelines assume a single "gold" label per slide, which conflicts with clinical practice where substantial inter-pathologist variability is common. Existing multi-annotator learning and label-refinement methods typically estimate global annotator reliability or rely on single-instance assumptions, making them poorly suited to MIL and to localized diagnostic contexts where experts disagree. We propose RaLMPH (Reliability-aware Learning for Multi-Pathologist Harmonization), a MIL-based label reconciliation framework for WSIs annotated by multiple pathologists. RaLMPH introduces a reliability field that jointly models (i) local neighborhood structure in WSI feature space and (ii) expert uncertainty (entropy), enabling per-sample identification of trustworthy reference neighborhoods. Leveraging this field, RaLMPH performs sample-wise local annotator ranking to select reliable opinions per slide and applies an adaptive gating mechanism to fuse labels conditioned on local reliability. Experiments on a clinical WSI dataset with labels from six pathologists, as well as controlled simulated benchmarks, show that RaLMPH consistently outperforms existing approaches. Further analyses clarify how our reliability-aware mechanism improves label reconciliation and downstream MIL performance.

168. 【2606.15547】EcoBin: A Two-Stage Deep Convolutional Neural Network for Contamination-Aware Waste Classification

Link: https://arxiv.org/abs/2606.15547

Authors: Raghav Senthil Kumar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Waste classification models, highly accurate, accurate at sorting, Waste, contamination

Comments: 7 pages, 8 figures

Abstract:Waste classification models have become highly accurate at sorting waste, often exceeding 95% on benchmark datasets. However, these models fail to account for contamination in recyclable waste. We present EcoBin, a two-stage deep convolutional neural network that classifies household waste by its disposal pathway and that explicitly accounts for contamination. The first stage is a base waste classifier built on an EfficientNetV2-S backbone that assigns each of the thirty waste categories in our dataset to one of four disposal pathways. The second stage is a contamination classifier that inspects any item routed toward recycling and overrides the decision to garbage when contamination is detected. Because no public dataset of contaminated recyclables exists, we synthesize one by segmenting images of clean recyclable objects with a U2-Net model and compositing realistic contamination textures onto their surfaces. The first stage achieves 87.42% test accuracy and a 96.13% pathway-adjusted accuracy. Meanwhile, the contamination stage distinguishes clean from contaminated items with a 0.99 ROC-AUC. On a test set of contaminated recyclables, the complete pipeline routes 24 of 25 items correctly, compared with only 1 of 25 for the base classifier alone. A McNemar's test confirms that the improvement contributed by the contamination stage is statistically significant (p < 0.001).
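To make the significance claim concrete, here is a minimal sketch of an exact (binomial) McNemar's test on paired routing decisions. The abstract does not report the discordant-pair counts or which variant of the test was used, so the counts in the usage note are hypothetical and the function name is an assumption.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items the full pipeline routes correctly but the base classifier does not.
    c: items the base classifier routes correctly but the full pipeline does not.
    Under the null hypothesis, each discordant pair falls in b or c with
    probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # One-sided tail probability P(X <= k), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With hypothetical counts such as `mcnemar_exact(23, 0)`, the p-value falls well below 0.001, which is the order of magnitude one would expect when 24 of 25 items are routed correctly against 1 of 25.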

169. 【2606.15534】Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Link: https://arxiv.org/abs/2606.15534

Authors: Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: camera viewpoint requires, prescribed camera trajectory, viewpoint requires, requires the output, output to follow

Abstract:Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page is available at this https URL.

170. 【2606.15527】Selective Synergistic Learning for Video Object-Centric Learning

Link: https://arxiv.org/abs/2606.15527

Authors: WonJun Moon, Jae-Pil Heo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Typical video object-centric, reconstruction-driven encoder-decoder architectures, approaches employ slot-based, employ slot-based frameworks, video object-centric learning

Abstract:Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at this http URL.

171. 【2606.15486】ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

Link: https://arxiv.org/abs/2606.15486

Authors: Brian Nlong Zhao, Ozgur Kara, Junho Kim, James M. Rehg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: study the problem, aims to generate, produces while observing, gaze, Ranked Probability Score

Abstract:We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze. Project webpage: this https URL
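As a rough illustration of a CRPS-style score built on an arbitrary sequence distance, the sketch below uses the standard sample-based form: mean distance from generated samples to the ground truth, minus half the mean pairwise distance among the samples. The first term rewards accuracy and the second credits diversity. The function names and the point-wise Euclidean distance are assumptions, the exact formulation in the paper is not given in the abstract, and whether the resulting score is a proper scoring rule depends on the distance chosen.

```python
import numpy as np

def sample_crps(samples, target, dist):
    """Sample-based CRPS-style score (lower is better).

    samples: list of generated gaze sequences.
    target:  one observed gaze sequence.
    dist:    callable returning a non-negative distance between two sequences.
    """
    # Accuracy term: how far generated sequences are from the observation.
    accuracy = np.mean([dist(s, target) for s in samples])
    # Spread term: mean distance over all ordered sample pairs (including i == j).
    spread = np.mean([dist(a, b) for a in samples for b in samples])
    return float(accuracy - 0.5 * spread)

def mean_pointwise_distance(a, b):
    """Hypothetical distance: mean Euclidean gap between equal-length (L, 2) paths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.linalg.norm(a - b, axis=-1)))
```

With scalar samples and `dist=lambda a, b: abs(a - b)`, this reduces to the familiar ensemble estimate of the one-dimensional CRPS.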

172. 【2606.15468】Analyzing Visual Aircraft Representations with Sparse Autoencoders

Link: https://arxiv.org/abs/2606.15468

Authors: Deepshik Sharma

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: achieve strong performance, internal representations supporting, difficult to interpret, achieve strong, strong performance

Comments: 18 pages, 4 figures, 7 tables

Abstract:Vision models can achieve strong performance on classification tasks, but the internal representations supporting their predictions are often difficult to interpret. This work investigates whether sparse autoencoders can decompose intermediate representations of a vision model into interpretable features. We train a ConvNeXt classifier on the FGVC-Aircraft dataset, extract spatial activations from its final feature stage, and train a sparse autoencoder on these activations. The learned sparse features are analyzed using top-activating image patches, activation strength, and class selectivity. Qualitative visual inspection reveals that several features correspond to recognizable aircraft structures and visual patterns. We evaluate a subset of selected features using input-space and feature-space ablations, measuring how blurring image patches and suppressing sparse features affect class logits, classification margins, and prediction confidence. The results suggest that sparse autoencoders can reveal partially interpretable, class-relevant visual features associated with aircraft recognition, while also exposing limitations such as polysemanticity and coarse spatial localization.

173. 【2606.15457】Lesion-DDPM: Lesion-Enhanced 3D Diffusion for MS MRI Synthesis

Link: https://arxiv.org/abs/2606.15457

Authors: Weidong Zhang, Yongchan Jung, Shafayat Mowla Anik, Furen Xiao, Vasudevan Janarthanan, Enkhzaya Chuluunbaatar, Byeong Kil Lee, Jeeho Ryoo

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: standard MRI sequences, acquisition protocols, multiple sclerosis, vary across scanners, widely recommended

Abstract:3D FLAIR MRI is widely recommended as one of the standard MRI sequences for brain imaging in multiple sclerosis (MS), but publicly available MS datasets remain relatively small and vary across scanners, acquisition protocols, and lesion patterns. This scarcity and variability hinder the development of robust neuroimaging machine learning models and are particularly challenging for generative models that aim to synthesize images while preserving small, sparse lesions. We propose Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that incorporates multi-level anatomical mask injection together with a lesion-weighted reconstruction loss to emphasize lesion voxels while maintaining global brain structure. Using a curated subset of the MSLesSeg dataset, we compare Lesion-DDPM with representative state-of-the-art GAN- and diffusion-based models, assessing both image-generation metrics and downstream 3D U-Net segmentation. In our experiments, Lesion-DDPM achieved the lowest lesion-region reconstruction error among all methods. In a downstream 3D U-Net lesion segmentation task, a model trained only on Lesion-DDPM-generated scans and evaluated on real MRIs reached a Dice score of 0.616 compared with 0.569 for the best competing synthetic dataset. When Lesion-DDPM images were added to the real training set, the Dice score further increased to 0.685.
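The idea of a lesion-weighted reconstruction loss can be sketched as a weighted mean squared error in which lesion voxels count more than background voxels. This is an illustrative assumption about the general form only: the function name, the weight value, and the normalization are hypothetical and not taken from the paper.

```python
import numpy as np

def lesion_weighted_mse(pred, target, lesion_mask, lesion_weight=10.0):
    """Weighted MSE that emphasizes lesion voxels.

    pred, target: float arrays of identical shape (e.g. a 3D volume).
    lesion_mask:  boolean array of the same shape, True inside lesions.
    lesion_weight: hypothetical multiplier applied to lesion voxels.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Background voxels get weight 1, lesion voxels get lesion_weight.
    weights = np.where(lesion_mask, lesion_weight, 1.0)
    # Normalize by the total weight so the loss stays on the scale of an MSE.
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))
```

Because lesions occupy a small fraction of the volume, an unweighted loss would let errors on them contribute very little; the weighting makes the same error cost proportionally more.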

174. 【2606.15427】Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

Link: https://arxiv.org/abs/2606.15427

Authors: Nicholas A. Welsh, Lennon J. Shikhman, Monty Nehru Attazs, Seemanthini K. Putane, Van Minh Nguyen, Ryan T. White

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spaceborne inspection systems, Spaceborne inspection, deploy perception models, perception models prior, expanding fixed label

Comments: 5 pages, 1 figure, 2 tables. Equal contribution by Nicholas A. Welsh and Lennon Shikhman. Published in the CVPR2026 Workshop on AI4Space

Abstract:Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision-language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of 129 images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves 0.385 mAP@0.5 and 0.267 mAP@0.5:0.95. Performance is strongly scale-dependent: large structural elements like spacecraft bodies (0.639 AP@0.5) and solar arrays (0.598 AP@0.5) localize reliably, while relatively small appendages like antennas (0.221 AP@0.5) and thrusters (0.081 AP@0.5) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to 82% improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.

175. 【2606.15417】From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

Link: https://arxiv.org/abs/2606.15417

Authors: Bessie Dominguez-Dager, Francisco Gomez-Donoso, Miguel Cazorla, Marc Pollefeys, Daniel Barath, Zuria Bauer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires capturing fine-grained, capturing fine-grained transitions, egocentric video requires, video requires capturing, general-purpose Vision-Language Models

Abstract:Action reasoning in egocentric video requires capturing fine-grained transitions of hand-object interactions, a task where general-purpose Vision-Language Models (VLMs) often struggle when operating directly on raw pixels. We propose to decouple visual perception from symbolic reasoning by converting videos into Temporal Action Graphs. In a multi-stage prompting pipeline, we first generate dense natural language narratives over short temporal windows as a semantic bottleneck, then formalize them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, the symbolic representation unlocks efficient in-context learning: few-shot graph demonstrations yield substantial accuracy gains over zero-shot frame and graph-based inference alike. Even in the zero-shot setting, graph-based reasoning remains competitive with pixel-based inference despite potential pretraining contamination favoring the latter. Across 11 open-weight VLMs from 6 model families ranging from 2B to 235B parameters, our findings indicate that current VLMs are more effective as symbolic reasoners than as direct visual observers. By projecting video into the language domain, we provide a scalable, fine-tuning-free alternative to end-to-end approaches that better leverages these models' latent reasoning strengths. The code will be made public.

176. 【2606.15409】Segmentation-based Detection for Efficient Multi-Task Spacecraft Perception

Link: https://arxiv.org/abs/2606.15409

Authors: Sivaperuman Muniyasamy, Surendar Devasundaram

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Space Situational Awareness, Situational Awareness, autonomous on-orbit operations, Vision-based perception, Awareness and autonomous

Comments: 8 pages, 2 figures, 6 tables. CVPRW AI4SPACE-SPARK 2026 Challenge Stream-1 First Place Winners. Code is available at [this https URL](https://github.com/sivaastro/segdet-spark)

Abstract:Vision-based perception is fundamental to Space Situational Awareness and autonomous on-orbit operations such as rendezvous, docking, servicing, and navigation. However, progress in this area is limited by the scarcity of annotated space imagery and by challenging visual-domain characteristics including severe illumination changes, low signal-to-noise ratio, and high contrast. We address Stream 1 of the SPARK 2026 Challenge, which requires a single model for spacecraft classification, detection, and fine-grained component segmentation across multiple target types. We propose a compact architecture that integrates a MobileNetV3 encoder with a U-Net-style decoder, combining computational efficiency with accurate dense prediction. Detection is derived analytically from the union of predicted component masks, avoiding a separate bounding-box regression head in the single-spacecraft setting. Our method achieved an overall leaderboard score of 0.9482, with task-specific scores of 1.0000 in classification, 0.9788 in detection, and 0.8917 in segmentation. The proposed approach ranked second overall in the SPARK 2026 Challenge, demonstrating that lightweight encoder-decoder architectures can deliver strong multi-task performance for practical onboard space vision systems.
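Deriving a detection box from the union of predicted component masks, as this abstract describes for the single-spacecraft setting, can be sketched in a few lines. The function name, the `(x_min, y_min, x_max, y_max)` output convention with inclusive pixel indices, and the `None` return for an empty prediction are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def bbox_from_component_masks(component_masks):
    """Bounding box of the union of per-component binary masks.

    component_masks: boolean array of shape (C, H, W), one mask per component.
    Returns (x_min, y_min, x_max, y_max) in inclusive pixel indices,
    or None if no pixel is predicted for any component.
    """
    # Union over the component axis gives a single spacecraft silhouette.
    union = np.any(component_masks, axis=0)
    rows = np.flatnonzero(union.any(axis=1))  # image rows containing foreground
    cols = np.flatnonzero(union.any(axis=0))  # image columns containing foreground
    if rows.size == 0:
        return None
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])
```

This works only when the image contains one target, since the union cannot separate multiple spacecraft; that matches the single-spacecraft condition stated in the abstract.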

177. 【2606.15389】Timestep Rescheduling in Diffusion Inversion

Link: https://arxiv.org/abs/2606.15389

Authors: Shangquan Sun, Ting Gong, Zhirui Liu, Jiamin Wu, Runkai Zhao, Mianxin Liu, Wenqi Ren, Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian latent space, Gaussian latent, maps images back, latent space, critical task

Comments: Accepted by ICML 2026. 23 pages, including appendices

Abstract:Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixed-point problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.

178. 【2606.15370】MNet++: Extended 2D/3D Networks for Anisotropic Medical Image Segmentation

Link: https://arxiv.org/abs/2606.15370

Authors: Kirsten Odendaal, Rade Bajic

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: convolutional network designed, medical image segmentation, anisotropic medical image, convolutional network, full reproduction

Comments: (none listed)

Abstract:This work demonstrates a full reproduction and extension of MNet, a hybrid 2D/3D convolutional network designed for anisotropic medical image segmentation. The original architecture was re-implemented within the nnU-Net framework to verify its reported performance and robustness to variable voxel spacing, known as anisotropy. Experiments were conducted on PROMISE prostate MRI and a controlled subset of LiTS liver CT under matched preprocessing and compute constraints. The reproduced MNet achieved a Dice similarity coefficient (DSC) of 89.0 +/- 0.9% on PROMISE, within 0.8% of the published result, and 94.3 +/- 1.9% / 54.6 +/- 3.1% for liver and tumor segmentation on LiTS, respectively. Two lightweight extensions were further introduced: (1) a learned Fusion Gating mechanism enabling adaptive 2D-3D feature blending, and (2) a VMamba state-space module for efficient long-range depth modelling. The Spatial Gating variant improved DSC by +0.8% with less than 3% inference overhead, while VMamba improved performance consistency, reducing PROMISE Dice variation to +/- 0.7% and achieving the strongest LiTS liver performance at 95.8% Dice. Both extensions preserved MNet robustness to anisotropy, with delta Dice = 1.5% across 1-4 mm voxel spacing. Overall, the study confirms MNet reproducibility and demonstrates that adaptive fusion and state-space modelling have the potential to further strengthen segmentation reliability under anisotropic conditions. However, further tests are required to provide definitive conclusions.
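For readers unfamiliar with the metric reported above, the Dice similarity coefficient is 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below assumes flattened binary masks as Python lists; the function name `dice_coefficient` and the choice to return 1.0 when both masks are empty are illustrative assumptions, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical four-voxel example: one voxel overlaps, each mask has two voxels.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```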

179. 【2606.15355】Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings

Link: https://arxiv.org/abs/2606.15355

Authors: Christos Chronis, Georgios Th. Papadopoulos, Iraklis Varlamis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: heavy carbon footprint, computationally intensive models, intensive models deployed, increased network traffic, high energy consumption

Comments: (none listed)

Abstract:Face recognition has become a cornerstone of modern AI applications, yet conventional approaches often rely on computationally intensive models deployed in cloud environments, leading to increased network traffic, high energy consumption, and a heavy carbon footprint. This work introduces a sustainable, edge-deployable face recognition framework based on Vector-Quantized Variational Autoencoders (VQ-VAE), which generates compact and semantically rich latent representations of facial images. By leveraging the compression capacity and reconstruction quality of VQ-VAE embeddings on the edge and combining them with the power of pre-trained face embeddings in a knowledge distillation setup, our system achieves comparable accuracy to state-of-the-art face embedding models while significantly reducing memory and computation requirements on the edge, making it suitable for low-power edge devices. The integration of VQ-VAE compression minimizes network overhead while keeping the matching accuracy high by retaining only the most informative facial features in the latent space. As a result, the reconstructed images preserve the key identity characteristics, improving the robustness and overall performance of the face embeddings.

180. 【2606.15351】Facial Affect Analysis for Service-Oriented Systems: Advances, Challenges, and Future Visions

Link: https://arxiv.org/abs/2606.15351

Authors: Spyridon Georgiou, Aggelos Psiris, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Service-Oriented Software Ecosystems, Facial Affect Analysis, stand-alone recognition task, reusable perception capability, Facial Affect

Comments: (none listed)

Abstract:Facial Affect Analysis (FAA) is evolving from a stand-alone recognition task into a reusable perception capability for Service-Oriented Software Ecosystems (SoSE). This paper preserves the FAA methodological core while reframing recent advances through systems-engineering requirements for composable and dependable services. We review representative progress in static and dynamic expression analysis, action-unit and micro-expression modeling, and modern CNN, Transformer, graph, and hybrid architectures, then interpret these advances by their operational fit in edge, cloud, and hybrid service pipelines. The synthesis emphasizes SoSE concerns that determine deployability: service contracts for uncertainty-aware outputs, latency and availability envelopes, lifecycle monitoring and recalibration, governance-aware integration, and interoperability across independently evolving components. Our analysis shows that benchmark gains alone are insufficient for SoSE readiness; robustness under shift, intervention stability, fairness, privacy posture, and runtime guarantees are equally critical. We conclude with a roadmap for treating FAA as an operational service component with explicit interfaces, measurable quality attributes, and accountable lifecycle management.

181. 【2606.15346】DYNA-PRUNER: Input-Adaptive Data-Model Co-Pruning for Efficient and Scalable Spatio-Temporal Media Prediction

Link: https://arxiv.org/abs/2606.15346

Authors: Fuyan Zhang, Yuqi Li, Yingli Tian, Edmond S.L. Ho

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Spatio-temporal prediction supports, prediction supports radar, city-scale traffic monitoring, Spatio-temporal prediction, supports radar

Comments: ICME 2026 Spotlight Paper

Abstract:Spatio-temporal prediction supports radar/satellite nowcasting and city-scale traffic monitoring, but modern models are often too expensive for real-time deployment. This stems from a mismatch between dense computation and strong input-dependent redundancy (e.g., calm seas or clear skies). To enable automated, resource-aware architecture optimization in scalable media analysis, we propose Dyna-Pruner, an end-to-end framework for input-dependent co-pruning of data and model structure. A shared-importance synchronization mechanism generates coupled masks that prune redundant regions and their corresponding computational units (e.g., convolutional filters), yielding per-sample sparse sub-networks at inference time. Experiments on WeatherBench, SEVIR, and TaxiBJ show seamless integration with CNN, RNN, and Transformer backbones, reducing FLOPs by up to $70\%$ and achieving a $2.5\times$ speedup on NVIDIA Jetson AGX Orin with negligible accuracy loss ($1\%$).

182. 【2606.15341】CausalDrive: Real-time Causal World Models for Autonomous Driving

Link: https://arxiv.org/abs/2606.15341

Authors: Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scaling autonomous driving, models fall short, promising paradigm, paradigm for scaling, scaling autonomous

Comments: (none listed)

Abstract:World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

183. 【2606.15328】SGFormer++: Semantic Graph Transformer for Incremental 3D Scene Graph Generation

Link: https://arxiv.org/abs/2606.15328

Authors: Mengshi Qi, Changsheng Lv, Zijian Fu, Xianlin Zhang, Huadong Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point cloud scenes, scene graph generation, parse point cloud, nodes denote detected, denote detected object

Comments: (none listed)

Abstract:In this paper, we propose SGFormer++, a novel Semantic Graph Transformer for 3D scene graph generation (SGG), which aims to parse point cloud scenes into semantic structural graphs, where nodes denote detected object instances and edges encode their pairwise relationships, with the core challenge lying in modeling complex global scene structure. While existing graph convolutional network (GCN)-based methods suffer from over-smoothing and limited receptive fields, SGFormer++ leverages Transformer layers as its backbone to enable global message passing. Specifically, we introduce two key components tailored for 3D SGG: (1) a Graph Embedding Layer++ that efficiently integrates edge-aware global context with linear computational complexity, and (2) a Semantic Injection Layer++ that enriches visual features with linguistic priors from large language models (LLMs) and vision-language models (VLMs), boosting semantic representation without introducing extra trainable parameters. To further address the practical challenge of incremental SGG (I-SGG), where new relationship categories arrive sequentially, we equip SGFormer++ with a novel Spatial-guided Feature Adapter, which calibrates predicate features using subject-object spatial geometry to counter scale variation, and a Cascaded Binary Prediction Head that mitigates catastrophic forgetting via task-incremental classifier expansion and logit distillation. Extensive experiments on the 3DSSG benchmark demonstrate that SGFormer++ achieves state-of-the-art performance in both standard and incremental settings: it yields a significant 4.49% absolute improvement in Predicate A@1 under the incremental setting. Code and data are available at: this https URL.

184. 【2606.15323】PPDM: Pixel Puzzling Diffusion Model for Speed and Memory Efficient Volumetric Medical Image Translation

Link: https://arxiv.org/abs/2606.15323

Authors: Tianqi Chen, Jun Hou, Yinchi Zhou, James S. Duncan, Chi Liu, Bo Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated superior fidelity, GPU memory requirements, prohibitive computational cost, extension to high-resolution, volumes is severely

Comments: 12 pages, 5 figures, 5 tables

Abstract:Diffusion models have demonstrated superior fidelity for medical image-to-image translation, but their extension to high-resolution 3D volumes is severely constrained by prohibitive computational cost and GPU memory requirements. Existing memory-efficient strategies often compromise global volumetric consistency or fine anatomical detail. In this work, we propose the Pixel Puzzling Diffusion Model (PPDM), a simple and effective framework for memory- and speed-efficient 3D medical image translation. PPDM introduces a reversible pixel puzzle-unpuzzle operator that trades spatial resolution for channel dimensionality, substantially reducing activation memory while preserving global context. To further improve efficiency and stability, we adopt a direct bridge diffusion formulation that starts from the conditional input rather than pure noise, enabling the model to focus on task-relevant residuals. In addition, a puzzle-gradient loss is incorporated to enforce spatial coherence and suppress grid-like artifacts introduced by spatial rearrangement. We evaluate PPDM on multiple challenging 3D medical image translation tasks, including low-count PET denoising, joint PET denoising and attenuation correction, and cross-modal MRI translation. Across all tasks, PPDM consistently matches or outperforms full 3D diffusion models while reducing training GPU memory usage by up to an order of magnitude and significantly accelerating inference, and it outperforms existing memory-efficient diffusion approaches based on latent compression or frequency decomposition. These results demonstrate that PPDM provides a practical and scalable solution for high-fidelity 3D diffusion-based medical image translation under limited computational resources.
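The abstract does not spell out the puzzle-unpuzzle operator, so the sketch below uses the standard 3D space-to-depth rearrangement as a stand-in: each r×r×r block of voxels is moved into the channel axis, and the inverse restores the volume exactly. The function names `puzzle` and `unpuzzle`, the `(C, D, H, W)` layout, and the block size `r` are assumptions for illustration and may differ from PPDM's actual operator.

```python
import numpy as np


def puzzle(x, r):
    """Space-to-depth: (C, D, H, W) -> (C * r**3, D // r, H // r, W // r)."""
    c, d, h, w = x.shape
    if d % r or h % r or w % r:
        raise ValueError("spatial dimensions must be divisible by r")
    # Split each spatial axis into (coarse index, offset inside the block).
    x = x.reshape(c, d // r, r, h // r, r, w // r, r)
    # Move the three block offsets next to the channel axis.
    x = x.transpose(0, 2, 4, 6, 1, 3, 5)
    return x.reshape(c * r ** 3, d // r, h // r, w // r)


def unpuzzle(y, r):
    """Exact inverse of puzzle: (C * r**3, D, H, W) -> (C, D * r, H * r, W * r)."""
    cr, d, h, w = y.shape
    c = cr // r ** 3
    y = y.reshape(c, r, r, r, d, h, w)
    # Put each block offset back beside its coarse spatial index.
    y = y.transpose(0, 4, 1, 5, 2, 6, 3)
    return y.reshape(c, d * r, h * r, w * r)


# Hypothetical 2-channel 4x4x4 volume.
volume = np.arange(2 * 4 * 4 * 4).reshape(2, 4, 4, 4)
packed = puzzle(volume, 2)
restored = unpuzzle(packed, 2)
```

The rearrangement discards no voxels: it only trades spatial extent for channels, which is why the round trip is lossless.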

185. 【2606.15320】Conditional Multi-Event Temporal Grounding in Long-Form Video

Link: https://arxiv.org/abs/2606.15320

Authors: Yuanhao Zou, Arthad Kulkarni, Lucas Tonanez, Lincoln Spencer, Guangyu Sun, Tianxingjian Ding, Andong Deng, Yi Li, Shuangjun Liu, Yuan Li, Dashan Gao, Ning Bi, Taotao Jing, Shuai Zhang, Chen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, made rapid progress, applications routinely require, routinely require localizing, Multimodal large

Comments: (none listed)

Abstract:Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy "always-empty" models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.

186. 【2606.15305】CoMNeT: A MedNeXt-CorrDiff Framework for Volumetric Brain Tumor Segmentation

Link: https://arxiv.org/abs/2606.15305

Authors: Michael L. Evans, MD Fayaz Bin Hossen, MD Shibly Sadique, Walia Farzana, Khan M. Iftekharuddin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, quantitative neuro-oncology research, multiparametric magnetic resonance, Accurate brain tumor, Accurate brain

Comments: 10 pages, 4 figures, 2 tables

Abstract:Accurate brain tumor segmentation from multiparametric magnetic resonance imaging (MRI) is critical for treatment planning, response assessment, and quantitative neuro-oncology research. However, automated segmentation remains a difficult task in computer vision because of variation in tumor appearance and MRI protocols across patient scans. Moreover, clinically important regions such as enhancing tumor (ET) and tumor core (TC) are often small relative to the full brain volume, further increasing the difficulty of achieving high voxel-level precision. In this paper, we show that combining a modern 3D convolutional segmentation model with corrective diffusion-based refinement and ensembling improves volumetric glioma segmentation on the UTSW-Glioma dataset. We propose CoMNeT, a MedNeXt-CorrDiff framework that uses four MRI modalities as input and predicts ET, TC, and whole tumor (WT) regions for automated brain tumor segmentation. MedNeXt is used as the primary segmentation model with Global Response Normalization for feature learning, while CorrDiff is trained as a postprocessing residual refinement method to correct errors in the probability maps before final thresholding. Using five-fold cross-validation, CoMNeT achieved the highest Dice score for most tumor regions, with ET, TC, WT, and average Dice scores of 0.7543 +/- 0.0261, 0.6806 +/- 0.0166, 0.9049 +/- 0.0128, and 0.7798 +/- 0.0184, respectively. CoMNeT outperformed two selected baseline models: SegResNet (0.7555 +/- 0.0190 average Dice) and standalone MedNeXt (0.7697 +/- 0.0154 average Dice). Our findings support the use of corrective diffusion and fold-level probability ensembling as practical additions to existing state-of-the-art 3D convolutional models for automated glioma segmentation.

187. 【2606.15304】HemExp: Clinically-Guided Latent Diffusion for Modeling Hematoma Expansion

Link: https://arxiv.org/abs/2606.15304

Authors: Orhun Utku Aydin, Satoru Tanioka, Tzu I Chuang, Alexander Koch, Dimitrios Rallios, Marie Gultom, Begum Tahhan, Fujimaro Ishida, Dietmar Frey, Adam Hilbert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spontaneous intracerebral hemorrhage, neurosurgical care, spontaneous intracerebral, major determinant, determinant of acute

Comments: (none listed)

Abstract:Hematoma expansion (HE) after spontaneous intracerebral hemorrhage (ICH) is a major determinant of acute triage and treatment decisions in neurosurgical care. However, most existing methods provide either a binary expansion risk or a single follow-up volume, limiting uncertainty-aware decisions. We introduce HemExp, a clinically-guided latent diffusion model that generates patient-specific follow-up non-contrast CT images, along with segmentations of intraparenchymal and intraventricular hemorrhage. Generation is conditioned on baseline imaging, clinical variables, and an explicit expansion indicator, enabling controllable simulation of realistic clinical scenarios. HemExp uses a hemorrhage-aware multi-head variational autoencoder and models progression as the difference between baseline and follow-up latent representations with a conditional diffusion model. The model is trained on paired scans from 450 patients across multiple centers and evaluated on 107 patients from a held-out institution. HemExp produces spatial HE probability maps by generating multiple synthetic follow-up images per patient to estimate distributions of plausible follow-up hematoma volumes. Perturbing clinical inputs such as symptom-onset-to-imaging time or anticoagulant status shifts the predicted follow-up volume distribution. HemExp extends binary predictors and demonstrates robust estimation of clinically relevant outcomes in the imaging space, such as hematoma volume, intraventricular involvement, and mass effects. Overall, our results support controllable latent diffusion as a promising direction for uncertainty-aware modeling of early ICH progression.

188. 【2606.15287】G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

Link: https://arxiv.org/abs/2606.15287

Authors: Xianyun Jiao, Jingyi Xu, Zhongmiao Yan, Xieyuanli Chen, Lin Pei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous navigation scenarios, enables camera-only robots, Cross-modal place recognition, pre-built LiDAR maps, enables camera-only

Comments: (none listed)

Abstract:Cross-modal place recognition (CMPR) enables camera-only robots to localize against pre-built LiDAR maps in autonomous navigation scenarios. This image-to-point-cloud setting is challenged by two coupled ambiguities: the modality gap between perspective RGB appearance and sparse metric geometry, and perceptual aliasing among urban places with similar roads, facades, intersections, and object arrangements. Instead of treating CMPR as a single global descriptor matching problem, we argue that reliable retrieval requires both geometry-aware representation alignment and fine-grained candidate verification. In this paper, we propose G2IA, a geometry-guided instance-aware framework for image-to-point-cloud place recognition. In the retrieval stage, visual geometry priors from VGGT and instance features are integrated to construct place descriptors that are more compatible with LiDAR-derived map representations. In the refinement stage, the retrieved candidates are re-ranked by explicitly verifying whether local instance shapes and their relative spatial layouts are consistent across modalities. Experiments on public benchmarks demonstrate that G2IA consistently improves image-to-point-cloud place recognition under different localization thresholds, and exhibits strong cross-dataset generalization.

189. 【2606.15286】Decoupled Motion Representation Learning for Moving Infrared Small Target Detection

Link: https://arxiv.org/abs/2606.15286

Authors: Guoyi Zhang, Peiwen Wu, Han Wang, Xiangpeng Xu, Xiaohu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared small target, small target detection, highly coupled motions, coherent background dynamics, remains challenging due

Comments: (none listed)

Abstract:Infrared small target detection in dynamic scenes remains challenging due to the highly coupled motions among targets, imaging platforms, and dynamic backgrounds. Existing multi-frame methods usually perform implicit temporal modeling, where coherent background dynamics dominate motion correspondence learning, leading to an inherent trade-off between detection and false alarms. In this work, we observe that background motions exhibit strong global coherence, whereas small targets mainly correspond to sparse local motion anomalies. Moreover, many false-alarm responses maintain high consistency with globally coherent motion patterns, indicating that they mainly originate from coherent background dynamics rather than genuine target motions. Based on these observations, we propose a decoupled motion representation learning framework for moving infrared small target detection. Specifically, an explicit motion branch is introduced to model globally coherent motion dynamics using pretrained optical flow priors, together with a structure-preserving self-supervised adaptation strategy for infrared motion correspondence learning. Meanwhile, an implicit motion branch based on deformable feature alignment is designed to capture target-sensitive local motion anomalies under coherent motion guidance. Furthermore, a coherent-motion-guided local anomaly reasoning module is proposed to identify and suppress coherent-motion-induced false responses during localized motion modeling. Extensive experiments on two challenging infrared small target detection benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches, particularly in dynamic scenes with complex motions, while maintaining favorable inference efficiency.

190. 【2606.15282】Enhancing Precision Agriculture with a Hybrid Deep Learning Framework for Multi-Class Plant Disease Classification and Interpretability

Link: https://arxiv.org/abs/2606.15282

Authors: Hasibul Islam Sufi, Ridam Roy, Shayla Alam Setu, Mahimul Islam Nadim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformer, high-resolution leaf imagery, deep learning architecture, hybrid ResNet, study proposes

Comments: (none listed)

Abstract:This study proposes an overall deep learning architecture for multi-class classification of plant diseases from high-resolution leaf imagery, with a particular interest in investigating the behavior of ResNet-50 and a hybrid ResNet + Vision Transformer (ViT) design. A specially gathered image database with 15,200 training images and 3,800 validation images, spanning 38 classes across multiple crops including tomato, apple, and grape, was subjected to preprocessing steps such as resizing, normalization, and data augmentation to enhance model robustness. Multiple architectures, including ResNet-50, MobileNetV2, and EfficientNet-B0, were trained and compared with the hybrid ResNet + ViT model. All models were fine-tuned using the AdamW optimizer and cross-entropy loss, with early stopping applied to prevent overfitting and ensure generalization. Furthermore, interpretability techniques such as Grad-CAM and saliency maps were implemented to indicate disease-relevant regions, while segmentation-based analysis was performed to identify the affected parts of a leaf. Among the considered architectures, ResNet-50 achieved the highest accuracy of 98.74%, whereas the hybrid ResNet + ViT model achieved a competitive accuracy of 98.58%, showing that hybrid architectures were effective in capturing both local and global information. The experimental results showcase the promise of transformer-based models to achieve highly accurate, interpretable, and computationally efficient multi-class, multi-disease classification systems, providing helpful assistance for cultivation management practices as well as for precision farming.

191. 【2606.15275】MamBOA: State-Space Architecture for Video Recognition

Link: https://arxiv.org/abs/2606.15275

Authors: Mustafa Bora Çelik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fine-grained action recognition, dense operators couple, action recognition demands, general-purpose architectures address, operators couple computation

Comments: 15 pages, 7 figures. Code available at [this https URL](https://github.com/BOA-clk/MamBOA)

Abstract:Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.
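The interleaved scan described above can be pictured with a toy sketch. It assumes scalar features per position and a fixed decay, whereas the selective state-space recurrence (S6) uses learned, input-dependent parameters; the helper names `interleave` and `linear_scan` are hypothetical and do not come from the MamBOA code.

```python
def interleave(frame_a, frame_b):
    """Alternate per-position features of two frames: a0, b0, a1, b1, ..."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of positions")
    sequence = []
    for a, b in zip(frame_a, frame_b):
        sequence.extend([a, b])
    return sequence


def linear_scan(sequence, decay):
    """Toy recurrence h_k = decay * h_{k-1} + x_k, returning every state."""
    state = 0.0
    states = []
    for x in sequence:
        state = decay * state + x
        states.append(state)
    return states


# Two hypothetical frames with two positions each.
sequence = interleave([1.0, 2.0], [10.0, 20.0])
states = linear_scan(sequence, decay=0.5)
```

After the second step the state equals `0.5 * 1.0 + 10.0`, so both observations of position 0 sit in one hidden state, one decay step apart. This is the property the abstract attributes to the interleaved ordering.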

192. 【2606.15265】Trusted Multi-View Deep Learning Classification of Fetal Congenital Heart Disease with Feature-level and Decision-level Fusion

Link: https://arxiv.org/abs/2606.15265

Authors: Tan Zhou, Shifa Yao, Suncheng Xiang, Dahong Qian, Baoying Ye

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Congenital heart disease, abnormal anatomical structure, anatomical structure caused, Congenital heart, abnormal development

Comments: (none listed)

Abstract:Congenital heart disease (CHD) refers to the abnormal anatomical structure caused by the abnormal development of the heart and great vessels during embryonic development. Traditional diagnostics often fail to achieve high accuracy and efficiency, especially given the complexity of cardiac anatomy. This study presents a specialized multi-view deep learning framework for CHD binary classification using echocardiographic images. A large-scale CHD dataset, including five views, was used to train the model, enabling it to integrate multi-angle image data. The framework utilizes advanced feature extraction and attention mechanisms to improve diagnostic precision and reliability. An uncertainty-based decision-making component is also integrated to handle low-quality images, enhancing diagnostic outcomes. Experimental results show that this method achieves top-tier performance on our dataset and provides a robust tool for early CHD detection, underscoring its potential for clinical use. The dataset and source code will be released upon paper acceptance.

193. 【2606.15253】Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection

Link: https://arxiv.org/abs/2606.15253

Authors: Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Yu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adapting Detection Transformers, Incremental Object Detection, Adapting Detection, Detection Transformers, Object Detection

Comments: Accepted by ICML 2026

Abstract:Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

194. 【2606.15250】Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

Link: https://arxiv.org/abs/2606.15250

Authors: Zhisen Hu,Antti Kemppainen,David Johnson,Egor Panfilov,Huy Hoang Nguyen,Timothy Cootes,Claudia Lindner,Aleksei Tiulpin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: predicting joint health, Radiographic assessment, total knee arthroplasty, assessment of lower-limb, important for predicting

Comments: Accepted to MICCAI 2026

Click to view abstract

Abstract:Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

195. 【2606.15243】SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation

Link: https://arxiv.org/abs/2606.15243

Authors: Mohamed Jismy Aashik Rasool,Shabir Ahmad,Gisong Oh,Teag Kuen Whangbo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-bit quantization enables, quantization enables deployment, introduces rounding noise, disproportionately degrades high-frequency, Low-bit quantization

Comments: 13 pages, 3 figures, 5 tables, BMVC submission

Click to view abstract

Abstract:Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Existing knowledge distillation (KD) methods apply distillation signals uniformly across all spatial locations, overlooking the varying reconstruction difficulty across image regions. To address this, we propose SPARK (Spatial Policy-driven Adaptive Reinforcement Learning for Knowledge Distillation), a framework that adaptively allocates distillation effort using a lightweight reinforcement learning (RL) policy network. At each training step, a difficulty feature extractor computes four signals, namely Laplacian variance, pixel variance, student reconstruction error, and teacher-student knowledge gap, which are fed into a compact policy CNN that produces a stochastic spatial weight map to modulate the KD loss during quantization-aware training (QAT). SPARK is IR task-agnostic, adds no inference cost, and integrates into any existing QAT pipeline without architectural changes. Experiments on benchmark datasets demonstrate that SPARK consistently outperforms PTQ, QAT, and state-of-the-art (SOTA) KD approaches across multiple student architectures, achieving reconstruction quality closest to the full-precision teacher under significant computational constraints.

196. 【2606.15238】HairLRM: Strand-based Hair Modeling via Large Reconstruction Models

Link: https://arxiv.org/abs/2606.15238

Authors: Yuefan Shen,Yican Dong,Xiufeng Huang,Zhongtian Zheng,Youyi Zheng,Kui Wu

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: simply data scarcity, traditional strand-based modeling, data scarcity, fundamental limitation, limitation of traditional

Comments: ACM SIGGRAPH 2026 Conference Paper

Click to view abstract

Abstract:The fundamental limitation of traditional strand-based modeling is not simply data scarcity, but the ill-posedness of inferring complex 3D fields from 2D imagery without structural constraints. This unconstrained regression leads to catastrophic failures in resolving both global occlusion (e.g., in ponytails) and local directionality (e.g., in curls), resulting in over-smoothed, plausible-but-incorrect geometries. To resolve this, we integrate the strong geometric priors of Large Reconstruction Models (LRMs) into the strand generation pipeline. Using the LRM mesh as a structural anchor, we employ a novel Dual Orientation AutoEncoder to lift coarse geometry into high-fidelity strands. By resolving vector field singularities through latent-space optimization and surface-guided refinement, our method effectively disentangles complex topological structures, setting a new benchmark for robustness and accuracy in hair reconstruction.

197. 【2606.15236】Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Link: https://arxiv.org/abs/2606.15236

Authors: Weichen Fan,Haiwen Diao,Penghao Wu,Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strongly frequency dependent, full-bandwidth noisy images, frequency dependent, trained on full-bandwidth, strongly frequency

Comments: Code link: [this https URL](https://github.com/WeichenFan/Spectral_Forcing)

Click to view abstract

Abstract:Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, the spectral forcing is still competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
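To make the operator described in this abstract concrete, here is a minimal sketch of a time-conditional 2D-DCT low-pass filter, written in pure Python for a small square image. It is not the authors' implementation. The function names, the default `alpha=2.0`, the rounding of $k^{*}(t)$ to an integer cutoff, and the square (per-axis) mask are all assumptions made for illustration; only the contour $k^{*}(t) = (1-t)^{-2/\alpha}$ and the requirement that the filter becomes the identity at $t = 1$ come from the abstract.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n), so that C times C^T is the identity."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def cutoff(t, n, alpha=2.0):
    """Number of DCT frequencies kept per axis at time t in [0, 1].

    Assumption: k*(t) is rounded up and clipped to the image size n.
    """
    if t >= 1.0:
        return n  # identity at the data endpoint
    k_star = (1.0 - t) ** (-2.0 / alpha)
    return min(n, max(1, math.ceil(k_star)))

def spectral_lowpass(image, t, alpha=2.0):
    """Zero every 2D-DCT coefficient outside the cutoff, then invert the DCT."""
    n = len(image)
    c = dct_matrix(n)
    ct = transpose(c)
    coeffs = matmul(matmul(c, image), ct)      # forward 2D DCT: C X C^T
    k = cutoff(t, n, alpha)
    for u in range(n):
        for v in range(n):
            if u >= k or v >= k:               # hypothetical square mask
                coeffs[u][v] = 0.0
    return matmul(matmul(ct, coeffs), c)       # inverse 2D DCT: C^T Y C
```

At `t = 0` only the DC coefficient survives, so the output is a constant image equal to the input mean; as `t` grows the pass band widens until the input passes through unchanged.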

198. 【2606.15202】Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Link: https://arxiv.org/abs/2606.15202

Authors: Marta Vallejo,Siwen Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: plays an important, important role, people perceive, perceive and respond, visual attention plays

Comments: 30 pages, 33 figures. Submitted as a preprint. Code and data available upon reasonable request

Click to view abstract

Abstract:Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected from ten participants viewing 33 scene images representing environments with varying levels of potential risk using Pupil Invisible wearable glasses. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
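Three of the four metrics named in this abstract are short enough to sketch directly. The snippet below is an illustrative pure-Python version that works on flattened maps (lists of non-negative floats); it is not the study's evaluation code. The function names, the `eps` smoothing term, and the KL direction (human map as the reference distribution) are assumptions; AUC-Judd is omitted for brevity.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _std(xs):
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson_cc(a, b):
    """Pearson correlation between two flattened maps of equal length."""
    ma, mb = _mean(a), _mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (_std(a) * _std(b))

def nss(saliency, fixations):
    """Normalised Scanpath Saliency.

    The saliency map is z-scored, then averaged over the pixels where
    `fixations` is truthy (a binary fixation mask).
    """
    m, s = _mean(saliency), _std(saliency)
    hits = [(v - m) / s for v, f in zip(saliency, fixations) if f]
    return _mean(hits)

def kl_divergence(human, model, eps=1e-12):
    """KL(human || model) after normalising both maps to sum to 1."""
    hs, ms = sum(human), sum(model)
    total = 0.0
    for h, m in zip(human, model):
        p, q = h / hs, m / ms
        if p > 0:
            total += p * math.log(p / (q + eps))
    return total
```

Higher is better for correlation and NSS, while lower is better for KL divergence, which is why a model can lead on one family of metrics and trail on the other.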

199. 【2606.15200】Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams

Link: https://arxiv.org/abs/2606.15200

Authors: Yun Wang,Junbin Xiao,Han Lyu,Yifan Wang,Jing Zuo,Zhanjie Zhang,Hong Huang,Dapeng Wu,Angela Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diagnosing User-Centric Continual, User-Centric Continual Spatial, User-Centric Continual, Continual Spatial intelligence, Continual Spatial

Comments: 45 pages. [this https URL](https://icml.cc/virtual/2026/poster/63682)

Click to view abstract

Abstract:We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 8.1K+ timestamped questions for diagnosing User-Centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatially aware and long-form streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code are available at this https URL.

200. 【2606.15198】City landscape in sight: A crowdsourced framework for unlocking urban-scale window view perceptions from real estate imagery

Link: https://arxiv.org/abs/2606.15198

Authors: Chucai Peng,Sijie Yang,Ang Liu,Yang Xiang,Zhixiang Zhou,Filip Biljecki

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: scale remain understudied, urban scale remain, quality of life, remain understudied, viewed through home

Comments:

Click to view abstract

Abstract:City landscapes viewed through home windows influence quality of life, yet perceptions of actual window views at the urban scale remain understudied. This study presents an approach for large-scale mapping of perceptions using 12,334 window view images (WVIs) collected from actual residential properties listed on real estate platforms in Wuhan, China, representing a rarely explored form of urban view imagery that offers advantages over the rendered or simulated window views commonly examined in previous studies. Through a non-immersive virtual reality platform, we collected 27,477 pairwise comparisons across six perceptual dimensions (e.g., Vivid) from 304 participants based on 499 WVIs. A hybrid neural network model was trained to predict human perceptions of all crowdsourced WVIs and map their spatial distribution. Results reveal significant spatial autocorrelation with distinct hot and cold spots across the whole city. Floor level strongly influences human perceptions: while higher floors offer more preferred and extensive window views, lower-floor windows provide residents with quiet and vivid views. An inference model further shows that window view composition matters considerably: high ratios of sky, trees, and low-rise buildings enhance people's preferences and perceptions of vividness, whereas high ratios of high-rise buildings increase perceptions of monotony and oppression. Importantly, these effects are non-linear: the excessive presence of certain elements can alter their impact on human perception. This work advances urban-scale understanding of residents' visual experiences and provides evidence-based guidance for human-centric urban planning and real estate to optimise visual landscapes from windows.

201. 【2606.15188】Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

Link: https://arxiv.org/abs/2606.15188

Authors: Yue Yu,Yang Jiao,Jiayu Wang,Qi Dai,Jingjing Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made notable progress, Instruction-based image, Instruction-based image editing, made notable, notable progress

Comments:

Click to view abstract

Abstract:Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result is still influenced by the randomly sampled initial noise, particularly in complex editing scenarios. An unsuitable initial noise may lead to unsatisfactory editing results. Recent inference-time scaling methods address this issue by sampling multiple initial noises and selecting better candidates. Nevertheless, most of them follow a decode-then-verify scheme which introduces an efficiency-accuracy trade-off. When decoding is performed after limited inference steps, the decoded images often remain too noisy for reliable assessment, whereas sufficiently denoised images require much higher computational cost. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise through a latent-space editing activation map at an early stage. It identifies promising candidates by assessing whether they can induce an effective edit in the correct region. This enables efficient early pruning without decoding latents into images. Building on this, we further develop an adaptive search strategy for inference-time scaling. It allocates inference budgets according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.

202. 【2606.15176】Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

Link: https://arxiv.org/abs/2606.15176

Authors: Weihao Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deployment remains constrained, widely adopted medical, adopted medical modality, medical modality globally, modality globally due

Comments: 15 pages, 4 figures

Click to view abstract

Abstract:Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

203. 【2606.15169】Label Shift Aware Adaptation for Online Zero-shot Learning with Contrastive Language-Image Pre-Training (CLIP)

Link: https://arxiv.org/abs/2606.15169

Authors: Pengxiao Han,Changkun Ye,Yanshuo Wang,Jinguang Tong,Miaohua Zhang,Xuesong Li,Jie Hong,Lars Petersson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Contrastive Language-Image Pre-Training, Contrastive Language-Image, Vision-language models, CLIP, Language-Image Pre-Training

Comments:

Click to view abstract

Abstract:Vision-language models like Contrastive Language-Image Pre-Training (CLIP) have been extensively studied in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, where unknown test samples are predicted sequentially in random order by CLIP while keeping the feature extraction and model parameters fixed during the sequential inference phase. Most existing approaches in this setting address the problem by adapting representations online using incoming test samples, while neglecting the distribution of the data on which CLIP was initially trained. This mismatch can lead to degraded performance when the label distribution in the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), which formulates the online zero-shot classification task as a domain adaptation problem. Specifically, LSA adapts the predictions computed by CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. The extensive experiments across multiple datasets demonstrate that the proposed LSA consistently outperforms state-of-the-art online zero-shot learning methods based on CLIP.
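The abstract does not spell out how LSA performs its label shift correction, so the following is only a sketch of the general idea using the classic EM prior-adaptation procedure for label shift, not the paper's algorithm. It runs in batch mode rather than online, and the uniform source prior used in the usage note is a hypothetical stand-in, since the abstract states that CLIP's source distribution is unknown. All function names are illustrative.

```python
def reweight(probs, source_prior, target_prior):
    """Adjust one predicted distribution p_s(y|x) to a new class prior.

    Each class probability is scaled by target_prior / source_prior and the
    result is renormalised to sum to 1.
    """
    w = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    z = sum(w)
    return [x / z for x in w]

def estimate_target_prior(pred_probs, source_prior, iters=100):
    """EM estimate of the target class prior from unlabeled predictions.

    E-step: reweight every prediction with the current prior estimate.
    M-step: set the new prior to the mean of the reweighted predictions.
    """
    k = len(source_prior)
    target = list(source_prior)
    for _ in range(iters):
        adjusted = [reweight(p, source_prior, target) for p in pred_probs]
        target = [sum(a[c] for a in adjusted) / len(adjusted) for c in range(k)]
    return target
```

Given zero-shot softmax outputs `preds` for a stream of test images, one would call `prior = estimate_target_prior(preds, [1 / k] * k)` and then `reweight(p, [1 / k] * k, prior)` on each prediction, with the model weights left untouched throughout.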

204. 【2606.15167】Variational Network with Wavelet-based UNET in Accelerated MRI Reconstruction from Under Sampled K-space Data

Link: https://arxiv.org/abs/2606.15167

Authors: Yasir Arafat Prodhan(1),Shaikh Anowarul Fattah(1) ((1) Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fully sampled MRI, reduced clinical throughput, long scan times, sampled MRI requires, MRI requires dense

Comments: 14 pages, 9 figures

Click to view abstract

Abstract:Fully sampled MRI requires dense k-space acquisition, leading to long scan times, reduced clinical throughput, and increased sensitivity to patient motion. Accelerated MRI addresses this by acquiring undersampled k-space data and reconstructing the missing information computationally. However, reconstruction from undersampled measurements is highly ill-posed and can introduce aliasing artifacts, noise amplification, and loss of anatomical detail. Although conventional parallel imaging and compressed sensing methods mitigate these issues, and deep learning methods have further improved reconstruction quality, preserving high-frequency structures under aggressive undersampling remains challenging. In this work, we propose a Variational Network with a Wavelet-based U-Net (W-UNet) for accelerated MRI reconstruction. The framework combines physics-guided iterative reconstruction with learnable multi-scale frequency representations. Standard pooling operations are replaced with Discrete Wavelet Transform and Inverse Wavelet Transform modules, enabling lossless downsampling while preserving low-frequency structure and high-frequency edge details. Integrated into the refinement and sensitivity map estimation stages, the proposed design improves artifact suppression, feature preservation, and reconstruction fidelity in both single-coil and multi-coil settings. Experiments on fastMRI knee and M4Raw brain datasets show state-of-the-art performance. Ablation studies further confirm the effectiveness of wavelet-based feature decomposition for accelerated MRI reconstruction.
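The claim that wavelet downsampling is lossless, unlike pooling, can be checked with a one-level 2D Haar transform. The sketch below is a generic illustration in pure Python, not the W-UNet module from the paper: the choice of the Haar wavelet, the sub-band names, and the assumption of even image dimensions are mine.

```python
def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an even-sized image.

    Returns four half-resolution sub-bands: approximation plus three detail
    bands. Together they hold exactly as many values as the input.
    """
    rows, cols = len(x) // 2, len(x[0]) // 2
    approx = [[0.0] * cols for _ in range(rows)]
    horiz = [[0.0] * cols for _ in range(rows)]
    vert = [[0.0] * cols for _ in range(rows)]
    diag = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            approx[i][j] = (a + b + c + d) / 2   # low-frequency structure
            horiz[i][j] = (a - b + c - d) / 2    # difference across columns
            vert[i][j] = (a + b - c - d) / 2     # difference across rows
            diag[i][j] = (a - b - c + d) / 2     # diagonal difference
    return approx, horiz, vert, diag

def haar_idwt2(approx, horiz, vert, diag):
    """Exact inverse of haar_dwt2: rebuilds the full-resolution image."""
    rows, cols = len(approx), len(approx[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            ll, lh, hl, hh = approx[i][j], horiz[i][j], vert[i][j], diag[i][j]
            x[2 * i][2 * j] = (ll + lh + hl + hh) / 2
            x[2 * i][2 * j + 1] = (ll - lh + hl - hh) / 2
            x[2 * i + 1][2 * j] = (ll + lh - hl - hh) / 2
            x[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2
    return x
```

A max- or average-pooling layer would keep only something like `approx` and discard the three detail bands, which is where edge information lives; keeping all four bands as channels halves the spatial resolution without discarding anything.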

205. 【2606.15162】GeoStream: Toward Precise Camera Controlled Streaming Video Generation

Link: https://arxiv.org/abs/2606.15162

Authors: Yizhou Zhao,Yifan Wang,Xiaoyuan Wang,Yushu Wu,Hao Zhang,Moayed Haji-Ali,Rameen Abdal,Ashkan Mirzaei,Yanyu Li,Willi Menapace,Laszlo Jeni,Sergey Tulyakov,Peter Wonka,Chaoyang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate interactive camera, existing approaches learn, camera motion implicitly, learn camera motion, approaches learn camera

Comments:

Click to view abstract

Abstract:Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
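The unproject-then-reproject step mentioned in this abstract is standard pinhole-camera geometry. The sketch below shows it for a single pixel in pure Python; it is an illustration under stated assumptions, not GeoStream's code. It assumes shared intrinsics `(fx, fy, cx, cy)` for both views, a depth value measured along the camera z-axis, and a pose given as a rotation matrix and translation that map source-camera coordinates to target-camera coordinates. All names are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in the source camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def transform(point, rotation, translation):
    """Apply a rigid transform: rotation (3x3 nested list) then translation."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def project(point, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates; None if it is behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Source pixel + depth -> 3D point -> pixel location in the target view."""
    p_src = unproject(u, v, depth, *intrinsics)
    p_tgt = transform(p_src, rotation, translation)
    return project(p_tgt, *intrinsics)
```

Running `reproject_pixel` over every pixel of the latest generated frame, using its estimated depth map, yields the sparse point reprojections that serve as geometric conditioning; pixels that return `None` or land outside the image simply leave holes for the generator to fill.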

206. 【2606.15160】DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

Link: https://arxiv.org/abs/2606.15160

Authors: David Huang,Lianlei Shan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: large language models, multimodal large language, recent years, large language, improved considerably

Comments: Preprint. 9 pages main text, 15 pages total including appendix, 2 figures

Click to view abstract

Abstract:Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.

207. 【2606.15158】RefGC-SR$^2$: Reference-guided Generated Content Super-Resolution and Refinement

Link: https://arxiv.org/abs/2606.15158

Authors: Jeahun Sung,Dahyeon Kye,Soo Ye Kim,Jihyong Oh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: current pipelines share, reference-guided generated content, progressed rapidly, fundamental limitation, provided by users

Comments: The first two authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at [this https URL](https://cmlab-korea.github.io/RefGC-SR2/)

Click to view abstract

Abstract:Reference-guided generation (e.g., object compositing, customization) has progressed rapidly, yet current pipelines share a fundamental limitation: the object-centric high-resolution reference image (HRRI) provided by users is downsampled to a fixed low-resolution (LR) before being fed into the model, so the fine-grained details are discarded before the output is even produced. In addition, the generation step then introduces its own artifacts (e.g., identity distortion) on top of this loss. Existing reference-guided generated content refinement (RefGCR) methods can correct some of these artifacts but still operate in the LR domain; reference-guided super-resolution (RefSR) methods recover resolution but assume natural-image degradations and ignore the artifact distribution of generative pipelines. To address both gaps in a single formulation, we introduce a new task: reference-guided generated content super-resolution-refinement (RefGC-SR$^2$), where the original HRRI is reused at the post-processing stage to recover lost details, refine generative artifacts, and upscale the output simultaneously. We construct the first real-world triplet data generation pipeline for this RefGC-SR$^2$ task, training a diptych-conditioned generator to synthesize paired low-quality anchors that public pretrained models cannot provide. We further present a frequency-aware diffusion transformer model for RefGC-SR$^2$ that selectively injects fine details from the HRRI while removing generative artifacts. Extensive experiments demonstrate that our RefGC-SR$^2$ model successfully (i) refines the object identity faithfully with respect to the reference, and (ii) recovers high-resolution details, so that the final result is significantly higher quality and practically more usable compared to existing RefGCR and RefSR baselines.

208. 【2606.15151】HiRo: A Compact Four-Directional Hierarchical Reservoir Token-Mixer for Efficient Image Classification

Link: https://arxiv.org/abs/2606.15151

Authors: Md Farhadul Islam,Ishan Thakkar,J. Todd Hastings

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Recent image classification, local feature modeling, balance local feature, Recent image, feature modeling

Comments: Accepted at ICONS 2026

Click to view abstract

Abstract:Recent image classification models must balance local feature modeling, cross-window interaction, and parameter efficiency. Many high-performing architectures rely on fully trainable token-mixers, which improve representation learning but increase parameter count, optimization complexity and computational cost. We propose a parameter-efficient image classification model called HiRo that integrates shifted-window partitioning with multi-directional hierarchical reservoir computing. Images are divided into non-overlapping patches (treated as tokens), linearly projected, normalized, and enriched with 2D sinusoidal positional encodings, then processed within local windows. Inside each window, tokens are scanned in four directions and passed through a two-stage slice-and-mix reservoir module. In the first stage, directional sequences are split into contiguous slices, each processed by its own fixed reservoir with a trainable closed-loop readout. The resulting slice outputs are summarized using the start, end, and mean representations, and then mixed by a second-stage fixed reservoir for each direction. The mixed slice representations are expanded back to the token level and fused with the first-stage outputs, after which the four directional outputs are realigned and averaged. Consecutive blocks alternate between regular and shifted windows to enable cross-window interaction, followed by layer normalization, a residual feed-forward network, and global pooling for classification. This design combines regular and shifted window partitioning with hierarchical multi-directional reservoirs to make an efficient local-to-cross-window token-mixing framework for image classification. Despite using under 1M trainable parameters and significantly lower memory and time than transformer-style baselines, HiRo also achieves 99.46%, 85.57%, and 59.10% accuracy on MNIST, CIFAR-10, and CIFAR-100, respectively.

209. 【2606.15142】MotionVLA: Vision-Language-Action Model for Humanoid Motion

Link: https://arxiv.org/abs/2606.15142

Authors: Nonghai Zhang,Siyu Zhai,Yanjun Li,Zeyu Zhang,Zhihan Yin,Yandong Guo,Boxin Shi,Hao Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Generating realistic humanoid, Generating realistic, low-frequency pose semantics, realistic humanoid motion, high-frequency physical dynamics

Comments:

Click to view abstract

Abstract:Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: this https URL. Website: this https URL.
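The energy-compaction statistic quoted in this abstract (a few DCT coefficients capturing most position energy but far less velocity energy) can be reproduced in spirit on a toy signal. The sketch below assumes the DCT is taken along the time axis of a single joint coordinate and that velocity is a first-order finite difference; neither detail is confirmed by the abstract, and the toy trajectory is made up for illustration.

```python
import math

def dct_ii(signal):
    """Orthonormal DCT-II of a 1D sequence (O(n^2), fine for short clips)."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(signal)
        ))
    return out

def low_freq_energy_fraction(signal, k):
    """Share of total signal energy held by the first k DCT coefficients."""
    coeffs = dct_ii(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:k]) / total

def finite_difference(signal):
    """First-order velocity estimate: difference between consecutive frames."""
    return [b - a for a, b in zip(signal, signal[1:])]

# Toy joint coordinate: a slow drift plus a small frame-to-frame jitter.
position = [i / 32 + 0.05 * (-1) ** i for i in range(32)]
velocity = finite_difference(position)
position_share = low_freq_energy_fraction(position, 5)
velocity_share = low_freq_energy_fraction(velocity, 5)
```

Differencing amplifies the jitter relative to the drift, so `velocity_share` comes out well below `position_share`. A single codebook fitted to both signals would be dominated by the low-frequency position statistics, which is the mismatch the dual-stream tokenizer is meant to address.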

210. 【2606.15134】Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Link: https://arxiv.org/abs/2606.15134

Authors: Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: training pair reduces, differed or matched, typically trained, trained with class-label, Relative Policy Optimization

Abstract:Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.

211. 【2606.15133】DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Link: https://arxiv.org/abs/2606.15133

Authors: Tianshan Zhang, Yijia Duan, Yanjun Li, Zeyu Zhang, Hao Tang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: compliant contact patterns, important for household, parallel-jaw grasping, patterns beyond parallel-jaw, contact

Comments: Code: [this https URL](https://github.com/AIGeeksGroup/DragMesh-2). Website: [this https URL](https://aigeeksgroup.github.io/DragMesh-2)

Abstract:Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand-handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand-object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand-object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand-object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.

212. 【2606.15129】EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP-OCT Pretraining

Link: https://arxiv.org/abs/2606.15129

Authors: Zhuo Deng, Ruiheng Zhang, Ziheng Zhang, Weihao Gao, Yitong Li, Qian Wang, Lei Shao, Jiaoyue Dong, Zhixi Zeng, Lijian Fang, Haibo Wang, Xiaobin Lin, Tao Liu, Zhicheng Du, Zhengwei Zhang, Lin Yang, Zheng Gong, Xinyu Zhao, Zhenquan Wu, Fang Li, Zhiguang Zhou, Guoming Zhang, Sun Jing, Han Lv, Wenbin We, Lan Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Color fundus photography, Color fundus, depth-resolved structural information, CFP, CFP representations

Abstract:Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP-OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP-OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs. 0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

213. 【2606.15118】Multi-view feature High-order Fusion for Space Weak Object Detection and Segmentation

Link: https://arxiv.org/abs/2606.15118

Authors: Weilong Guo, Yuhan Sun, Shengyang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Weak objects, common in images, multi-view, MHF, Weak

Abstract:Weak objects are common in images and videos of space applications. However, it is hard to learn proper representations from their limited appearance information. Inspired by multi-view learning, we develop simple multi-view attentions, treating their outputs as multi-view features. We also propose a multi-view feature high-order fusion method (MHF) to aggregate more accurate and richer features of weak objects. MHF extends the commonly used low-order feature fusion method to higher orders, which enhances the model's capacity to capture relevant and complementary information about weak objects. This is achieved by introducing high-order multi-view feature perception and a recursive, task-contribution-gated selection of multi-view features. The new operation is highly flexible and customizable, and it is compatible with various multi-view feature representations. We conduct extensive experiments on two newly constructed space science datasets and an open, large-scale satellite video dataset. MHF serves as a plug-and-play module and significantly improves various vision transformers and convolution-based detection and segmentation models. We achieve state-of-the-art accuracy on both tasks across all three datasets. MHF can be a new basic module for visual modeling that effectively represents weak objects in terms of multi-view learning. The code will be available at this https URL.

214. 【2606.15117】Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

Link: https://arxiv.org/abs/2606.15117

Authors: Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: realistic deepfake media, encompassing the manipulation, rapid advancement, advancement of generative, deepfake media

Abstract:The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing the generalization ability through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism utilizing a teacher-student framework to enhance the model's ability to perform and generalize effectively across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is efficient in domain adaptation, improving the model's AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets, using only a small portion of each to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.

215. 【2606.15112】Learn Temporal Consistency For Robust Satellite Video Detector

Link: https://arxiv.org/abs/2606.15112

Authors: Weilong Guo, Shengyang Li, Yanfeng Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Satellite video object, plays an important, important role, fine-grained objects plays, video object detection

Comments: 11 pages, 8 figures

Abstract:Satellite video object detection (SVOD) for oriented and fine-grained objects plays an important role in satellite applications. Most existing SVOD methods only focus on one or a few coarse-grained categories of moving objects and represent objects with horizontal bounding boxes. They have difficulty extracting complete, accurate, and consistent information about objects in whole satellite videos. In this paper, we propose a satellite video object detection framework based on Temporal Consistency Learning (TCL). TCL adeptly detects oriented and fine-grained objects by leveraging the rich temporal contexts within satellite videos. The framework integrates three key modules: temporal and fine-grained feature aggregation (TFA), structure encoding (SE), and temporal consistency constraint (TCC). TFA and TCC modules facilitate consistent representation learning across frames, while the SE module encodes both appearance and structural information for precise fine-grained recognition. Experimental results on the SAT-MTB benchmark dataset demonstrate TCL's superior performance, achieving a new state-of-the-art oriented and fine-grained detection accuracy of 47.7% mAP, a 4.8% improvement over the baseline. Furthermore, our TCL framework readily accommodates existing image-based detectors, leading to enhanced detection accuracies.

216. 【2606.15110】Physics-Driven Zero-Shot MRI Reconstruction with Non-local Image Priors

Link: https://arxiv.org/abs/2606.15110

Authors: Lingtong Zhang, Wenlei Li, Mu He, Li Xiao, Yang Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic Resonance Imaging, accelerated Magnetic Resonance, Resonance Imaging, Magnetic Resonance, fully-sampled external datasets

Abstract:Zero-Shot Self-Supervised Learning (ZS-SSL) has emerged as a promising paradigm for accelerated Magnetic Resonance Imaging (MRI) reconstruction, eliminating the reliance on fully-sampled external datasets. However, learning solely from a single under-sampled scan suffers from supervision scarcity and optimization instability, often leading to overfitting or artifacts. To address these challenges, we propose a robust physics-driven ZS-SSL framework that synergizes physical consistency with image-domain non-local priors. Our method introduces three core innovations: (1) a Coil Sensitivity Map (CSM)-Guided Dynamic Repository, which stabilizes the training trajectory by filtering physically inconsistent artifacts based on coil sensitivity constraints; (2) a SPIRiT-based regularization, which enforces k-space self-consistency via a learned correlation kernel and stochastic masking; (3) a Non-Local Self-Similarity (NSS) Pixel Bank, which leverages the high-fidelity reference established by the former modules to explicitly mine non-local anatomical similarities, thereby augmenting supervision in the image domain. Extensive experiments on the FastMRI dataset demonstrate that our approach achieves state-of-the-art performance, particularly under high acceleration factors, effectively bridging the gap between zero-shot learning and supervised methods. The code is available at this https URL.

217. 【2606.15104】Text-Driven Fusion for Infrared and Visible Images: Achieving Image Scene Adaptation on Hyperbolic Space

Link: https://arxiv.org/abs/2606.15104

Authors: Huan Kang, Hui Li, Tianyang Xu, Tao Zhou, Xiao-Jun Wu, Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate complementary modalities, impose rigid distance, distort multi-modal interactions, existing Euclidean methods, Euclidean methods impose

Comments: 14 pages, 8 figures

Abstract:Infrared and visible image fusion aims to integrate complementary modalities, while existing Euclidean methods impose rigid distance metrics that distort multi-modal interactions and parent-to-child semantic hierarchies. To overcome these limitations, we introduce a text-driven fusion framework empowered by hyperbolic manifold learning. During training, BLIP-extracted text prompts serve as topological anchors within the hyperbolic space, guiding vision-attribute alignment through hyperbolic embeddings that naturally accommodate varying semantic granularities. By exploiting the exponential volume growth dictated by the Poincaré ball's negative curvature, this approach seamlessly embeds hierarchical trees to encode coarse-to-fine semantics without metric saturation, while the vast peripheral space prevents texture distortion during cross-modal fusion. At inference, the fusion process autonomously adapts to input content using the learned text-attribute priors, completely eliminating the need for textual input. Experimental results show our method outperforms state-of-the-art approaches on benchmark datasets, with code available at this https URL.

218. 【2606.15099】Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.15099

Authors: Dianqiao Lei, Lianlei Shan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: models predominantly rely, Variable Alignment VLA, perception and action, predominantly rely, bridge perception

Comments: Accepted at ICML 2026

Abstract:Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
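
The early-exit idea in this abstract, stopping latent reasoning once a confidence estimate is high enough, can be sketched generically as below. This is only a minimal illustration under stated assumptions, not AVA-VLA's mechanism: `step_fn`, `confidence_fn`, the threshold of 0.9, and the toy scalar state are all hypothetical stand-ins.

```python
def reason_with_early_exit(step_fn, confidence_fn, state, max_steps=8, threshold=0.9):
    """Run latent reasoning steps until confidence reaches the threshold.

    step_fn and confidence_fn are hypothetical callables standing in for a
    latent update and a confidence head; neither comes from the paper.
    Returns the final state and the number of steps actually used.
    """
    steps_used = 0
    while steps_used < max_steps:
        state = step_fn(state)
        steps_used += 1
        if confidence_fn(state) >= threshold:
            break  # confident enough: stop reasoning and act
    return state, steps_used


def toy_step(value):
    """Toy latent update: move halfway toward 1.0 on each step."""
    return value + 0.5 * (1.0 - value)


def toy_confidence(value):
    """Toy confidence: the scalar state itself."""
    return value


final_state, used = reason_with_early_exit(toy_step, toy_confidence, 0.0)
print(final_state, used)
```

In the toy run the state passes through 0.5, 0.75, 0.875, and 0.9375, so the loop exits after four of the eight allowed steps. Raising the threshold trades speed for more reasoning steps.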

219. 【2606.15072】Texture-Shape Bias Balancing for Robust Synthetic-to-Real Semantic Segmentation in Automotive NIR Imagery

Link: https://arxiv.org/abs/2606.15072

Authors: Felix Stillger, Ben Hamscher, Lukas Hahn, Annika Mütze, Tobias Meisen, Kira Maag

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern automotive systems, enabling pixel-level scene, pixel-level scene understanding, enabling pixel-level, Semantic segmentation

Comments: Accepted at ECML PKDD 2026 (ADS Track)

Abstract:Semantic segmentation is a fundamental component of visual perception in modern automotive systems, enabling pixel-level scene understanding. Near-Infrared imaging (NIR) offers stable detection under difficult illumination conditions, but the development of domain-specific semantic segmentation models remains challenging due to the lack of high-quality annotated data from real-world scenarios. Synthetic datasets offer a scalable alternative, but models trained on synthetic images often suffer performance degradation when transferred to real domains. We present the first systematic study on synthetic to real domain adaptation for semantic segmentation in NIR images in the automotive domain. We propose a generative augmentation framework that transforms synthetic images into realistic NIR-style variants via our introduced target style adaptation (TSA). TSA fine-tunes a latent diffusion model via low-rank adaptation on a small curated set of real NIR images and applies it to synthetic training data using structure-preserving multi-signal conditioning. To reduce texture bias and improve segmentation robustness, we further apply a Voronoi-based style diversification strategy (VSD) that modifies the original textures while preserving scene geometry. Experiments with multiple model architectures on NIR data from vehicle interiors and street scenes show that balancing inductive bias during training leads to noticeably more robust semantic segmentation and effectively reduces the domain gap in our real-world scenarios by up to 63.6% on exterior and 28.4% on interior data. The code is available at GitHub.

220. 【2606.15055】Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

Link: https://arxiv.org/abs/2606.15055

Authors: Xinze Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: underpins evidence-based decisions, streetscapes underpins evidence-based, public health, Visual perception, underpins evidence-based

Abstract:Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1-point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094, a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.
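
The abstract mentions a worst-region sample-reweighting objective without giving its form. One common way to emphasise the worst-performing group, shown here purely as an assumed stand-in and not as HVSP-LL's objective, is a softmax over per-region mean losses. The function names and the temperature value are hypothetical.

```python
import math


def region_weights(region_losses, temperature=0.1):
    """Softmax over per-region mean losses.

    A lower temperature concentrates weight on the worst region.
    """
    names = list(region_losses)
    scaled = [region_losses[name] / temperature for name in names]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


def reweighted_loss(sample_losses, sample_regions, temperature=0.1):
    """Weighted sum of region mean losses, biased toward the worst region."""
    by_region = {}
    for loss, region in zip(sample_losses, sample_regions):
        by_region.setdefault(region, []).append(loss)
    means = {region: sum(v) / len(v) for region, v in by_region.items()}
    weights = region_weights(means, temperature)
    return sum(weights[region] * means[region] for region in means)


# Hypothetical batch: region "b" is predicted far worse than region "a".
losses = [0.2, 0.2, 0.9, 0.9]
regions = ["a", "a", "b", "b"]
print(reweighted_loss(losses, regions))
```

Compared with the plain batch mean, this objective is dominated by the region with the highest loss, which is the intuition behind constraining inter-region gaps.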

221. 【2606.15049】Gaussian Spatial Priors for Anatomy-Aware Object Detection in Surgical Videos

Link: https://arxiv.org/abs/2606.15049

Authors: Yunfan Li, Artem Shmelev, Himanshu Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Detecting anatomical structures, Myopectineal Orifice, Critical View, View of Myopectineal, intraoperative safety frameworks

Abstract:Detecting anatomical structures in surgical video is essential for intraoperative safety frameworks such as the Critical View of Myopectineal Orifice (CVMPO) in inguinal hernia repair. While prominent structures like the Cooper's Ligament and Triangle of Doom are reliably detected by standard methods, smaller structures such as the epigastric vessels remain challenging due to their visual ambiguity and intermittent visibility. We observe that the spatial relationship between structures is anatomically constrained, and propose a Gaussian Spatial Prior (GSP) module that encodes this relationship as a compact, parametric bias injected into the self-attention of a DAB-DETR decoder. The prior is computed offline from training annotations as a small set of frozen Gaussian parameters and recomputed at each decoder layer using the iteratively refined reference points. On a dataset of inguinal hernia repair videos with 5-fold cross-validation, GSP improves dependent class detection by $+33.5\%$ ($\text{AP}_{50}$) over DAB-DETR and $+53.9\%$ over YOLOv26, while also improving anchor detection by $+6.0\%$. These gains are statistically significant across all folds ($p=0.012$, paired $t$-test).
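
To make the idea of a Gaussian bias on attention concrete, the sketch below adds a frozen Gaussian log-prior over the offset between each reference point and an anchor structure to raw attention logits. It is an assumption-laden illustration, not the GSP module: the axis-aligned Gaussian, the function names, and the toy coordinates are all hypothetical.

```python
import math


def gaussian_log_prior(point, anchor, mean_offset, std):
    """Log-density, up to a constant, of an axis-aligned 2-D Gaussian
    over the offset between a reference point and the anchor."""
    dx = (point[0] - anchor[0] - mean_offset[0]) / std[0]
    dy = (point[1] - anchor[1] - mean_offset[1]) / std[1]
    return -0.5 * (dx * dx + dy * dy)


def biased_attention(logits, reference_points, anchor, mean_offset, std):
    """Softmax over attention logits plus the additive Gaussian bias."""
    biased = [
        logit + gaussian_log_prior(point, anchor, mean_offset, std)
        for logit, point in zip(logits, reference_points)
    ]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical normalised image coordinates for three reference points.
points = [(0.1, 0.1), (0.5, 0.6), (0.9, 0.2)]
anchor = (0.4, 0.4)       # centre of a reliably detected structure
mean_offset = (0.1, 0.2)  # frozen mean offset, assumed fitted offline
std = (0.1, 0.1)          # frozen standard deviations, assumed fitted offline

weights = biased_attention([0.0, 0.0, 0.0], points, anchor, mean_offset, std)
print(weights)
```

With equal raw logits, the reference point lying at the anchor plus the mean offset receives almost all of the attention. Recomputing the bias as the reference points are refined would correspond to the per-layer update described in the abstract.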

222. 【2606.15048】Temporal Difference Learning for Diffusion Models

Link: https://arxiv.org/abs/2606.15048

Authors: Qizhen Ying, Yangchen Pan, Victor Adrian Prisacariu, Junfeng Wen

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: local denoising targets, individual time steps, adjacent pairs, typically trained, focus on local

Comments: 15 pages, 4 figures. Accepted at ICML 2026

Abstract:Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.

223. 【2606.15037】ReportQA: QA-Based Radiology Report Evaluation

Link: https://arxiv.org/abs/2606.15037

Authors: Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: advancing automated report, automated report generation, essential for advancing, advancing automated, Radiology report

Abstract:Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Because CE metrics rely heavily on manual annotations, they are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework that supports detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.

224. 【2606.15019】Towards Global AI-Driven Cervical Cancer Screening

Link: https://arxiv.org/abs/2606.15019

Authors: Thuy Nuong Tran, Ömer Sümer, Evangelia Christodoulou, Lennart Nauschütte, Simon Kalteis, Martin Paulikat, Esmira Pashayeva, Klara Steinheuer, Isabella Borges, Piotr Kalinowski, Hermann Bussmann, Sieng Sokmney, Poeung Kuong, Sathiarany Vong, Achim Schneider, Magnus von Knebel-Doeberitz, Patrick Godau, Lena Maier-Hein

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: World Health Organization, key public health, public health goal, Health Organization, World Health

Comments: 20 pages, 9 figures

Abstract:The global elimination of cervical cancer is a key public health goal set by the World Health Organization (WHO), with screening programs reducing mortality by up to 80%. However, access to experts and biopsy services is limited in low- to middle-income countries (LMICs). Deep learning (DL)-based algorithms offer promising support for screening, but most existing approaches have been developed and validated on private datasets from single countries. We present the first DL-based approach to cervical cancer screening validated on data from multiple countries. Technically, we phrase the problem of detecting and classifying lesions in colposcopy images as a multi-task learning problem, in which we simultaneously perform image-level classification and lesion segmentation. Our model was trained on a private data set of acid stain colposcopy images with manually generated lesion segmentation masks and corresponding histopathological results, employing extensive data augmentation to address image variability. In an in-distribution validation with pathology results serving as ground truth, our algorithm outperformed medical experts (Balanced Accuracy: 0.68 vs 0.64) in CIN1- (Cervical intraepithelial neoplasia grade 1 or lower) versus CIN2+ (grade 2 or higher) classification. External validation on four colposcopy data sets from four countries featuring radical differences in prevalence and patient characteristics yielded superior performance of our method compared to baseline methods. Performance variability across countries was high, with AUC values ranging from 0.54 to 0.80. Overall, algorithm performance varied with age, transformation zone (cervical area most prone to lesion development), presence of comorbidities and pathognomonic signs, with comorbidities having by far the largest negative effect. Future work should focus on improving model robustness and generalizability.

225. 【2606.15015】NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Link: https://arxiv.org/abs/2606.15015

Authors: Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Yixiong Jing

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain physically consistent, generation requires controllable, Physics-grounded video generation, requires controllable, external forcing

Comments: 18 pages, 4 figures, 6 tables. Preprint

Abstract:Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

226. 【2606.14972】ReGenHuman: Re-Generating Human Appearances for Realistic Full-Body Video Anonymization

Link: https://arxiv.org/abs/2606.14972

Authors: Adam Sun, Eshaan Barkataki, Arnold Milstein, Gordon Wetzstein, Ehsan Adeli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Anonymizing human-centric video, Anonymizing human-centric, human-centric video data, understudied problem, Anonymizing

Abstract:Anonymizing human-centric video data is an understudied problem. Prior anonymization techniques either blur or redact pixels at the cost of realism and downstream utility, or generate frame-by-frame at the cost of temporal coherence. We introduce ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. In contrast to past approaches, which redact or edit the inputs directly, we propose a "regenerate, don't edit" paradigm. Our approach composites 2D pose, segmentation, and monocular depth into two complementary conditioning streams, StructAll and StructHuman, which are used to fine-tune a video-to-video diffusion backbone on in-the-wild human videos, synthesizing the human regions entirely from identity-free structural cues. We evaluate our model on privacy, quality, and utility, and show that ReGenHuman achieves the best tradeoff across all three axes against current baselines. We further show that our anonymized videos remain effective for downstream tasks, including video question answering.

227. 【2606.14963】Multi-Modal Attention for Automated Disaster Damage Assessment Using Remote Sensing Imagery and Deep Learning

Link: https://arxiv.org/abs/2606.14963

Authors: Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Timely and accurate, resource allocation, crucial for effective, damage, Timely

Comments: This paper has been accepted for publication in ISPRS Congress 2026 and the 47th Canadian Symposium on Remote Sensing (CSRS 2026) Annals

Abstract:Timely and accurate disaster damage assessment is crucial for effective emergency response, resource allocation, and recovery. Traditional methods, which often rely on manual inspections or sparse data, are typically slow and error-prone. This paper introduces a novel framework leveraging remote sensing imagery and deep learning to automate building damage classification. Using pre- and post-disaster satellite imagery, our model categorizes buildings into four damage levels: no damage, minor damage, major damage, and destroyed. The core innovation is a multi-modal attention mechanism that fuses bi-temporal features to explicitly detect and assess structural changes. We employ a lightweight ConvNeXT-Tiny backbone to ensure efficient processing without compromising performance. Key contributions include: (1) a cross-attention module for multi-modal data fusion, (2) an optimized preprocessing pipeline for large-scale datasets, and (3) robust data augmentation techniques. Experiments on a large-scale disaster dataset demonstrate an overall classification accuracy of 94.90%. The model effectively discriminates between damage categories and remains resilient to incomplete data. This system significantly improves assessment speed and accuracy, aiding emergency responders in prioritizing interventions. This work advances automated disaster damage detection by integrating multi-temporal imagery with deep learning, offering a scalable solution for real-time response.

228. 【2606.14958】MVEB: Massive Video Embedding Benchmark

Link: https://arxiv.org/abs/2606.14958

Authors: Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Massive Video Embedding, embeddings spanning classification, Video Embedding Benchmark, video-centric question answering, pair classification

Abstract:We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at this https URL.

229. 【2606.14957】Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Link: https://arxiv.org/abs/2606.14957

Authors: Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unique contrast weighting, multiple complementary sequences, anatomic and fluid-sensitive, MRI contrast mechanisms, brain MRI

Comments: Under Review Preprint

Abstract:Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including anatomic T1-weighted (T1w) and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging. We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems (NYU Langone, NYU Long Island, and Massachusetts General Hospital) and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

230. 【2606.14926】FlexPooling with Simple Auxiliary Classifiers in Deep Networks

Link: https://arxiv.org/abs/2606.14926

Authors: Muhammad Ali, Omar Alsuwaidi, Salman Khan (Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: feature extraction layers, neural networks consists, extraction layers, subsequent layer, multiple feature extraction

Abstract:In computer vision, the basic pipeline of most convolutional neural networks consists of multiple feature extraction layers, where the input signal is downsampled to a lower resolution in each subsequent layer. This downsampling process is commonly referred to as pooling, which is an essential operation in CNNs. Pooling improves robustness against transformations, reduces the number of trainable parameters, increases the receptive field, and lowers computation time. Since pooling is a lossy process but remains important for extracting high-level information from low-level representations, it is important to preserve the most prominent information from previous activations to improve network discriminability. Standard pooling is usually performed using dense pooling methods, such as max pooling or average pooling, or through strided convolutional kernels. In this paper, we propose a simple yet effective adaptive pooling method, called FlexPooling, which generalizes average pooling by learning a weighted average over activations jointly with the rest of the network. We further show that attaching Simple Auxiliary Classifiers (SAC) to the CNN improves performance and demonstrates the effectiveness of the proposed method compared with standard pooling methods. Experiments on multiple popular image classification datasets show that FlexPooling consistently outperforms baseline networks, achieving approximately 1 to 3 percent improvement in accuracy.
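The abstract describes FlexPooling only as a learned weighted average over activations that generalizes average pooling. One way such an operator could look is sketched below. The softmax parameterization, the 2x2 window, and sharing one weight vector across all windows are assumptions made for illustration, not details from the paper, and the training step that would learn `logits` is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool_2x2(feature_map, logits):
    """Pool non-overlapping 2x2 windows with one shared convex weighting.

    `logits` holds 4 learnable parameters (hypothetical); all-zero logits
    give weights of 0.25 each, which is exactly average pooling.
    """
    weights = softmax(logits)
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, height - 1, 2):
        row = []
        for j in range(0, width - 1, 2):
            window = [
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ]
            row.append(sum(w * a for w, a in zip(weights, window)))
        pooled.append(row)
    return pooled


fmap = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
as_average = weighted_pool_2x2(fmap, [0.0, 0.0, 0.0, 0.0])
one_position = weighted_pool_2x2(fmap, [0.0, 0.0, 0.0, 50.0])
print(as_average, one_position)
```

With zero logits the operator reproduces average pooling, and a dominant logit makes it select a single position in each window, so the learned weights can move between these extremes.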

231. 【2606.14912】Mask Proposal Voting Based on Geodesic Framework for Robust Image Segmentation

Link: https://arxiv.org/abs/2606.14912

Authors: Li Liu, Mingzhu Wang, Zhenjiang Li, Da Chen, Laurent D. Cohen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: finding accurate segmentation, accurate segmentation remains, complex intensity variations, great advances, finding accurate

Abstract:Despite great advances, finding accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations and topology appearance. Minimal path models have exhibited their strong ability in addressing image segmentation tasks. However, the performance of minimal paths-based segmentation approaches is heavily influenced by model initialization, hence limiting their application scope in practice. In this work, we propose a novel mask proposal voting framework that overcomes the major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. Firstly, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the possibility of accurately covering the objective region by these proposals. Secondly, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries under complex scenarios, and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.

232. 【2606.14905】Deep Learning in Seismic Interpretation: Federated Advances in Salt Dome Segmentation

Link: https://arxiv.org/abs/2606.14905

Authors: Muhammad Zain Mehdi, Muhammad Zaid, Owais Aleem

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reservoir modeling, high-impact task, driving decisions, hydrocarbon exploration, drilling safety

Comments: 7 pages, 8 figures

Abstract:Salt-dome delineation is a critical, high-impact task in subsurface geological interpretation, driving decisions in hydrocarbon exploration, reservoir modeling, and drilling safety. While convolutional encoder-decoder architectures have delivered significant improvements in automated salt segmentation, their widespread application is severely limited by data sovereignty concerns, dataset bias, and the scarcity of labeled seismic volumes. This paper introduces FedSaltNet, a Federated Learning (FL) framework explicitly engineered for robust, generalizable, and privacy-preserving salt-dome segmentation. We couple a lightweight Small U-Net backbone, chosen for its efficiency and regularization properties, with a novel Foreground-Weighted (FG-WEIGHTED) aggregation strategy designed to tackle domain-specific class imbalance. Through an extensive comparative study emulating non-IID conditions across four diverse seismic datasets (TGS, SEAM, F3, GBS), we demonstrate two critical findings: (1) the FG-WEIGHTED algorithm effectively mitigates data heterogeneity, yielding a 4.0% relative improvement in Intersection over Union (IoU) over the best conventional FL method; and (2) the simple U-Net architecture proved essential, outperforming the higher-capacity ResNet-18 U-Net variant by 166% in average IoU, underscoring the necessity of architectural simplicity in data-constrained federated environments. FedSaltNet provides a validated, high-performance solution that establishes the viability of federated deep learning for collaborative, next-generation subsurface interpretation.

233. 【2606.14886】Improved Knowledge Distillation for Land-Use Image Classification

Link: https://arxiv.org/abs/2606.14886

Authors: Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep convolutional neural, image classification task, convolutional neural networks, present article, deep convolutional

Comments: Accepted by IGARSS 2026

Abstract:In the present article, an improved Knowledge Distillation (KD) framework is proposed for efficient compression of deep convolutional neural networks for land-use image classification. Motivated by the need to achieve competitive classification accuracy while reducing computational complexity, a teacher-student learning paradigm is adopted in which a VGG16 network transfers knowledge to a lightweight MobileNetV2 model. The proposed framework integrates hard supervision from ground truth labels with a soft supervision strategy that combines Kullback-Leibler divergence and Cosine Similarity losses. Experiments conducted on three land-use datasets show that the proposed KD-based method achieves an accuracy of 99.04%, outperforming both baseline student training and single-loss distillation approaches, while retaining substantial model compression.
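The loss described above (hard-label supervision plus a soft term combining KL divergence and cosine similarity) can be sketched for a single example as follows. The weights `alpha`, `beta`, `gamma`, the temperature, the conventional T^2 scaling on the KL term, and the choice to compute cosine similarity on the softened probabilities are all assumptions for illustration; the abstract does not state how the paper combines the terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label,
            temperature=4.0, alpha=0.5, beta=0.3, gamma=0.2):
    """Hard cross-entropy + KL(teacher || student) + (1 - cosine similarity)."""
    # Hard supervision: cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])

    # Soft supervision: compare temperature-softened distributions.
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft))

    dot = sum(t * s for t, s in zip(t_soft, s_soft))
    norm_t = math.sqrt(sum(t * t for t in t_soft))
    norm_s = math.sqrt(sum(s * s for s in s_soft))
    cosine = dot / (norm_t * norm_s)

    return alpha * hard + beta * temperature ** 2 * kl + gamma * (1.0 - cosine)


teacher = [2.0, 0.5, -1.0]
matched = kd_loss(teacher, teacher, label=0)
mismatched = kd_loss([-1.0, 0.5, 2.0], teacher, label=0)
print(matched, mismatched)
```

When the student reproduces the teacher's logits, both soft terms vanish and only the weighted cross-entropy remains. A student that disagrees with the teacher is penalized by all three terms.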

234. 【2606.14883】Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Link: https://arxiv.org/abs/2606.14883

Authors: Salimeh Sekeh, Mary Wisell

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: previously acquired knowledge, preserve previously acquired, paradigm enables adaptation, previously learned environments, sequential fine-tuning

Abstract:Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

235. 【2606.14879】VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

Link: https://arxiv.org/abs/2606.14879

Authors: Venkata Naren Devarakonda, Raktim Gautam Goswami, Prashanth Krishnamurthy, Farshad Khorrami

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: autonomously plan tasks, Mobile agents require, Mobile agents, plan tasks, autonomously plan

Abstract:Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

236. 【2606.14871】An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

Link: https://arxiv.org/abs/2606.14871

Authors: Shayan Abrar, Sudeepta Mandal, Abdul Awal Yasir, Sonjoy Bhattacharjee, Sadman Haque Bhuiyan, Samanta Ghosh, Rafi Ahamed

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Early detection, plant diseases, Early, Plant diseases reduce, diseases

Comments: 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

Abstract:Early detection of plant diseases is crucial for both plants and farmers. Plant diseases reduce fruit yield and quality, and infected plants are more susceptible to other stresses. The lemon leaf disease dataset contains 1354 images in 9 classes: one class for healthy leaves and 8 classes for leaf diseases. After comprehensive preprocessing, the dataset was split into training (70%), testing (15%), and validation (15%) sets. Two pretrained models (InceptionV3 and MobileNetV2) were applied and then combined using an ensemble technique to boost robustness. The ensemble models showed promising performance, reaching 99.27% accuracy. Adversarial training is applied to improve the models' robustness and ensure reliable predictions under noisy data. Grad-CAM visualization highlights the important regions of leaf images, supporting the model's predictions along with their confidence levels.

237. 【2606.14841】Multi-HMR 2: Multi-Person Camera-Centric Human Detection, Mesh Recovery and Tracking

Link: https://arxiv.org/abs/2606.14841

Authors: Guénolé Fiche, Philippe Weinzaepfel, Romain Brégier, Fabien Baradel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: social scene understanding, camera coordinate system, coordinate system, scene understanding, key factors

Abstract:Most advances in human mesh recovery (HMR) have focused on pelvis-centered recovery, overlooking metric 3D localization and detection accuracy in the camera coordinate system - two key factors for real-world applications such as human-robot interaction and social scene understanding. Current evaluation protocols often ignore these aspects, emphasizing per-person, root-centered recovery rather than camera-space perception. As a result, existing approaches rely on fixed camera assumptions or handcrafted post-processing, limiting their robustness and practical deployment. We introduce Multi-HMR 2, a simple yet robust DETR-based framework for Multi-person Camera-centric Human detection, mesh Recovery, and tracking. Multi-HMR 2 predicts a scene-consistent camera together with human meshes, enabling metric 3D localization without ground-truth intrinsics. Moreover, by distilling image-based memory features from SAM2, Multi-HMR 2 extends to tracking, achieving consistent identity association without video supervision. Despite its conceptual simplicity - no handcrafted components, no video input, and no ground-truth cameras - Multi-HMR 2 achieves state-of-the-art pelvis-centered performance while substantially improving detection accuracy and metric 3D localization.

238. 【2606.14811】S23DR 2026: End-to-End 3D Wireframe Prediction via DETR-Style Set Prediction with Contrastive Denoising

Link: https://arxiv.org/abs/2606.14811

Authors: Nitiz Khanal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: COLMAP point clouds, multi-view COLMAP point, Structured Semantic, multi-view COLMAP, point clouds

Comments: Technical report; S23DR 2026 Challenge submission

Abstract:We present WireframeDETR, our submission to the Structured Semantic 3D Reconstruction (S23DR) 2026 Challenge, which requires predicting a 3D building wireframe from multi-view COLMAP point clouds. Our method applies DETR-style set prediction directly to 3D point clouds, producing wireframes as sets of edge coordinate pairs without any intermediate vertex detection stage. We introduce three technical contributions: (1) contrastive denoising training that stabilises noisy Hungarian matching in early epochs; (2) a multi-scale encoder that aggregates the last encoder layer outputs via learned scalar weights; and (3) progressive auxiliary loss weighting that concentrates gradient signal on the decoder layers that most benefit from it. Our model achieves a public test HSS of 0.575 (F1 = 0.664, IoU = 0.516) and a best validation HSS of 0.534 on the cleaned val split.

239. 【2606.14803】HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

Link: https://arxiv.org/abs/2606.14803

Authors: Shivum Telang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diabetic Retinopathy, aggressive retinal disease, global blindness, black-box nature, nature of diagnostic

Abstract:Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

240. 【2606.14795】Position: The Systemic Lack of Agency in Visual Reasoning

Link: https://arxiv.org/abs/2606.14795

Authors: Yizhao Huang, Haoyang Chen, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haoyuan Du, Yandong Shi, Zheng Wang, Zhixiang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lack of Agency, Agency constrains, implicit reasoning capabilities, implicit reasoning, Implicit Reasoning Diagnosing

Comments: Accepted by ICML 2026

Abstract:This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs. More information can be found at this https URL

241. 【2606.14792】Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

Link: https://arxiv.org/abs/2606.14792

Authors: Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: RL-based post-training, widely adopted, adopted to enable, enable interleaved visual, multimodal models capable

Abstract:RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.

242. 【2606.14787】Vision-Encoder Behavioral Fingerprints of Image-to-Image Generative Models: A Training-Paradigm-Driven Taxonomy of Six Commercial APIs

Link: https://arxiv.org/abs/2606.14787

Authors: Hunter Hill

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Qwen Image Edit, adversarial perturbation pipeline, content-adaptive sub-JND adversarial, sub-JND adversarial perturbation, Flux Kontext

Abstract:We study six production image-to-image AI systems (gpt-image-1, Gemini 2.5 Flash Image, Flux Kontext, SDXL img2img, SD3 img2img, and Qwen Image Edit) under a content-adaptive sub-JND adversarial perturbation pipeline, scoring all outputs by frozen DINOv2 ViT-B/14 token distances against clean references. Across a 3,588-call corpus spanning COCO photographs, CelebA-HQ portraits, and AI-generated inputs, the six systems partition into two image-invariant behavioral bands on a 2D (patch_mean, ssim_clean) plane: edit-trained models (Flux Kontext, Qwen Edit, Gemini) cluster in a tight band, while T2I-base models adapted at sampling time (SDXL, SD3, gpt-image-1) cluster in a drift band.

243. 【2606.14786】MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

Link: https://arxiv.org/abs/2606.14786

Authors: Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Zirui Zhu, Kanchan Sarkar, Kun Xu

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: ensure content safety, positive user experiences, sustain positive user, protect creators, moderation is critical

Abstract:Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

244. 【2606.14783】The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models

Link: https://arxiv.org/abs/2606.14783

Authors: Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: preserving semantic content, attenuating pixel-local detail, pixel-local detail required, exact text recovery, compresses image pixels

Abstract:A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count. The leakage is not limited to exported tokens: Gemma4 layer-0 key-value cache tensors are directly invertible, placing the side channel within KV caches commonly persisted by production serving stacks for decoding efficiency. The attack survives clutter, realistic document degradation, and zero-shot transfer to public document images, and it resists value-level defenses such as additive noise and quantization. Effective mitigation must therefore reduce spatial sampling, making removal of the vision encoder a first-class privacy decision in VLM deployment.

245. 【2606.14782】Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

Link: https://arxiv.org/abs/2606.14782

Authors: Tianhao Chen, Yuheng Wu, Kelu Yao, Xiaogang Xu, Xiaobin Hu, Dongman Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, Multimodal Large Language, Large Language, achieve strong vision-language, strong vision-language reasoning

Abstract:Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%.

246. 【2606.14781】Variational Deep Unfolding with Mamba-Based Nonlocal Modeling for Underwater Image Enhancement

Link: https://arxiv.org/abs/2606.14781

Authors: Daniel Torres, Julia Navarro, Catalina Sbert, Joan Duran

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Underwater imaging plays, ocean engineering, color distortion, imaging plays, plays a crucial

Abstract:Underwater imaging plays a crucial role in ocean engineering, although captured data often suffer from poor visibility and color distortion. To address these challenges, we propose a model-based deep unfolding network for underwater image enhancement that integrates variational modeling into a learnable architecture. The framework is guided by a variational formulation based on a dehazing decomposition, incorporating a multiplicative residual component to absorb remaining artifacts and a nonlocal gradient-type constraint to preserve structural details and enhance edge sharpness. We provide a theoretical analysis establishing the existence of a solution for the associated minimization problem. The proposed unfolding method incorporates Mamba layers to efficiently capture self-similarities in the scene. In addition, we introduce a proximal trajectory loss that enforces consistency between the unfolding stages and the iterations of an ideal restoration regularizer. Experimental results demonstrate that the proposed unfolding approach achieves improved visual quality and competitive quantitative performance compared with recent state-of-the-art methods. The source code will be available at this https URL.

247. 【2606.14780】YTClickbait21K: Human-Annotated Multimodal Dataset for YouTube Clickbait Detection Across Diverse Channels and Content Categories

Link: https://arxiv.org/abs/2606.14780

Authors: Md. Minhazul Islam, Md. Tanbeer Jubaer, Amith Khandakar, Shovon Sarker, Sumaiya Rahman, Md. Masum Mia, Mohamed Arselene Ayari, Hamed Noori

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: video-sharing platforms poses, information reliability, lack of large-scale, video-sharing platforms, platforms poses

Abstract:Clickbait content on video-sharing platforms poses a significant challenge to information reliability, yet progress in automated detection has been constrained by the lack of large-scale, high-quality multimodal datasets. We present YTClickbait21K, a human-annotated YouTube clickbait dataset comprising 21,238 videos collected from 40 channels across 29 countries, covering diverse content categories such as news, entertainment, education, and gaming. Each sample includes structured metadata (title, description, engagement statistics) along with associated thumbnail images, enabling comprehensive multimodal analysis. To ensure annotation quality, every video was independently labeled by three annotators using a standardized decision framework that incorporates textual, visual, and cross-modal consistency cues, with final labels determined through majority voting. The dataset exhibits substantial inter-annotator agreement (κ = 0.65), confirming reliable labeling despite the inherent subjectivity of clickbait detection. By combining scale, annotation rigor, and multimodal richness, this dataset provides a robust benchmark for developing and evaluating machine learning models, facilitating research in cross-modal semantic understanding, and advancing automated content moderation systems.
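
The annotation protocol described in this abstract (three independent labels per video, a majority vote, then an agreement statistic) can be sketched as follows. This is only an illustration, not the authors' code: the abstract does not say which agreement statistic was used, so Fleiss' kappa is an assumption here, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among one item's annotations."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for items that all have the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: how many raters assigned category j to item i
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 videos, 3 annotators each (1 = clickbait).
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
final_labels = [majority_vote(row) for row in toy]
kappa = fleiss_kappa(toy)
```

With three annotators and binary labels a majority always exists, so no tie-breaking rule is needed in this sketch. The kappa function is undefined when every rating falls in a single category, because chance agreement is then 1.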

248. 【2606.14778】FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Link: https://arxiv.org/abs/2606.14778

Authors: Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: History Action Abstract, History Action Graph, Long-term action anticipation, History Action, partially observed video

Abstract:Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

249. 【2606.14777】JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Link: https://arxiv.org/abs/2606.14777

Authors: Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: model, background model, world, background, stay silent

Abstract:Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.

250. 【2606.14773】Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

Link: https://arxiv.org/abs/2606.14773

Authors: Jinwen Wen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: present Double-Helix Vision, geometry-based visual sampler, Double-Helix Vision, signals using paired, spiral trajectories

Comments: 5 pages, 3 figures, 5 tables. Code and benchmarks: [this https URL](https://github.com/JackJ-C/double-helix-vision-tool)

Abstract:We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.
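
To make the sampling idea concrete, here is a rough sketch of two golden-angle spirals offset by 180 degrees that read a 2D image into two 1D signals. It is not the paper's implementation: the abstract does not give the spiral parametrization, so the linear radius profile (which makes samples denser near the center), the nearest-pixel lookup, and all function names are assumptions.

```python
import math

# Golden angle in radians (about 137.5 degrees).
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def helix_points(width, height, k, phase=0.0):
    """Return k (x, y) sample positions along a golden-angle spiral."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    max_r = min(cx, cy)
    points = []
    for i in range(k):
        # Radius grows linearly with the index, so density falls off as 1/r.
        r = max_r * (i + 0.5) / k
        theta = i * GOLDEN_ANGLE + phase
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


def sample_image(image, k):
    """Read an image (list of rows) along two helices offset by 180 degrees."""
    height, width = len(image), len(image[0])
    alpha = helix_points(width, height, k, 0.0)
    beta = helix_points(width, height, k, math.pi)

    def read(points):
        # Nearest-pixel lookup; positions are always inside the image.
        return [image[int(round(y))][int(round(x))] for x, y in points]

    return read(alpha), read(beta)
```

Because the second helix is the first one rotated by 180 degrees, each Beta sample is the mirror image of the matching Alpha sample through the image center.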

251. 【2606.14772】ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

Link: https://arxiv.org/abs/2606.14772

Authors: Wenhao Lu, Zhengqiu Zhu, Xiaofeng Wang, Xiaoran Zhang, Yatai Ji, Yong Zhao, Yue Hu, Yingzhen Nie, Jinlong Zhu, Zheng Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Unmanned Aerial Vehicles, requires Unmanned Aerial, Aerial Embodied Question, Embodied Question Answering, Aerial Vehicles

Abstract:Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the "waggle dance" of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48× higher average strict success rate and a 7.72× higher average QA correctness.

252. 【2606.14770】An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

Link: https://arxiv.org/abs/2606.14770

Authors: Houssam El Mir

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Pedestrian Attribute Recognition, enabling forensic search, Pedestrian Attribute, Attribute Recognition, video surveillance

Abstract:Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.
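
For readers unfamiliar with the loss being ablated, the sketch below shows the standard binary focal loss applied independently to each attribute. Whether the paper uses exactly this form is an assumption, and the function names are hypothetical; only the values alpha=0.50 and gamma=2.0 come from the abstract.

```python
import math


def focal_loss(p, y, alpha=0.5, gamma=2.0, eps=1e-12):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    if y == 1:
        # Positives: weight alpha, down-weighted when p is already high.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # Negatives: weight (1 - alpha), down-weighted when p is already low.
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def multilabel_focal_loss(probs, labels, alpha=0.5, gamma=2.0):
    """Mean focal loss over all attributes of one sample."""
    total = sum(focal_loss(p, y, alpha, gamma) for p, y in zip(probs, labels))
    return total / len(probs)
```

Setting gamma to 0 recovers alpha-weighted binary cross-entropy, which is why BCE serves as the natural baseline in this kind of ablation. Larger gamma values shrink the contribution of easy examples, such as the abundant negatives of a rare attribute.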

253. 【2606.14766】XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

Link: https://arxiv.org/abs/2606.14766

Authors: Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: clinical decision making, support clinical decision, interpret visual data, systems increasingly rely, decision making

Comments: Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)

Abstract:Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

254. 【2606.14765】Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Link: https://arxiv.org/abs/2606.14765

Authors: Qinwu Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Self-supervised video representation, video representation learning, Self-supervised video, video representation, recently advanced

Comments: 13 pages, 5 Figures, and 2 Tables

Abstract:Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content (He et al., 2022; Tong et al., 2022), while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment (Radford et al., 2021). In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.

255. 【2606.14764】Avoiding Exponential Blow-Up in Distributive Lattice Submodular Minimization

Link: https://arxiv.org/abs/2606.14764

Authors: Ishant Shanu

Categories: Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM)

Keywords: recent years, Machine Learning, gained a lot, lot of interest, interest in recent

Abstract:Submodular function minimization has gained a lot of interest in recent years. Submodular functions are highly applicable in Computer Vision and Machine Learning. Such applications often require working with submodular functions defined on a distributive lattice. The current best way of dealing with this is a transformation that extrapolates the submodular function to the corresponding Boolean lattice. This makes the optimization too inefficient because the working space is enlarged: quantitatively, the expanded space has an additional exponential (in set size) number of elements. We propose a generic framework for distributive lattices that works only within the distributive lattice. Our framework allows one to use already established submodular function minimization algorithms for the Boolean lattice. In our experiments, we show a huge improvement in running time over traditional methods for handling distributive lattices.

256. 【2606.14762】Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Link: https://arxiv.org/abs/2606.14762

Authors: Julian Abelarde, Hugo Garrido-Lestache Belinchon

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: video content continues, recorded lectures, educational platforms, live-streamed entertainment, footage has increased

Abstract:As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased [1]. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts [2,3,4,5], these approaches are often limited to coarse overviews and lack detailed analysis of a video's structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis [6,12,13]. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge [6,13]. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.

257. 【2606.14760】GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

Link: https://arxiv.org/abs/2606.14760

Authors: Yu Luo, Kun Hu, Mengwei He, Xiaogang Zhu, Shan Zeng, Allen Benter, Wei Xiang, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Remote-sensing foundation models, resolve scale mismatch, Remote-sensing foundation, foundation models, benefit from pretraining

Abstract:Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose GeoRoPE, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, Geo-Coordinate Calibration (GCC) rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, Geo-Frequency Calibration (GFC) adjusts the native RoPE frequency with a relation-specific factor, enabling position-sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.

258. 【2606.14759】Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

Link: https://arxiv.org/abs/2606.14759

Authors: Yiheng Cao, Gustavo Andrade-Miranda (SyCoIA - IMT Mines Alès), Jiatian Zhang, Guillaume Sallé, Xin Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Cine cardiac magnetic, public datasets limits, assessing cardiac function, cardiac magnetic resonance, advanced data-driven models

Abstract:Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

259. 【2606.14758】Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

Link: https://arxiv.org/abs/2606.14758

Authors: Emirhan Bilgiç, Baptiste Caramiaux, Zhi Yan, Gianni Franchi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: safety-critical applications, explanations becomes crucial, Vision-Language Models, increasingly deployed, deployed in safety-critical

Comments: 41 pages in total. 5 figures, and 2 tables in the main paper; 10 figures and 17 tables in the appendix

Abstract:As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted "cat"). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce OSP, a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: this https URL
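
The core geometric step named in this abstract, orthogonalizing a query vector against distractor concepts, can be illustrated with a plain Gram-Schmidt projection. The abstract does not spell out the full OSP procedure, so this is only a sketch of generic orthogonal projection; the function names and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(query, distractors, tol=1e-12):
    """Remove from `query` every component lying in span(distractors)."""
    basis = []  # orthonormal basis of the distractor subspace (Gram-Schmidt)
    for d in distractors:
        r = list(d)
        for e in basis:
            c = dot(r, e)
            r = [ri - c * ei for ri, ei in zip(r, e)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:  # skip distractors that add no new direction
            basis.append([ri / norm for ri in r])
    residual = list(query)
    for e in basis:
        c = dot(residual, e)
        residual = [ri - c * ei for ri, ei in zip(residual, e)]
    return residual


# Toy example: the first two axes stand in for features shared with distractors.
residual = orthogonalize([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

The residual has zero inner product with every distractor, so any score computed as a dot product with it ignores the directions the distractors span.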

260. 【2606.14757】Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

Link: https://arxiv.org/abs/2606.14757

Authors: Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: mechanism lacks explicit, Space Filling Curves, lacks explicit spatial, due to permutation, permutation equivariance

Comments: ICML 2026

Abstract:Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This becomes particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase the performance. Beyond fine-tuning, VIOLIN improves various small scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited data settings.
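
The masking idea in this abstract (scan the patch grid along several space-filling curves, build a decay mask per curve, combine the masks, and multiply with the attention matrix) can be sketched as below. This is not the VIOLIN implementation: the exponential decay form, the choice of a Hilbert curve and its transpose, the averaging rule, and every function name are assumptions made for illustration.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curves connect
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decay_mask(positions, decay):
    """mask[i][j] = decay ** |pos_i - pos_j| for 1D positions along a curve."""
    return [[decay ** abs(pi - pj) for pj in positions] for pi in positions]


def combined_mask(n, decay=0.9):
    """Average the decay masks of a Hilbert curve and its transpose."""
    cells = [(x, y) for y in range(n) for x in range(n)]  # row-major patch tokens
    curve_a = [hilbert_index(n, x, y) for x, y in cells]
    curve_b = [hilbert_index(n, y, x) for x, y in cells]
    mask_a = decay_mask(curve_a, decay)
    mask_b = decay_mask(curve_b, decay)
    size = n * n
    return [
        [0.5 * (mask_a[i][j] + mask_b[i][j]) for j in range(size)]
        for i in range(size)
    ]


def apply_mask(attention, mask):
    """Elementwise product of an attention matrix with the spatial mask."""
    return [
        [a * m for a, m in zip(a_row, m_row)]
        for a_row, m_row in zip(attention, mask)
    ]
```

Using more than one curve matters because a single curve places some spatially adjacent patches far apart in scan order; averaging over differently oriented curves softens that artifact.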

261. 【2606.14756】Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

Link: https://arxiv.org/abs/2606.14756

Authors: Abhi Gupta, Polina Barabanshchikova, Vikas Garg, Samuel Kaski, Tommi Jaakkola

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: pre-trained diffusion models, opportunity for composition, pre-trained diffusion, models, diffusion models

Comments: Accepted as spotlight at ICML 2026

Abstract:The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model.

262. 【2606.14755】Where Does Texture Evidence Live in SAM? Features, Proposal Masks, and Texture Segmentation

Link: https://arxiv.org/abs/2606.14755

Authors: Nadav Orenstein, Aviad Cohen Zada, Shai Avidan, Gal Oren

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: segmentation stresses foundation, stresses foundation segmentation, Texture segmentation stresses, object identity, segmentation stresses

Comments: 26 pages, 13 figures, 20 tables. Code available at [this https URL](https://github.com/Scientific-Computing-Lab/ArchiTexture)

Abstract:Texture segmentation stresses foundation segmentation because meaningful regions are defined by material or repeated appearance rather than object identity. Segment Anything Models (SAMs) often fail by default on such texture-defined partitions, but this failure is ambiguous: the texture evidence may be absent, missing from the proposal bank, or present but selected or assembled incorrectly by an object-centric readout. We ask what texture-relevant evidence is already preserved in frozen SAM before adaptation. We study two frozen evidence spaces: multiscale features, probed with a minimal clustering readout, and the automatic proposal bank, treated as evidence for a supervised consolidation readout. SAM is frozen throughout; we do not fine-tune the backbone or retrain the proposal generator. Across RWTD, STLD, an ADE20K-selected refined-crop complement, and a ControlNet-stitched PTD bridge archive, frozen SAM is not a texture segmenter by default, but its failures are not simple texture blindness. Coarse frozen features preserve texture organization, and proposal banks often contain texture-aligned masks or fragments. Natural scenes more often require assembly and commitment over fragments, while cleaner synthetic cases more often reduce to selecting an already coherent proposal. Default mask failure should therefore be decomposed into representation evidence, proposal-bank support, readout mismatch, and commitment failure.

263. 【2606.14754】Sub-Semantic Image Segmentation

Link: https://arxiv.org/abs/2606.14754

Authors: Aviad Cohen Zada, Nadav Orenstein, Shai Avidan, Gal Oren

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: sub-semantic image segmentation, sub-semantic image, visual cues, image segmentation, segmented based

Comments: 23 pages. Code: [this https URL](https://github.com/Scientific-Computing-Lab/TextureDetecture)

Abstract:Images can be segmented based on visual cues (i.e., texture segmentation) or into objects (i.e., semantic segmentation). We propose a new category of sub-semantic image segmentation that blurs the line between the two. In sub-semantic image segmentation, language is not used to name whole objects. Instead, it is used to partition an image into stable appearance patterns that can be described by language. To do that, we couple a general-purpose vision-language model to SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks. Simple coupling fails for a number of reasons that we identify in the paper, and we overcome them by introducing DETECTURE that resolves three concrete failure modes -- language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language-to-mask interface. Since there is no dataset of sub-semantic image segmentation, we introduce one, termed TextureADE. The new dataset is derived from the ADE20K dataset using a system we designed. We compare DETECTURE to a number of baselines and find that it achieves the strongest performance on several datasets using different metrics. Code is available at https://github.com/Scientific-Computing-Lab/TextureDetecture.

264. 【2606.14753】Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

Link: https://arxiv.org/abs/2606.14753

Authors: Chiradeep Ghosh, Dakshina Ranjan Kisku

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: semantically meaningful textual, meaningful textual descriptions, aims to generate, generate coherent, coherent and semantically

Comments: 8 pages, 8 figures

Abstract:Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. It requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. The standard self-attention mechanism in Vision Transformers is replaced with a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to linear O(nK), where K ≪ n. An autoregressive GPT-based decoder is used for caption generation. The model is evaluated on the Flickr 30K dataset, demonstrating competitive performance and significant improvement over existing works.
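
To make the O(nK) claim concrete, here is a rough sketch of clustering-based mixing. It is not the paper's model: it assumes isotropic Gaussians with a fixed shared variance and uniform mixing weights, and the function name and toy data are hypothetical. Each EM step only compares the n patches against K means, never patches against each other.

```python
import math


def soft_cluster_mix(x, centers, steps=5, var=1.0):
    """Replace each of n patch vectors by a responsibility-weighted mix of K means.

    Every step touches n * K patch-cluster pairs, never n * n patch pairs.
    """
    dim = len(x[0])
    resp = []
    for _ in range(steps):
        # E-step: soft assignment of each patch to each cluster.
        resp = []
        for xi in x:
            logits = [-sum((a - b) ** 2 for a, b in zip(xi, c)) / (2 * var)
                      for c in centers]
            top = max(logits)
            weights = [math.exp(l - top) for l in logits]
            total = sum(weights)
            resp.append([wt / total for wt in weights])
        # M-step: move each mean to the weighted average of its patches.
        updated = []
        for k, c in enumerate(centers):
            mass = sum(r[k] for r in resp)
            if mass < 1e-12:
                updated.append(list(c))  # keep an empty cluster where it is
            else:
                updated.append([sum(r[k] * xi[d] for r, xi in zip(resp, x)) / mass
                                for d in range(dim)])
        centers = updated
    # Output: each patch becomes a soft mixture of the K cluster means.
    mixed = [[sum(r[k] * centers[k][d] for k in range(len(centers)))
              for d in range(dim)] for r in resp]
    return mixed, resp, centers


patches = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]]
mixed, resp, centers = soft_cluster_mix(patches, [[1.0, 1.0], [9.0, 9.0]])
```

With two well-separated groups of patches, the two means settle near the group averages and each patch is represented by its own group's mean.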

265. 【2606.14752】X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Link: https://arxiv.org/abs/2606.14752

Authors: Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: precise continuous robot, continuous robot control, precise continuous, continuous robot, action

Comments: Project page: [this https URL](https://x-square-robot.github.io/X-Tokenizer_projectPage/)

Abstract:Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
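
The coarse-then-residual structure the abstract describes builds on residual vector quantization, which can be sketched generically as below. This shows only plain RVQ encoding and decoding. The Masked Action Modeling objective, contrastive alignment, and the encoder and decoder are not shown, and the codebooks here are hand-picked, hypothetical values rather than learned ones.

```python
def nearest(codebook, v):
    """Index of the codebook entry closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)))


def rvq_encode(v, codebooks):
    """Quantize v level by level; each level encodes what the previous ones missed."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]                # level 1: coarse intent
fine = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]]     # level 2: fine residual
action = [1.08, 0.02]
codes = rvq_encode(action, [coarse, fine])
recon = rvq_decode(codes, [coarse, fine])
```

The first code alone already says "move along the first axis"; the second only refines the magnitude. That asymmetry is what the abstract's SRQ exploits by training the first level differently from the deeper ones.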

266. 【2606.14749】Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

Link: https://arxiv.org/abs/2606.14749

Authors: Chih-Wei Huang, Chang-Wen Huang, Chung-Ping Chiang, Tsung-Wei Pan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Precision aquaculture faces, quantify instantaneous three-dimensional, high-resolution behavioral traits, Precision aquaculture, tracking high-resolution behavioral

Abstract:Precision aquaculture faces a "phenotyping bottleneck" in tracking high-resolution behavioral traits, as conventional methods cannot quantify instantaneous three-dimensional (3D) physical exertion. To address this, we present a high-throughput 3D behavioral phenotyping framework integrating deep learning object detection with binocular stereo vision for real-time monitoring of juvenile tilapia in high-density environments. The system automates non-contact body length estimation and reconstructs 3D swimming trajectories from absolute spatial coordinates. By eliminating 2D perspective distortions, this approach precisely quantifies 3D velocity and acceleration, marking the first estimation of true physical swimming speeds in free-roaming juveniles. Results show the framework successfully establishes circadian locomotor baselines, serving as an early warning system for physiological stress and providing an objective metric for fish vitality.
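
As a rough illustration of how stereo detections can become 3D speed and acceleration, the sketch below uses the standard rectified-stereo relation Z = f·B/d and finite differences. It assumes an ideal rectified camera pair; the function names, camera parameters, and sample trajectory are hypothetical and are not taken from the paper.

```python
import math


def triangulate(u_left, v, u_right, f, baseline, cx, cy):
    """Rectified stereo: pixel coordinates -> (X, Y, Z) in the baseline's units.

    f is the focal length in pixels; (cx, cy) is the principal point.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = f * baseline / disparity
    return ((u_left - cx) * z / f, (v - cy) * z / f, z)


def speeds(points, dt):
    """3D speed between consecutive trajectory points sampled every dt seconds."""
    return [math.dist(p, q) / dt for p, q in zip(points, points[1:])]


def accelerations(speed_series, dt):
    """Rate of change of scalar speed (not the full acceleration vector)."""
    return [(b - a) / dt for a, b in zip(speed_series, speed_series[1:])]


point = triangulate(420.0, 240.0, 320.0, f=1000.0, baseline=0.1, cx=320.0, cy=240.0)
track = [(0.0, 0.0, 1.0), (0.03, 0.04, 1.0), (0.09, 0.12, 1.0)]
v = speeds(track, dt=0.1)
a = accelerations(v, dt=0.1)
```

Because speed is measured on metric 3D coordinates, a fish swimming toward or away from the camera is not under-counted the way it would be in a 2D image-plane estimate.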

267. 【2606.14748】Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

Link: https://arxiv.org/abs/2606.14748

Authors: Daniel DeAlcala, Gonzalo Mancera, Julian Fierrez, Aythami Morales, Ruben Tolosana, Ruben Vera-Rodriguez

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Membership Inference Test, Inference Test, Membership Inference, present the Membership, learning training processes

Comments: IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

Abstract:We present the Membership Inference Test (MINT) Demo 2, a framework designed to improve transparency in machine learning training processes. MINT is a technique for experimentally determining whether specific data were used during machine learning model training. We establish the theoretical framework and propose multiple architectures for MINT depending on the amount of information known about the models that are being audited. Experimental results using a popular face recognition model, 4 state-of-the-art LLMs, and multiple, diverse, and large-scale public image and text databases achieve promising accuracy levels in the detection of training data of up to 90%. Building on these results, we introduce a comprehensive web platform that expands these capabilities to image and text modalities. The platform integrates a diverse technological stack, including MINT, aMINT, and gMINT, allowing users to audit a wide range of models. This demonstrator aims to promote AI transparency and provides a practical tool to foster compliance with emerging AI regulations.

268. 【2606.14747】MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

Link: https://arxiv.org/abs/2606.14747

Authors: Haitian Wang, Ruoxi Sun, Quantong Qiu, Juntao Li, Junhui Li, Hua Chen, Jinxiong Chang, Min Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Embedding Models, Recent advancements, Multimodal Embedding, theoretical context windows, advancements have significantly

Abstract:Recent advancements have significantly expanded the theoretical context windows of Multimodal Embedding Models (MEMs). However, larger context windows do not necessarily translate into effective comprehension and representation of long-context multimodal inputs, which remains a critical bottleneck for real-world deployment. To address the lack of systematic evaluation in this setting, we introduce MMLongEmbed, the first comprehensive benchmark for evaluating MEMs in long-context scenarios. MMLongEmbed comprises four retrieval tasks spanning multiple context-length ranges, covering text, document, and video modalities. Through extensive evaluation of state-of-the-art models, we find that current architectures rely heavily on superficial feature matching and struggle to capture deep semantic and structural dependencies. We further observe that performance degradation varies systematically with context length and key information placement. Moreover, models exhibit substantially different robustness to redundant contextual information across modalities. For reproducibility, the benchmark and code are publicly available.

269. 【2606.14746】Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

Link: https://arxiv.org/abs/2606.14746

Authors: Shiwen Zhang, Haoyuan Wang, Xianghao Zang, Haibin Huang, Chi Zhang, Xuelong Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, Content-Preserving Style transfer, challenging for Diffusion, Random Memory Rehearsal, Curriculum Continual Learning

Comments: code and models of QwenStyle are released at [this https URL](https://github.com/witcherofresearch/Qwen-Image-Style-Transfer/) and [this https URL](https://github.com/Tele-AI/TeleStyle/)

Abstract:Content-Preserving Style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to entangled content and style features. With a reverse triplet synthesis pipeline to build a million-scale training set and a dual-branch Style-Content DiT (SC-DiT) that decouples style and content via separate ROPE embeddings and causal masking, we observe that such a one-stage training paradigm on mixed style categories causes semantic styles to dominate, hindering texture style learning, and harming content preservation. To address these issues, we propose Style-CCL, a Multi-Stage Curriculum Continual Learning framework that trains SC-DiT from semantic (easy) to texture (hard) styles, and from clean to synthetic data, with Random Memory Rehearsal across stages to avoid catastrophic forgetting. Extensive experiments demonstrate that our Style-CCL achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.

270. 【2606.14741】HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

Link: https://arxiv.org/abs/2606.14741

Authors: Armel Yara

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Dynamic Attention, Attention for Emergency, introduce HorusEye, Emergency Visual Analysis, language feedback

Comments: 18 pages, 9 figures, 11 tables

Abstract:We introduce HorusEye, Language as Dynamic Attention for Emergency Visual Analysis. Our investigation followed five stages. The first is benchmarking RefCOCO-Degraded, a dataset of 15,244 images (3,811 base images × 4 conditions: Clean, Fog, Smoke, and Thermal) with systematic visual degradation. Through four research questions, we then evaluate multiple VLMs (Gemini, Qwen2-VL, BLIP-2, LLaVA, Kosmos-2) across visual grounding (the second stage), language feedback recovery (the third), health VQA tasks (the fourth), and hallucination analysis (the final stage). Our key finding is that language feedback effectiveness is model-dependent: Gemini achieves +47.3% improvement in thermal conditions through iterative language feedback, while Qwen2-VL shows -5.1% degradation under the same protocol. We also identify the 'Thermal Paradox', where cropping strategies that improve RGB performance catastrophically fail in thermal imagery. Furthermore, BLIP-2 uniquely hallucinates more under degradation, making it unsuitable for emergency deployment.

271. 【2606.14740】GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

Link: https://arxiv.org/abs/2606.14740

Authors: Sujay Belsare, Sudarshan Nikhil, Sushant Kumar, Ponnurangam Kumaraguru, Chirag Agarwal

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: relevant stakeholders, increasing development, development of Vision-Language, predictions are readily, readily explainable

Comments: 23 pages, 15 Figures, Accepted for poster presentation at CVPR 2026 TRUE-V Workshop

Abstract:With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations to attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models acting as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We utilize this controlled environment to train paired ground-truth models on identical architectures: $M_{\text{pure}}$, which learns robust spatial-relational reasoning and $M_{\text{spur}}$, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts, highlighting a critical gap in capturing true cross-modal synergy and misrepresenting how multimodal models actually make decisions.

272. 【2606.14735】UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

Link: https://arxiv.org/abs/2606.14735

Authors: Romiyal George, Sathiyamohan Nishankar, Selvarajah Thuseethan, Roshan G. Ragel

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong representation, strong representation capability, image classification, ultra-tiny Vision Transformer, Vision Transformer architecture

Comments: 13 pages, 7 figures

Abstract:Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at this https URL

273. 【2606.14732】Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Link: https://arxiv.org/abs/2606.14732

Authors: Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: static scene layouts, scene layouts drift, diffusion models enable, models enable streaming, video diffusion models

Comments: Project page: [this https URL](https://minar09.github.io/steadyforcing/)

Abstract:Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts and reward drift-induced optical flow as Dynamic Degree, while not directly penalizing texture hardening or flow stagnation, motivating future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/

274. 【2606.14731】BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

Link: https://arxiv.org/abs/2606.14731

Authors: Zahid Ullah, Sieun Choi, Jihie Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: segmentation remains challenging, explicitly modeling anatomical, image segmentation remains, remains challenging, explicitly modeling

Abstract:Continual learning for medical image segmentation remains challenging under domain shift because replay-based methods often preserve appearance information without explicitly modeling anatomical structure. This study investigates whether structural consistency governs knowledge retention in continual cardiac ultrasound segmentation. We propose the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples using boundary-aware priority and class balance to preserve anatomically informative regions. The method is evaluated on CAMUS and CardiacNet under forward (CAMUS to CardiacNet) and reverse (CardiacNet to CAMUS) task orders. In the forward setting, BBR-Net retains source-task performance close to an offline joint-training reference, while markedly reducing catastrophic forgetting and preserving competitive target-task adaptation. Ablation results show that boundary-aware prioritization contributes to retention and improves the balance between source-task preservation and target-task adaptation when combined with class-aware sampling. In contrast, the reverse setting reveals that structure-aware replay fails when initial representations are learned from noisy and structurally inconsistent data. To isolate this effect, we conduct a controlled structural perturbation analysis by progressively corrupting source-task boundaries while keeping the dataset, architecture, and training protocol fixed. Forgetting increases consistently as structural reliability decreases, suggesting that replay effectiveness is strongly influenced by the quality of stored structural information, rather than by memory capacity alone. These findings indicate that preserving anatomical structure under domain shift is a central factor in continual medical image segmentation, and that replay mechanisms should account for structural reliability to support robust knowledge retention.

275. 【2606.14730】Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation

Link: https://arxiv.org/abs/2606.14730

Authors: Parthsarthi Rawat

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: football broadcast video, broadcast video, ball action anticipation, present a hierarchical, hierarchical model

Comments: CVPR 2026 SoccerNet Ball Action Anticipation Challenge, Validated Rank 4

Abstract:We present a hierarchical model for ball action anticipation in football broadcast video. Given a 30-second observation window, the system predicts actions occurring in the subsequent 5-second window across 10 classes. A shared local Transformer encodes clip-level features within each 5-second sub-window; a GRU then aggregates temporal context across all sub-windows; finally, a Transformer decoder with K input-conditioned event slots decodes the anticipation target via three decoupled heads (objectness, class, temporal offset). We introduce frequency-reweighted Hungarian matching that systematically favours rare action classes, and Gaussian soft targets for temporal bin supervision. On the SoccerNet Ball Action Anticipation benchmark, our method achieves 17.91% mAP on the test server.
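
The "Gaussian soft targets for temporal bin supervision" mentioned above can be illustrated with a small sketch. The abstract does not give the bin layout or width, so this assumes the 5-second anticipation window is split into equal bins and that the target is a normalized Gaussian centred on the true offset; the function name and parameter values are hypothetical.

```python
import math


def gaussian_soft_target(true_offset, num_bins, bin_width, sigma):
    """Spread the label for a temporal offset over neighbouring bins.

    Returns a distribution over bins that peaks at the bin containing
    true_offset and decays with distance from it.
    """
    centers = [(b + 0.5) * bin_width for b in range(num_bins)]
    weights = [math.exp(-((c - true_offset) ** 2) / (2 * sigma ** 2))
               for c in centers]
    total = sum(weights)
    return [wt / total for wt in weights]


# A 5-second window split into 10 bins of 0.5 s; the event happens at 2.25 s.
target = gaussian_soft_target(true_offset=2.25, num_bins=10, bin_width=0.5, sigma=0.5)
peak_bin = max(range(len(target)), key=lambda b: target[b])
```

Compared with a one-hot target, a prediction that lands one bin away from the true offset is penalized less than one that lands several bins away, which tends to suit noisy temporal annotations.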

276. 【2606.14728】FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

Link: https://arxiv.org/abs/2606.14728

Authors: Harry Zhang, Luca Carlone

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly important role, multiple domains, playing an increasingly, increasingly important, important role

Abstract:Vision-language models (VLMs) are playing an increasingly important role across multiple domains. In many applications, such as robotics, it is crucial to quantify the uncertainty in the output of these models. We develop FUSE, a probabilistic framework for capturing two complementary sources of uncertainty in vision-language modeling: (i) aleatoric embedding-level uncertainty derived from input data vision-language ambiguity, and (ii) epistemic model-level uncertainty estimated from the semantic response diversity of VLMs. Our approach formulates a Bayesian fusion mechanism that analytically combines these uncertainty sources to produce a scalar measure of uncertainty. This measure can be used to reliably predict the model's output correctness for downstream applications. We demonstrate that our method outperforms baselines and achieves SOTA uncertainty calibration.

277. 【2606.14727】FairGen: Preference-Aligned Diffusion for Demographically Equitable Medical Image Synthesis

Link: https://arxiv.org/abs/2606.14727

Authors: Zhimin Li, Ruichen Zhang, Zhen Tan, Howard J Aizenstein, Jingtong Hu, Tianlong Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: support image-based analysis, artificial intelligence, systems are increasingly, improving efficiency, imaging is central

Comments: Accepted for publication in npj Digital Medicine. 20 pages, 6 figures

Abstract:Medical imaging is central to modern diagnostics, and artificial intelligence (AI) systems are increasingly used to support image-based analysis by improving efficiency, accuracy, and access to care. However, inequities in healthcare access and differential disease prevalence create severe demographic imbalances in clinical image data. Such imbalances are compounded by the fact that diseases can manifest with distinct features across demographic groups, rendering certain phenotypic presentations naturally rare. AI models trained on such imbalanced data risk perpetuating diagnostic bias and widening healthcare disparities. Here we introduce FairGen, a fairness-aware diffusion framework that synthesizes demographically balanced medical images while preserving pathology-relevant visual features. By embedding physician-aligned preferences into the generation process, FairGen improves subgroup coverage during synthesis and downstream classification. Applied to dermatology, radiology, and neuroimaging benchmark tasks, FairGen achieves fairness improvements of 95.9% for skin images, 80.0% for chest radiography, and 35.2% for brain MRI, while maintaining competitive diagnostic accuracy relative to models trained on original clinical data. Clinician-facing expert review and external validation on independent cohorts further support that these gains extend beyond standard fidelity metrics and are not confined to the original in-distribution datasets.

278. 【2606.14725】Interpolation between Convolution and Attention via K-Nearest Neighbors

Link: https://arxiv.org/abs/2606.14725

Authors: Mingi Kang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, reshaped computer vision, fundamentally distinct

Comments: Undergraduate Thesis in Computer Science at Bowdoin College

Abstract:The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.
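
The "neighbor selection plus weighted aggregation" view can be sketched on a 1-D token sequence. This is a simplified illustration rather than the ConvNN implementation: the spatial mode uses uniform weights (a box filter, where a real convolution would use learned per-offset kernel weights), the similarity mode uses identity query, key, and value projections, and all names are hypothetical.

```python
import math


def knn_aggregate(x, k, score, weights):
    """For each token, pick the k best-scoring neighbours and mix their features."""
    out = []
    for i in range(len(x)):
        nbrs = sorted(range(len(x)), key=lambda j: -score(x, i, j))[:k]
        w = weights(x, i, nbrs)
        out.append([sum(wj * x[j][d] for wj, j in zip(w, nbrs))
                    for d in range(len(x[0]))])
    return out


def spatial_score(x, i, j):
    """Convolution-like selection: closer positions score higher."""
    return -abs(i - j)


def uniform_weights(x, i, nbrs):
    return [1.0 / len(nbrs)] * len(nbrs)


def dot_score(x, i, j):
    """Attention-like selection: scaled dot-product similarity of features."""
    return sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(len(x[i]))


def softmax_weights(x, i, nbrs):
    scores = [dot_score(x, i, j) for j in nbrs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# k = 3 with spatial selection behaves like a width-3 local filter.
local_out = knn_aggregate(tokens, 3, spatial_score, uniform_weights)
# k = n with similarity selection reduces to dense softmax attention.
global_out = knn_aggregate(tokens, len(tokens), dot_score, softmax_weights)
```

Choosing k between these extremes, or swapping the score function, moves along the local-to-global spectrum the abstract describes; similarity selection with k < n gives a top-k sparse attention variant.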

279. 【2606.14724】VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Link: https://arxiv.org/abs/2606.14724

Authors: Xinze Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Video anomaly detection, stronger feature extractors, balance detection accuracy, existing methods address, real-time throughput

Abstract:Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
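
The dilated causal convolutions used by the classifier can be illustrated with a generic 1-D sketch. This shows only the standard operation on a scalar sequence; the kernel values, dilation schedule, and function names are hypothetical and not taken from the paper.

```python
def causal_dilated_conv(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0.

    Only the current and past inputs contribute, so the output at time t
    can be computed as soon as snippet t arrives.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(kernel):
            src = t - k * dilation
            if src >= 0:
                acc += wk * x[src]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps that one output of the stacked layers can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


snippet_features = [1.0, 2.0, 3.0, 4.0, 5.0]
layer1 = causal_dilated_conv(snippet_features, [1.0, 1.0], dilation=1)
layer2 = causal_dilated_conv(layer1, [1.0, 1.0], dilation=2)
```

Stacking layers with growing dilation widens the temporal context quickly while keeping each layer's kernel small; here two size-2 layers already cover four consecutive snippets.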

280. 【2606.14723】Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

Link: https://arxiv.org/abs/2606.14723

Authors: Durga Sandeep Saluru

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: study multiple-choice video, video question answering, multiple-choice video question, cross-shot spatial layout, frontier video LLM

Abstract:We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.
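
The routing rule described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: `primary` and `secondary` are hypothetical callables standing in for the two models, and returning the first sample when all samples agree is an assumed tie-free convention.

```python
from typing import Callable

Answer = str
Model = Callable[[str], Answer]  # hypothetical: maps a question to a choice label


def route(question: str, primary: Model, secondary: Model, n_samples: int = 3) -> Answer:
    """Keep the primary model's answer unless its repeated samples disagree."""
    samples = [primary(question) for _ in range(n_samples)]
    if len(set(samples)) == 1:
        # Unanimous samples: trust the primary model.
        return samples[0]
    # Disagreement marks a hard question: defer to the second model family.
    return secondary(question)
```

No labels are needed at any point: the only signal is whether the primary model's own samples agree, so the second model is called only on the disagreeing subset.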

281. 【2606.14721】DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

Link: https://arxiv.org/abs/2606.14721

Authors: Hequan Wang,Jiaxu Zhang,Zhengbo Zhang,Zhigang Tu

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: requires synthesizing physically, synthesizing physically realistic, long-horizon textual instructions, physically realistic dynamics, generation requires synthesizing

Abstract:Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To address this, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion improves the ability to follow complex instructions. By balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both the HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and the best R-precision for text alignment.

282. 【2606.14720】AI for Maritime Security: Comparative Evaluation of CNN and Vision Transformer Architectures for Maritime Object Detection

Link: https://arxiv.org/abs/2606.14720

Authors: Ismet Gocer,Zakirul Bhuiayn,Shakeel Ahmad,Raza Hasan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced Artificial Intelligence, Artificial Intelligence, advanced Artificial, enhance maritime security, study aims

Comments: 24 pages

Abstract:This study aims to enhance maritime security by using advanced Artificial Intelligence (AI) and Computer Vision (CV) techniques. For this purpose, intelligent object detection systems were designed and assessed that can detect the presence of ships on the sea surface in different real-time environments. To achieve this goal, a maritime image dataset with 6,468 images was used, covering cloudy, foggy, rainy, and sunny weather conditions. Six deep learning architectures were evaluated, including a base Convolutional Neural Network (CNN) model, four transfer learning models (Xception, VGG16, MobileNetV2, and EfficientNetV2L), and a Vision Transformer (ViT) model. The models were compared using multiple performance indicators, including accuracy, Type I and Type II errors, model size, and video processing time. The results show that model performance varies depending on computational constraints and deployment conditions. While lightweight architectures are suitable for resource-limited devices, the ViT achieved the best overall performance, reaching 100% accuracy with the lowest error rates and the fastest video processing time. The findings highlight the potential of AI-driven computer vision systems for maritime surveillance, border protection, and autonomous navigation.

283. 【2606.14716】RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

Link: https://arxiv.org/abs/2606.14716

Authors: Kushal Khemani,Evan Leri,George Xu,Amit Hod

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Edge object detection, embedded hardware requires, hardware requires balancing, changing resource pressure, Edge object

Abstract:Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.
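
A detection-conditioned switching rule of the kind this abstract describes might look like the sketch below. Everything beyond the three tier names is an assumption made for illustration: the pressure thresholds `low` and `high`, the `hold_frames` window, and the `floor` tier are hypothetical parameters, and the paper's five policies and controller equations are not reproduced here.

```python
from typing import Optional

TIERS = ("NANO", "SMALL", "MEDIUM")  # tier names from the abstract, cheapest first


def select_tier(pressure: float, low: float, high: float,
                frames_since_vru: Optional[int],
                hold_frames: int = 30, floor: str = "SMALL") -> str:
    """Pick a detector tier from device pressure, with a VRU-conditioned floor.

    Hypothetical policy: thresholds and hold window are illustrative values.
    """
    # Threshold-only choice: higher pressure selects a cheaper tier.
    if pressure >= high:
        tier = "NANO"
    elif pressure >= low:
        tier = "SMALL"
    else:
        tier = "MEDIUM"
    # Detection-conditioned override: after a recent vulnerable-road-user
    # detection, refuse to downgrade below the floor tier.
    recent_vru = frames_since_vru is not None and frames_since_vru <= hold_frames
    if recent_vru and TIERS.index(tier) < TIERS.index(floor):
        tier = floor
    return tier
```

The override can only act after a detection has already happened, which matches the abstract's caveat that reactive overrides are bounded by the baseline detector's recall.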

284. 【2603.04592】From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

Link: https://arxiv.org/abs/2603.04592

Authors: Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Standard Large Language, Large Language Models, Standard Large, Language Models, Large Language

Comments: Accepted by ACL 2026 Findings

Abstract:Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at this https URL.

285. 【2606.16261】Wavelength-Multiplexed 2D Beam Steering via a Passive Diffractive Network

Link: https://arxiv.org/abs/2606.16261

Authors: Che-Yung Shen,Yuhang Li,Cagatay Isil,Tianyi Gan,Mona Jarrahi,Aydogan Ozcan

Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph)

Keywords: high-dimensional control parameter, beam steering, parameter for arbitrarily, transforms illumination wavelength, beam

Comments: 20 pages, 4 figures

Abstract:We introduce a wavelength-addressable diffractive optical network that transforms illumination wavelength into a high-dimensional control parameter for arbitrarily programmable 2D beam steering. The proposed passive architecture comprises cascaded spatially optimized diffractive layers, jointly designed using deep learning, to rapidly map distinct wavelengths to predefined/desired output angles. Unlike conventional single-layer dispersive optical elements, which are physically restricted to 1D linear mapping, this framework harnesses complex wavefront transformations to utilize the illumination wavelength as an intrinsic addressing key for arbitrary 2D beam steering, eliminating the need for mechanical scanning or electronic phase control. We numerically demonstrate wavelength-controlled beam steering across 625 wavelength channels spanning 400-750 nm, realizing a 25 x 25 array of independently addressable beam positions with subwavelength positioning accuracy and high channel fidelity. Unlike conventional gratings, which constrain wavelength routing to a linear trajectory, the proposed diffractive network performs nonlocal wavefront transformations, enabling arbitrary wavelength-to-angle mappings across a 2D field of view. We further validate the proposed framework experimentally in both the terahertz and visible spectral regimes, demonstrating wavelength-multiplexed beam steering using 3D fabricated passive diffractive layers at terahertz frequencies and phase-only spatial light modulators in the visible spectrum. This wavelength-addressable diffractive architecture establishes a compact and scalable paradigm for high-speed programmable beam steering, with potential applications in optical communications, routing, imaging, sensing, and emerging photonic information-processing systems.

286. 【2606.16107】Variable-Rate Deep Image Compression based on Low-Rank Adaptation by Progressive Learning

Link: https://arxiv.org/abs/2606.16107

Authors: Xing-Yu Xu,Chen-Hsiu Huang,Ja-Ling Wu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: including web media, high-resolution medical imaging, enabling efficient data, streaming services, connected vehicle networks

Abstract:In the digital age, image compression is crucial for numerous applications, including web media, streaming services, high-resolution medical imaging, and connected vehicle networks, enabling efficient data storage and transmission. With the increasing demand for high-quality image communication, the need for advanced compression techniques becomes increasingly critical. Numerous Deep Image Compression (DIC) techniques have recently been introduced, showing impressive performance compared to traditional standards. However, variable-rate image compression remains an unresolved issue. Some DIC methods deploy multiple networks to attain different compression rates, whereas others use a single model, which often results in higher computational complexity and reduced performance. This work proposes a progressive learning approach for variable-rate image compression based on the parameter-efficient fine-tuning method Low-Rank Adaptation (LoRA). We introduce an additional LoRA Rate-Adaptive Module (LoRAM) in DIC methods. Due to the re-parameterized merging of LoRA, our proposed method does not introduce additional computational complexity during inference. Compared to methods utilizing multiple models, comprehensive experiments demonstrate that our approach achieves competitive performance, saving 99% in parameter storage, 90% in datasets, and 97% in training steps.
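
The claim that LoRA adds no inference cost rests on re-parameterized merging: the low-rank update B·A can be folded into the frozen weight W once, after which a single matrix product reproduces the two-branch output. The sketch below shows generic LoRA merging on a toy linear layer, not the paper's LoRAM module; the helper names and the `scale` parameter are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product m @ v."""
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]


def merge_lora(W: Matrix, B: Matrix, A: Matrix, scale: float = 1.0) -> Matrix:
    """Fold the low-rank update into the base weight: W' = W + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


def forward_unmerged(W: Matrix, B: Matrix, A: Matrix, x: Vector,
                     scale: float = 1.0) -> Vector:
    """Two-branch forward pass used during training: W x + scale * B (A x)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy rank-1 adapter on a 2x2 layer.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]
A = [[1.0, -1.0]]
x = [2.0, 1.0]
merged_out = matvec(merge_lora(W, B, A), x)
```

After merging, the adapter matrices are no longer needed at inference time, so the layer has the same shape and cost as the original one.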

287. 【2606.15352】Chroma-gated, differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

Link: https://arxiv.org/abs/2606.15352

Authors: Naoyuki Uchida

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Ottosson Oklab color, Oklab color space, interpolation space recommended, Ottosson Oklab, space recommended

Comments: 14 pages, 5 figures. Ancillary files: reproducibility scripts (symbolic verification, evaluation, and figure generation)

Abstract:OKLCH -- the cylindrical (lightness, chroma, hue) form of Ottosson's Oklab color space -- is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued -- a threshold switch that fires only at an achromatic endpoint -- so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+\sigma^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $\sigma\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate's properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method's limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable -- older engines, image and video pipelines, or GPU shaders.
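
The gate $w(C)=C^n/(C^n+\sigma^n)$ and the blend it controls can be written down directly. The sketch below is a reading of the abstract, not the author's reference scripts: the function names are hypothetical, the blend is assumed to be a per-point Cartesian mix in the Oklab (a, b) plane, and which chroma value C the gate is evaluated at along the path is left to the caller because the abstract does not specify it.

```python
from typing import Tuple

Point = Tuple[float, float]  # Oklab (a, b) coordinates


def chroma_gate(C: float, sigma: float = 0.19, n: float = 1.0) -> float:
    """Gate w(C) = C^n / (C^n + sigma^n); defaults follow the abstract."""
    cn = C ** n
    return cn / (cn + sigma ** n)


def blend_ab(ab_oklch_path: Point, ab_oklab_path: Point, C: float,
             sigma: float = 0.19, n: float = 1.0) -> Point:
    """Mix one point of the OKLCH path with the matching linear Oklab point.

    High chroma keeps the OKLCH point (w near 1); low chroma falls back to
    the straight Oklab point (w near 0). Assumed form, for illustration.
    """
    w = chroma_gate(C, sigma, n)
    a = w * ab_oklch_path[0] + (1.0 - w) * ab_oklab_path[0]
    b = w * ab_oklch_path[1] + (1.0 - w) * ab_oklab_path[1]
    return (a, b)
```

With n = 1 the gate equals one half exactly when C = sigma, so sigma acts as the chroma at which the two paths contribute equally.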

288. 【2606.15000】Polyp-D2ATL: Deep Domain-Adaptive Transfer Learning for Colorectal Polyp Classification under Label Distribution Shift

Link: https://arxiv.org/abs/2606.15000

Authors: Sajad Jabarzadeh Ghandilu,Maryam Sadat Hosseini Azad,Shahriar Baradaran Shokouhi,Emad Fatemizadeh

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly accurate prediction, Early and highly, colorectal polyp classification, types of cancer, saving more lives

Comments: 15 pages, 5 figures, 7 tables

Abstract:Early and highly accurate prediction of colorectal polyps, an important sign of one of the most dangerous types of cancer, can save lives. Despite advances in colorectal polyp classification, many challenges remain in building an automated polyp prediction system that can diagnose difficult-to-predict polyps with varied features in real scenarios while handling imbalanced data, label distribution shift, and cross-modality generalization. In this study, we propose Polyp-D2ATL, a novel framework with a dedicated training strategy that mitigates these limitations and effectively predicts the polyp classes of the NICE classification. Our extensive experiments on the PICCOLO validation and test sets demonstrate that Polyp-D2ATL significantly outperforms existing state-of-the-art models across various metrics, achieving an accuracy of 82.38%, a Macro-F1 of 77.49%, and a specificity of 87.47% on the validation set, alongside consistent improvements on the held-out test set, which demonstrates the generalization capacity and clinical applicability of the proposed approach.

289. 【2606.14828】Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

Link: https://arxiv.org/abs/2606.14828

Authors: Junyong Cao,Hakim Baazaoui,Chinmay Prabhakar,Suprosanna Shit,Lukas Bastian Otto,Susanne Wegener,Bjoern Menze,Ezequiel de la Rosa

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: acute ischemic stroke, important prognostic factor, Leptomeningeal collaterals, ischemic stroke, important prognostic

Abstract:Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.

290. 【2606.14808】Explainable Task-Oriented Token Communication for AI-Native 6G Networks

Link: https://arxiv.org/abs/2606.14808

Authors: Feibo Jiang,Lei Mao,Li Dong,Kezhi Wang,Cunhua Pan,Jiangzhou Wang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Keywords: Foundation Models, Task Tokens, integration of Foundation, Visual Tokens, Task-Oriented Token Communication

Abstract:The integration of Foundation Models (FMs) and wireless communications is driving the evolution of image communication from bit-accurate transmission toward task-oriented transmission. However, existing task-oriented image communication methods still face three major challenges: insufficient task-oriented Token representation, inadequate collaboration between Visual Tokens and Task Tokens, and limited interpretability of task decisions. To address these challenges, we propose an Explainable Task-Oriented Token Communication (ET-TokenCom) framework. By treating Tokens as unified units for information representation and transmission, the proposed framework constructs an end-to-end communication link that spans visual perception, wireless transmission, and task reasoning. At the transmitter, the ET-TokenCom framework extracts Visual Tokens from images to preserve low-level visual information. Meanwhile, Task Tokens generated by the FM are introduced to represent the target information and decision intent required by the current task. A Cross-Modal Attention (CMA) fusion mechanism is further designed, enabling Task Tokens to explicitly guide the selection, weighting, and transmission of Visual Tokens. At the receiver, the framework integrates Token decoding with an explainable output mechanism, where attention heatmaps are generated to highlight critical perceptual regions under different task objectives and reveal the influence of Task Tokens on the outputs. Finally, simulation results validate the effectiveness and robustness of the proposed ET-TokenCom framework.

291. 【2606.14750】Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Link: https://arxiv.org/abs/2606.14750

Authors: Adarsh Arigala,Arjun Gangwar,S Umesh,Yova Kementchedjhieva

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Recent advances, exploit visual cues, pixel-based text modeling, images enables models, language understanding

Comments: 5 pages, 4 figures, 4 tables

Abstract:Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show Pixel-TTS achieves competitive performance with strong baselines, faster convergence and robust zero-shot generalization.