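This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.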

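Statistics

741 papers were updated today, including:

Natural language processing: 89
Information retrieval: 11
Computer vision: 169

Natural Language Processing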

1. [2606.32038] Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

Link: https://arxiv.org/abs/2606.32038

Authors: Zifan Carl Guo, Laura Ruis, Jacob Andreas, Belinda Z. Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: predictions yield faithful, superficial imitation, predictions yield, yield faithful introspection, training language models

Comments: 32 pages, 19 figures

Abstract:When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models' counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different families, frequently produce explanations more faithful to their own current behaviors than to those of their training targets. This "introspective" coupling between LM explanations and behaviors occurs when training explanations remain sufficiently correlated with current behaviors over the course of training, even as behaviors themselves shift. We also show that introspective coupling tracks behavior shifts: when explanation training is provided concurrently with other post-training objectives, explanations track those shifts without requiring updated supervision. This phenomenon appears in multiple tasks, including sycophancy and refusal, and is robust to label noise. Overall, our results show that even fixed datasets of counterfactual explanations can provide scalable and generalizable post-training signal for introspection.

2. [2606.32034] QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Link: https://arxiv.org/abs/2606.32034

Authors: Sergio Hernández-Gutiérrez, Matteo Merler, Ilze Amanda Auzina, Joschka Strüber, Ameya Prabhu, Matthias Bethge

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM agents increasingly, agents increasingly act, LLM agents, Dense supervision, Dense supervision methods

Comments: 10 pages, 5 figures in main text; 48 pages, 6 figures with appendix

Abstract:LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the goodness of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and renders different methodological families requiring distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference-policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.
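The abstract describes Q-alignment as checking whether a method's scores order actions the way a reference policy's Q-values do, but it does not give the exact metric. As a rough illustration only, the sketch below uses pairwise order agreement, which is an assumption rather than QVal's actual measure. The function name `pairwise_q_alignment` and all numbers are hypothetical.

```python
from itertools import combinations


def pairwise_q_alignment(scores, q_values):
    """Fraction of action pairs that `scores` orders the same way as `q_values`.

    Pairs whose reference Q-values are tied are skipped. A tie in the scores
    on an untied pair counts as a disagreement.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(q_values)), 2):
        q_diff = q_values[i] - q_values[j]
        if q_diff == 0:
            continue  # the reference has no preference for this pair
        total += 1
        s_diff = scores[i] - scores[j]
        if s_diff * q_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical values for three candidate actions in a single state.
reference_q = [0.9, 0.2, 0.5]      # Q-values from an assumed reference policy
method_scores = [2.0, -1.0, 0.3]   # scores from an assumed dense-supervision method
alignment = pairwise_q_alignment(method_scores, reference_q)
```

Because a check like this needs only scores and reference Q-values, it can run before any training, which is the property the abstract emphasizes.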

3. [2606.32032] Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Link: https://arxiv.org/abs/2606.32032

Authors: Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: cognitive processes, critical component, component of intelligence, intelligence that describes, monitor and regulate

Comments: Code: [this https URL](https://github.com/yale-nlp/RLMF)

Abstract:Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty--undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.

4. [2606.32029] When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

Link: https://arxiv.org/abs/2606.32029

Authors: Yuqing Yang, Qi Zhu, Zhen Han, Boran Han, Zhengyuan Shen, Shuai Wang, Vassilis N. Ioannidis, Huzefa Rangwala

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: table structure, incorrectly citing, large language models, omitting table, large language

Comments: ACL 2026 (Oral)

Abstract:While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have only offered limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur across all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that incorporating data referencing as a critic significantly improves answer accuracy up to 12.0%, through critic-based filtering and rejection sampling. Finally, we trained a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs, and effectively assists inference for larger models.

5. [2606.32025] Generative Skill Composition for LLM Agents

Link: https://arxiv.org/abs/2606.32025

Authors: Xinyu Zhao, Zhen Tan, Vaishnav Tadiparthi, Nakul Agarwal, Kwonjoon Lee, Ehsan Moradi Pari, Hossein Nourkhiz Mahjoub, Tianlong Chen

Categories: Computation and Language (cs.CL)

Keywords: Recent LLM agents, Recent LLM, LLM agents benefit, solving complex tasks, LLM agents

Comments:

Abstract:Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a test suite, or refactoring a function across multiple files. As skill libraries grow and become reusable across tasks and domains, selecting an appropriate skill composition has emerged as a central bottleneck. Existing approaches fall into two categories. One exposes the agent's reasoning to the entire skill collection; the other performs skill retrieval via embeddings or LLM-based rerankers. Both provide useful insights; however, they miss the structural nature of skill composition, which is a joint decision over which skills, how many, and in what order -- three dimensions that cannot be decoupled. We formalize this as structured skill composition: given a task and a skill library, predict an executable skill plan that jointly specifies the activated subset, count, and execution order. We propose SkillComposer, which instantiates structured skill composition as task-conditioned skill sequence prediction. SkillComposer uses a constrained autoregressive decoder over skill identifiers, so subset, count, and order emerge jointly from a single decoding pass, and dependencies between successive skills are captured naturally. We build a training set of task-composition pairs from a real, human-curated skill library. We then evaluate SkillComposer along two axes: composition quality on a held-out test set, and downstream task success on SkillsBench across two production-grade coding agents. On GPT-5.2-Codex, Gemini-3-Pro-Preview, SkillComposer raises the pass rate by +23.1, +18.2pp over the no-skill baseline, surpassing top-3 retrieval and matching the gold-skill retrieval upper bound at lower prompt-token cost.

6. [2606.32022] SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

Link: https://arxiv.org/abs/2606.32022

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Residual-stream analysis, language-model computation evolves, intermediate decoding requires, decoding requires comparable, evolves across depth

Comments: an early-stage version

Abstract:Residual-stream analysis asks how language-model computation evolves across depth, but intermediate decoding requires comparable readout coordinates across layers. If embedding anchors and unembedding readout disagree on the chosen span, apparent motion may reflect measurement drift rather than computation. We introduce \emph{Semantic Reference Frames} (SemRF), an anchor-based formalism separating semantic measurement from residual dynamics. A SemRF fixes anchors and measures states against them. Pseudo-inverse tying gives exact synchronization; under restricted bi-invertibility, SemRF yields stable semantic-basis coordinates, distortion bounds, and near-identity changes. With the frame fixed, residual computation becomes a depthwise semantic trajectory. The anchors induce a semantic Voronoi diagram: distance, or evidence such as logits, assigns each layer to a coarse cell, while coordinates retain within-cell motion and margins. We define layerwise steps, contribution profiles, and imbalance diagnostics, then use the Voronoi trace to define a margin-relaxed tube. The canonical trace is the minimum-action path inside this tube; when nonempty with positive quadratic weight, it is unique and obeys a discrete spline equation away from active constraints. Excess action controls step, curvature, and profile mismatch. Low curvature implies piecewise-linear compressibility and local knowledge density: lower trace complexity means fewer semantic knots. Through the parameter-to-trajectory map, this gives a conditional link to parameter efficiency: among admissible settings fitting data, lower-action and lower-complexity traces use fewer semantic degrees of freedom. The guarantees require controlled interface error and small projection residual under explicit tube constraints.

7. [2606.32014] Scalable Behaviour Cloning on Browser Using via Skill Distillation

Link: https://arxiv.org/abs/2606.32014

Authors: Kaisen Yang, Zheng Jiang, Yuzhao Peng, Houde Qian, Boshi Zhang, Youjie Zheng, Shijin Hong, Qingle Liu, Ruoyu Han, Bohan Lyu, Bingxiang He, Eren Cai, Calvin Xiao, Qinhuai Na

Categories: Computation and Language (cs.CL)

Keywords: making human browsing, users collectively perform, editing to search, enterprise workflows, Internet users collectively

Comments:

Abstract:Internet users collectively perform an enormous range of skilled work through web browsers, from software development and document editing to search, forms, and enterprise workflows, making human browsing a highly scalable but under-exploited source of reusable browser skills. We argue that the bottleneck for browser agents is decision-making under incomplete information rather than low-level operation, and that the priors agents lack are already implicit in human interaction traces. We therefore study scalable behavior cloning for browser agents via skill distillation, converting user interaction trajectories into compact natural-language skills that agents can read, retrieve, reuse, and compose directly. We further organize the distilled skills into a skill graph so that growth proceeds through consolidation rather than unbounded accumulation. This suggests that the scalability of browser agents may come less from manually designed tasks and more from the collective skills already expressed by internet users. Our project is available at: this https URL.

8. [2606.31980] DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

Link: https://arxiv.org/abs/2606.31980

Authors: Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, Amy Pavel

Categories: Computation and Language (cs.CL)

Keywords: automating software tasks, increasingly capable, capable of automating, software tasks, automating software

Comments:

Abstract:Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications. We use DigitalCoach to evaluate whether state-of-the-art models can teach humans how to use computers. Automated evaluation shows that models differ from humans in how they coach: models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions. When we fix the coaching method, models produce utterances similar to human references yet poorly grounded in visual context. Interactive evaluation confirms that model coaches cause learners to passively follow instructions without deeper engagement and fall short in visual grounding. DigitalCoach lays a foundation for collaborative and proactive computer use coaching agents.

9. [2606.31966] MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

Link: https://arxiv.org/abs/2606.31966

Authors: Qingyun Liu, Jiwen Zhang, Jingyi Hu, Siyuan Wang, Zhongyu Wei

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent multimodal large, environments remains underexplored, visually grounded environments, grounded environments remains, multimodal large language

Comments: Project website: [this https URL](https://q-i-n-g.github.io/MECoBench-Website/)

Abstract:Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a multimodal embodied cooperation benchmark with an evaluation platform spanning diverse real-world tasks, two cooperation structures, and three collaboration modes. Through extensive experiments across various MLLMs, we summarize three key findings: (i) Collaboration generally improves embodied task completion, but its benefits depend on balancing collaborative gains against coordination complexity. (ii) Communication is essential to collaboration gains, while the best collaboration mode depends on team size and model capability. (iii) Moreover, collaboration improves robustness under noisy priors and exploration conditions. Generally, MECoBench provides a systematic testbed for understanding the mechanisms and limits of multimodal embodied collaboration. Code and dataset are available at this https URL.

10. [2606.31963] Signed-Permutation Coordinate Transport for RMSNorm Transformers

Link: https://arxiv.org/abs/2606.31963

Authors: John Sweeney

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: Modern LLM workflows, LLM workflows move, Modern LLM, workflows move coordinate-indexed, LLM workflows

Comments: 31 pages, 2 figures, 26 tables

Abstract:Modern LLM workflows move coordinate-indexed objects across checkpoints: steering vectors, sparse autoencoders, top-$k$ neuron sets, attribution lists, and merge alignments. This is only well posed after fixing the model's residual-stream gauge, which we show is architecture-dependent: LayerNorm residual charts have permutation gauge $S_d$ (up to a global sign flip), while RMSNorm charts with generic per-channel gain have signed-permutation gauge $B_d = S_d \ltimes \{\pm 1\}^d$. Permutation-only alignment is therefore symmetry-incomplete for RMSNorm models. We introduce sign-marginalized Hungarian matching and prove a sharp failure mode: with decorrelated coordinates, raw signed-correlation matching has a structural permutation-accuracy ceiling at the positive-sign fraction of the true gauge, which sign-marginalization removes. We then make coordinate-preserving transport, not function-level merging, the primary object: composing saved-checkpoint local $B_d$ gauges along same-base fine-tuning trajectories recovers 91.1% of cross-run coordinates at 1500 steps versus 60.3% for endpoint matching, and the gain is not explained by merely routing through the base. The recovered gauge transfers tools that permutation-only alignment breaks: TinyLlama SAE reconstruction has NMSE 0.004 under $B_d$ versus 1.08 under $S_d$; Qwen sentiment steering preserves 95.8% of its effect versus 17.2%; refusal steering reverses sign under $S_d$; coordinate-preserving merges behave the same way. The same covariance governs stateful training: signed transport of AdamW state preserves the resumed trajectory, while permutation-only state follows a different one from a functionally identical checkpoint. Finally, gauge-sweep audits show index-level interpretability claims are reproducible only relative to an explicit gauge.
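To make the failure mode in this abstract concrete, here is a small sketch under stated assumptions. It reads "sign-marginalized" as matching on absolute correlation and recovering each sign afterwards, which is an interpretation and not confirmed by the abstract. It also swaps the Hungarian algorithm for a brute-force search over permutations, which is only practical for tiny dimensions. The function `match_coordinates` and the toy correlation matrix are hypothetical.

```python
from itertools import permutations


def match_coordinates(corr, sign_marginalized):
    """Find the permutation maximizing total (absolute) correlation.

    corr[i][j] is an assumed correlation between coordinate i of one model
    and coordinate j of another. Brute force stands in for Hungarian matching.
    Returns (permutation, signs), where signs[i] is the sign of the matched entry.
    """
    d = len(corr)

    def score(perm):
        values = [corr[i][perm[i]] for i in range(d)]
        if sign_marginalized:
            return sum(abs(v) for v in values)
        return sum(values)

    best = max(permutations(range(d)), key=score)
    signs = tuple(1 if corr[i][best[i]] >= 0 else -1 for i in range(d))
    return best, signs


# Toy ground truth: a signed permutation with two negative signs.
true_perm = (2, 0, 1)
true_signs = (1, -1, -1)

# Toy correlation matrix: +/-1 on the true match, small spurious values elsewhere.
corr = [[0.1] * 3 for _ in range(3)]
for i in range(3):
    corr[i][true_perm[i]] = float(true_signs[i])

raw_perm, _ = match_coordinates(corr, sign_marginalized=False)
marg_perm, marg_signs = match_coordinates(corr, sign_marginalized=True)
```

In this toy case, raw signed matching avoids the strongly negative entries and so misses the flipped coordinates, while matching on absolute values recovers both the permutation and the signs.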

11. [2606.31947] LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish

Link: https://arxiv.org/abs/2606.31947

Authors: Nina Hosseini-Kivanani, Sandipana Dowerah

Categories: Computation and Language (cs.CL)

Keywords: datasets predominantly focus, speech technology research, speech datasets predominantly, Radio Télévision Luxembourg, widely spoken languages

Comments: 7 pages, 4 figures, under review

Abstract:State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we introduce LuxEmo, a 21-hour conversational expressive speech corpus for Luxembourgish with 4 emotion categories. LuxEmo is derived from Radio Télévision Luxembourg (RTL) youth broadcasts, using automated detection followed by human validation. We propose a semi-automatic curation workflow combining voice activity detection, denoising, language identification, LuxASR-based segmentation, automatic emotion prediction, lexical cues, and targeted human review. Additionally, we benchmark five expressive TTS systems covering German-based cross-lingual transfer, multilingual Luxembourgish support, Luxembourgish adaptation, and non-parametric prosody transfer. Performance is evaluated using both objective metrics and human evaluation.

12. [2606.31916] Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

Link: https://arxiv.org/abs/2606.31916

Authors: Ben Slater, Matteo G. Mecattaf, Lucy G. Cheke, John Burden, Winnie Street

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Theory of Mind, Large Language, passive question-answering formats, benchmarks for Large

Comments: 29 pages, 12 figures

Abstract:Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we evaluate an agent's ability to induce specific belief states in other agents by taking actions rather than using conversational persuasion, a capability we call Non-Conversational Planning ToM (NCP-ToM). NCP-ToM is likely to be essential for many agent use-cases, including within user-assistant interactions and pedagogical contexts, but may also present manipulation or misinformation risks. Using a novel framework, NCP-ExploreToM, we subvert the conventional task structure by providing models with a set of belief state goals and requiring them to move objects or direct characters into rooms to achieve their goals. We evaluated six frontier models, including GPT-5, Gemini 2.5 Pro and the Claude 4 series, and a cohort of human participants, across 600 task instances. GPT-5 was successful on approximately 80% of tasks in the agentic setting, and was the only model to outperform human participants on our task, but was still less robust than humans across contexts. We additionally found that all models, like humans, performed better on tasks inducing true belief states than false belief states, which is a positive signal for alignment efforts. These findings highlight emerging social-reasoning capabilities in LLMs for non-conversational task completion and underscore the necessity of agentic evaluations for understanding the safety and alignment of autonomous social agents.

13. [2606.31859] Review Residuals: Update-Conditioned Residual Gating for Transformers

Link: https://arxiv.org/abs/2606.31859

Authors: Kyle Kramer

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Residual connections add, sublayer proposed update, Review Residuals, introduce Review Residuals, connections add

Comments: 9 pages, 2 figures. Also on Zenodo: [this https URL](https://doi.org/10.5281/zenodo.21053343) ; Code: [this https URL](https://github.com/SixSigmaEngineer/review-residuals)

Abstract: Residual connections add every sublayer's proposed update with a fixed coefficient of one; the network never evaluates whether an update is reliable before committing it. Drawing on the human-factors principle of independent verification, we introduce Review Residuals, which scale each update by a learned, input-dependent gate conditioned on both the current state and the proposed update: h_l = h_{l-1} + r_l * u_l with r_l = sigmoid(W[RMSNorm(h_{l-1}), RMSNorm(u_l)]). Conditioning the gate on the update is the property that distinguishes it from prior gated and scaled residuals. We report two findings. First, a depth-stability result: a convex (Highway-style) form of the gate reintroduces vanishing gradients and fails to train beyond ~20 layers, whereas the additive, identity-preserving form trains stably at all depths we tested. Second, an emergence-with-scale result: trained from scratch across five sizes (60M-1B parameters, multi-seed), Review Residuals show no advantage at small scale but at 590M significantly outperform both a parameter-matched Highway gate and a parameter-matched standard residual (p<0.05), with a larger advantage at 1B. The benefit grows with model size rather than shrinking.
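The gate formula in this abstract can be written out as a minimal sketch in plain Python. Several details are assumptions, since the abstract does not specify them: the gate is taken to be per-channel (so `W` has shape d x 2d), RMSNorm is used without a learned gain, and there is no bias term. The function names are hypothetical and this is not the author's implementation.

```python
import math


def rmsnorm(x, eps=1e-6):
    """RMS-normalize a vector (assumption: no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def review_residual(h_prev, u, W):
    """Compute h_l = h_{l-1} + r_l * u_l with an update-conditioned gate.

    h_prev: current residual state, length d
    u:      proposed sublayer update, length d
    W:      gate weights, d rows of length 2d (assumed per-channel gate)
    """
    # The gate sees both the normalized state and the normalized update.
    features = rmsnorm(h_prev) + rmsnorm(u)  # list concatenation, length 2d
    gate = [sigmoid(sum(w * f for w, f in zip(row, features))) for row in W]
    return [h + r * du for h, r, du in zip(h_prev, gate, u)]


# With all-zero gate weights every gate value is sigmoid(0) = 0.5,
# so exactly half of the proposed update is added.
h = [1.0, 2.0]
u = [0.4, -0.2]
W_zero = [[0.0] * 4 for _ in range(2)]
h_next = review_residual(h, u, W_zero)
```

Because the gate multiplies only the update and the identity path is left untouched, this is the additive form that the abstract contrasts with the convex Highway-style variant.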

14. [2606.31845] Explicit Fuzzy Logic in the Feed-Forward Layer: Self-Forgetting Quantifiers Discover Legible Grammatical-Licensing Detectors

Link: https://arxiv.org/abs/2606.31845

Authors: Mark Oskin

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: distinctions attention gathers, sublayer materializes, attention gathers, materializes the distinctions, distinctions attention

Comments:

Abstract:A transformer's feed-forward (FFN) sublayer materializes the distinctions attention gathers, yet gives no account of what it computes. In a parameter-neutral replacement, each hidden unit is an explicit fuzzy set operation on sigmoid-bounded [0,1] memberships: intersection A*B and set-difference A*(1-B), the latter a bounded positive negation ("A but not B") that gated/bilinear units lack -- a negation-capable FFN (NC-FFN). On N-bit parity they are the most parameter-efficient reasoning basis at shallow depth; at scale (125M, OpenWebText) NC-FFN ties the GELU baseline's perplexity, every unit carrying explicit logical form. Two limits share one cause: two-operand logic localizes to layer 0 and erodes under training, and the one robust grammatical deficit concentrates in licensing and quantifiers, beyond within-token operators. We resolve both with a small block of sequence quantifiers: a soft existential and a soft proportion, each with a per-unit learned forgetting rate from a sticky init. This recovers the deficit at epoch one (halving the wider epoch-two gap), modestly leads on LAMBADA, and makes the FFN legible: the structure now holds and migrates into depth; the decay un-learns its stickiness (median half-life ~1.5 tokens; zero latch units); and at the semantic layers the units read, without dictionary learning, as grammatical licensing detectors: each fires on a licensor (a comparative, a passive participle, a negative-polarity item) and carries its memory forward to predict the licensed word (than, by, nor). This legibility is localized and free only up to a partition (a fully Boolean FFN diverges in training), but the result is a parameter-neutral, language-model-quality transformer with a readable, interpretable-by-construction grammatical mechanism -- an account not just of what a feed-forward layer represents but how it licenses.
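The two hidden-unit operations named in this abstract, intersection A*B and set-difference A*(1-B) on sigmoid-bounded memberships, can be illustrated with a short sketch. It covers only those two operators; the sequence quantifiers and forgetting rates are not specified in enough detail to reproduce. The function names and input values are hypothetical.

```python
import math


def membership(z):
    """Map a pre-activation to a fuzzy membership in [0, 1] via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def fuzzy_and(a, b):
    """Fuzzy intersection A*B: high only when both memberships are high."""
    return a * b


def fuzzy_diff(a, b):
    """Fuzzy set-difference A*(1-B): 'A but not B'."""
    return a * (1.0 - b)


# Hypothetical pre-activations: A strongly present, B strongly absent.
a = membership(10.0)
b = membership(-10.0)

both = fuzzy_and(a, b)        # close to 0, since B is absent
a_not_b = fuzzy_diff(a, b)    # close to 1, since A holds and B does not
```

Both outputs stay inside [0, 1] because they are products of values in [0, 1], which is what makes the negation in the set-difference bounded and positive.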

15. [2606.31796] CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

Link: https://arxiv.org/abs/2606.31796

Authors: Dohyeon Kwon, Youngjin Park

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Selective Ground Truth, Ground Truth Token, Truth Token Training

Comments: 33 pages, 3 figures, 28 tables. Preprint. Figures are native TikZ/pgfplots. Evaluation is loss-based; downstream benchmarks (KMMLU, HAERAE, KoBEST, MMLU) and selection-control ablations (random-15%, top-loss-15%) to appear in a future version

Abstract:We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output tokens that carry semantic payload. Through positive gradient coupling in position-shared transformer weights -- a token-level instance of auxiliary-task transfer -- the remaining 85% of unsupervised tokens still improve substantially, giving a 4.5x per-supervised-token efficiency (at the step-100 eval optimum, ~67% of the full-sequence loss reduction is recovered from 15% of the supervision). We prove that this improvement on unsupervised tokens is guaranteed whenever the gradient coupling coefficient gamma-bar = 0.72 is positive (Theorem 1), and show the effect is a property of natural-language structure: it collapses on shuffled text. (2) Depth compression with recurrent recovery. A 48-layer, 1B-parameter transformer is compressed to 6 layers (227M) by averaging adjacent layers and restored through learned recurrent unrolling. With 34 effective recurrent layers it reaches a held-out loss of 2.934, within measurement noise of a 566M dense model at 2.926 -- a 2.5x reduction in parameters. (3) Fusion of compressed experts. Assembling several compressed models as a Mixture of Efficient Experts (MoEE) with multi-token prediction improves over each single expert at comparable active parameters: a 2-expert MoEE reaches loss 2.789 versus 2.926 for the best single compressed model. We validate these techniques on CHERRY-1.8B, a Korean foundation model whose every trainable parameter derives from our own training runs. We are explicit throughout about the scope of the evidence (one model family, Korean data, loss-based metrics) and about which claims are established versus prospective.
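Two of the headline ratios in this abstract appear to follow from figures it states directly. The short check below assumes that the 4.5x efficiency is the recovered fraction of loss reduction divided by the supervised token fraction, and that the 2.5x reduction compares the 566M dense model with the 227M compressed one. Both readings are inferences from the abstract, not confirmed definitions.

```python
# Per-supervised-token efficiency (assumed definition: recovered / supervised).
recovered_fraction = 0.67    # ~67% of the full-sequence loss reduction
supervised_fraction = 0.15   # ~15% of output tokens receive supervision
per_token_efficiency = recovered_fraction / supervised_fraction  # about 4.47

# Parameter reduction (assumed comparison: dense 566M vs compressed 227M).
dense_params_m = 566
compressed_params_m = 227
param_reduction = dense_params_m / compressed_params_m  # about 2.49
```

Rounded to one decimal place these give 4.5 and 2.5, matching the figures quoted in the abstract.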

16. 【2606.31781】SpikeLogBERT: Energy-Efficient Log Parsing Using Spiking Transformer Networks

Link: https://arxiv.org/abs/2606.31781

Authors: Thuan Bui, Duong Do, Tung Vu, Duc-Tho Mai, Cong-Kha Pham

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: transforming raw system, structured event templates, automated log analysis, raw system logs, system monitoring

Abstract:Log parsing is a fundamental step in automated log analysis, transforming raw system logs into structured event templates for downstream tasks such as anomaly detection and system monitoring. Existing log parsing methods range from rule-based and clustering-based approaches to neural models that learn semantic representations from log messages. However, neural approaches typically rely on dense matrix multiplications, which can result in high computational cost and energy consumption. This paper presents SpikeLogBERT, a spiking neural network framework for energy-efficient log parsing. The proposed model integrates a spiking transformer architecture with knowledge distillation from a BERT teacher model, enabling spike-driven computation while preserving semantic representation capability. By leveraging sparse spike activations and event-driven processing, the number of active operations during inference can be significantly reduced. As an initial benchmark study, experiments on the HDFS dataset demonstrate that SpikeLogBERT outperforms ANN-based neural log parsing models with a parsing accuracy of 0.99997, while reducing estimated theoretical energy consumption by up to 62.6% under standard 45nm CMOS assumptions.

17. 【2606.31779】Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

Link: https://arxiv.org/abs/2606.31779

Authors: Ying Fan, Anej Svete, Kangwook Lee

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: models typically reason, Language models typically, typically reason, CoT, models typically

Abstract:Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model's hidden states, replacing decoded tokens with continuous representations for greater efficiency. However, existing latent CoT methods underperform explicit CoT beyond 1B parameters, and the gap widens with scale. Looped, or recurrent-depth, Transformers, which reuse their weights to increase computation depth without adding parameters, are a natural fit for latent reasoning. We therefore ask whether looped Transformers can bridge this gap. We answer affirmatively with a simple recipe: a looped padded Transformer that processes K latent blocks in parallel for R iterations, with a cross-entropy loss on each latent position's gold CoT-step token, similar to explicit CoT supervision. We instantiate it as LOTUS (Looped Transformers with parallel supervision on latents). LOTUS is, to our knowledge, the first latent-CoT method to bridge the gap to explicit CoT at the 3B scale, while cutting thought-phase latency by 2.5x-6.9x from compact math expressions to natural language. Projecting LOTUS's post-loop latents through the base LM head recovers the gold reasoning steps and even surfaces alternative valid intermediate steps, evidence that its latent space is interpretable and CoT-aligned. Ablations confirm that both the looped backbone and the parallel supervision on gold CoT tokens are essential.

18. 【2606.31741】STEB: Style Text Embedding Benchmark

Link: https://arxiv.org/abs/2606.31741

Authors: Rafael Rivera Soto, Anna Wegmann, Cristina Aggazzotti

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Massive Text Embedding, Text Embedding Benchmark, Style Text Embedding, embeddings remains fragmented, Massive Text

Abstract:While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap, we introduce the Style Text Embedding Benchmark, a comprehensive open-source benchmark intended to standardize the evaluation of style embeddings. STEB encompasses 96 datasets across 7 languages, spanning applications such as authorship verification, authorship retrieval, AI-text detection, probing of linguistic features, and others. We find that semantic embeddings consistently fail in stylistic tasks, and that there is no style embedding that is universally superior across all tasks evaluated. We open-source the STEB code base at: this https URL.

19. 【2606.31722】Adapting Foundation ASR Models to Dysarthric Speech: A Case Study

Link: https://arxiv.org/abs/2606.31722

Authors: Christian Huber, Laura Kernahan, Alexander Waibel

Subjects: Computation and Language (cs.CL)

Keywords: Automatic speech recognition, Automatic speech, limiting their usefulness, everyday communication, ASR

Abstract:Automatic speech recognition (ASR) systems often perform poorly in dysarthric speech, limiting their usefulness to affected speakers in everyday communication. This paper presents a personalized ASR system for a dysarthric speaker, built by adapting a foundation ASR model to speaker-specific data. Using the TEQST tool, we collected 92 hours of read speech and later added 8.8 hours of user corrections gathered through a deployed mobile application. Starting from Whisper, fine-tuning reduced word error rate to 15.8% with only 1.4 hours of adaptation data, reached 10.7% with 22.5 hours, and achieved the best result of 9.7% when using all available data including the corrections. Using LoRA adaptation and/or Qwen3-ASR as foundation model performed worse in this setting. The results show that personalized fine-tuning can make foundation ASR models substantially more effective for dysarthric speech and suitable for practical deployment.

20. 【2606.31719】Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

Link: https://arxiv.org/abs/2606.31719

Authors: Nan Li, Albert Gatt, Massimo Poesio

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: guarantee shared interpretation, shared interpretation, shared, shared perception, guarantee shared

Comments: 17 pages, 9 figures, 8 tables; accepted to SIGDIAL 2026

Abstract:In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what has been shared between dialogue participants through grounding. We formulate this as an interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, and evaluate VLMs under systematically controlled manipulations of dialogue context and map-information access. Our results show that providing authentic map images improves overall performance but shifts models toward over-predicting alignment. Textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, indicating that the bias is driven by task-relevant map content, not the visual channel. This improvement comes at the cost of degraded accuracy on non-aligned cases. Calibration analysis and reference-chain tracking further suggest that models rely on static referential cues on the maps rather than tracking how grounding unfolds through dialogue history. We observe these patterns most clearly in Qwen3-VL-8B-Instruct and, to varying degrees, in four additional models from two architecture families. In models that exhibit the bias, map content, whether presented visually or textually, is treated as evidence of mutual understanding, conflating potential with established common ground.

21. 【2606.31718】Cross-lingual Relation Extraction with Large Language Models: Zero-Shot, Few-Shot, and Fine-Tuned Evaluation on Romanian

Link: https://arxiv.org/abs/2606.31718

Authors: Dragos-Mitrut Vasile, Elena-Simona Apostol, Stefan-Adrian Toma, Adrian Paschke, Ciprian-Octavian Truica

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: annotated corpora, typically constrained, lack of annotated, Romanian, Romanian BERT

Abstract:Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large language model (LLM) inference. We translate the SemEval-2010 Task 8 benchmark from English to Romanian using an LLM-based translation pipeline and evaluate Gemma 4 31B under zero-shot, few-shot, and QLoRA fine-tuned configurations, against four encoder baselines spanning 125M to 560M parameters: XLM-RoBERTa (base and large), Romanian BERT, and RoBERT-large. We assess two task formulations: relation classification with marked entities and end-to-end extraction. Our results show that Romanian incurs a 3 to 5 percentage point (pp) drop relative to English in prompt-only settings, that few-shot prompting provides marginal gains over zero-shot, and that QLoRA fine-tuning improves macro F1-Score by more than 22 percentage points in both languages while reducing the cross-lingual gap from 3.3 to 1.4pp. The encoder baselines come within 1-4pp of QLoRA Gemma on Romanian despite being 50-250 times smaller, with monolingual Romanian BERT at 125M parameters matching multilingual XLM-R at 278M. The case for using a 31B model for single-task RE on Romanian is therefore weak in deployment scenarios where compute matters. We release the translated dataset, evaluation code, and trained models.

22. 【2606.31694】RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

Link: https://arxiv.org/abs/2606.31694

Authors: Jingbo He, Michael Färber, Roberto Calandra

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: manipulating open-world objects, robots manipulating open-world, open-world objects, manipulating open-world, representations must generalize

Abstract:For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from full robot presses on 122 industrial reference materials in 7 categories, recorded with three DIGIT sensors at multiple contact positions. RCT preserves each press as a contact sequence, enabling held-out evaluation across materials, categories, sensors, contact positions, and contact sequences. Frames from one press are strongly correlated: frame-random splits can place near-duplicate observations of the same physical interaction in both training and test. With the encoder held fixed, removing contact-sequence overlap reduces tactile-to-text Recall@1 by 17.7 percentage points. When materials are additionally held out at training time, performance drops sharply, leaving held-out-material Recall@1 at 25.1 +/- 6.1% averaged over three held-out draws. The public TVL/HCT split shows the same structure: every test contact sequence appears in training, and raw-pixel nearest neighbors recover the correct sequence in 98.3% of cases. Uniformly sampling a press improves contrastive training, and RCT-trained embeddings improve category probes on unseen materials. RCT makes contact-sequence-aware, held-out-material evaluation reproducible and exposes novel-material generalization as a central challenge for robotic tactile perception. The RCT dataset is open-sourced at this https URL
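The leakage issue described in this abstract (frames from one press landing in both training and test) can be illustrated with a group-aware split. The following is a minimal sketch, not the authors' code: the function name `split_by_press` and the `(frame_id, press_id)` input layout are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_press(frames, test_fraction=0.2, seed=0):
    """Split frames so that no press (contact sequence) appears on both sides.

    frames: iterable of (frame_id, press_id) pairs (hypothetical layout).
    Returns (train_frame_ids, test_frame_ids).
    """
    by_press = defaultdict(list)
    for frame_id, press_id in frames:
        by_press[press_id].append(frame_id)

    # Shuffle whole presses, not individual frames, so that near-duplicate
    # frames from the same physical interaction stay together.
    press_ids = sorted(by_press)
    random.Random(seed).shuffle(press_ids)
    n_test = max(1, round(len(press_ids) * test_fraction))
    test_presses = set(press_ids[:n_test])

    train = [f for p in press_ids if p not in test_presses for f in by_press[p]]
    test = [f for p in press_ids if p in test_presses for f in by_press[p]]
    return train, test


# Toy data: 10 presses with 10 frames each; frame i belongs to press i // 10.
frames = [(i, i // 10) for i in range(100)]
train, test = split_by_press(frames)
```

A frame-random split would instead shuffle the 100 frame IDs directly, which is exactly what lets correlated frames from one press leak across the boundary.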

23. 【2606.31693】ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

Link: https://arxiv.org/abs/2606.31693

Authors: Jiacheng Chen, Tao Zhang, Manxi Lin, Dunxian Huang, Teng Shi, Honghao Fu, Mengyan Li, Xinming Zhang, Chenchi Zhang, Xuan Lu, Xiaoxiong Du, Haibin Chen, Shaolin Ye, Hao Chang, Xiaoqi Li, Shuwen Xiao, Yujin Yuan, Jingxuan Feng, Shaopan Xiong, Huimin Yi, Ju Huang, Qiu Shen, Ying Chen, Junjun Zheng, Xiangheng Kong, Yuning Jiang

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: intent-driven experiences orchestrated, wave of AI-native, AI-native applications, applications is moving, feed-based browsing

Abstract:The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation pipelines, forcing complex intents through low-bandwidth retrieval or ranking interfaces and leaving a gap between language understanding and item-space fulfillment. Generative recommendation gives LLMs a direct item-space interface through semantic IDs (SIDs), but existing models mainly generate candidates for retrieval rather than translate flexible intents into item-space outcomes. We propose ShopX to address this bottleneck by unifying intent understanding, execution planning, and flexible SID-native item-space operations into a single foundation model. We deploy ShopX in agentic shopping workflows through a model-native item-fulfillment framework with a serving harness that defines a model-facing action protocol and exposes support surfaces for context access, catalog grounding, and state management. Within this framework, ShopX plans and composes SID-based item-space operations such as SID beam-search retrieval, listwise ranking, or product bundling. This model-centric design reduces lossy hand-offs between agent orchestration and item-space execution. To build ShopX, we design semantically recoverable, LLM-operable SIDs and a training recipe that equips a general LLM for flexible multi-turn item-space fulfillment while retaining the knowledge and instruction-following abilities needed by a shopping agent. We evaluate the ShopX framework against tool-mediated agentic systems on single- and multi-turn fulfillment tasks derived from anonymized Taobao production logs, showing that model-native fulfillment improves overall framework behavior, especially on complex or ambiguous requests.

24. 【2606.31692】Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management

Link: https://arxiv.org/abs/2606.31692

Authors: Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Warre Veys, Casimiro Pío Carrino, Matthias De Lange, Daniel Deniz Cerpa, Álvaro Rodrigo, Jens-Joris Decorte, Rabih Zbib

Subjects: Computation and Language (cs.CL)

Keywords: Human Capital Management, Natural Language Processing, Conference and Labs, advancing Natural Language, Language Processing research

Abstract:This paper presents an overview of the second edition of the TalentCLEF challenge, organized as a Lab at the Conference and Labs of the Evaluation Forum (CLEF) 2026. TalentCLEF is an initiative aimed at advancing Natural Language Processing research in Human Capital Management. The second edition of the challenge consisted of two tasks: Task A, contextualized job-person matching, focuses on identifying and ranking the most suitable candidates represented by their resumes for a given job vacancy in English and Spanish. Task B, job-skill matching with skill type classification, addresses retrieving the most relevant skills for a given job title in English and distinguishing between core and contextual skills. TalentCLEF attracted 113 registered teams and received more than 400 submissions in the two tasks, reflecting the growing interest of the research community in shared evaluation benchmarks for Human Capital Management. This paper describes the motivation and organization of the challenge, summarizes the datasets and evaluation settings, and reports the main results obtained by the participating teams.

25. 【2606.31644】Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

Link: https://arxiv.org/abs/2606.31644

Authors: Mohammadamin Shafiei, Shuyue Stella Li, Yulia Tsvetkov

Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: morally consequential roles, large language models, roles in healthcare, hiring contexts, large language

Abstract:As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. We show that current fairness evaluations substantially overestimate moral safety. Models appear fair when demographic identity is stated as an explicit label, yet become measurably less fair when the same identity must be inferred. We term this failure performative compliance, where a model is fair when the presentation resembles a fairness evaluation and less fair as that cue weakens. We introduce a cue-variation methodology that holds the moral dilemma and the demographic identity fixed and varies only how that identity is conveyed. Hiding the explicit label raises harmful decisions by +4.4 pp and changes model safety rankings, and the shift persists when models correctly infer the demographic, ruling out attribution error. We propose the Cue Visibility Gap, a model-agnostic robustness metric that can be added to any existing fairness benchmark to separate genuine from performative moral safety. Fairness evaluations that omit cue variation measure surface compliance, not moral robustness, and should not ground deployment decisions in high-stakes settings.

26. 【2606.31642】Tone-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition

Link: https://arxiv.org/abs/2606.31642

Authors: Kesego Mokgosi, Vukosi Marivate, Sitwala Mundia, Unarine Netshifhefhe, Tsholofelo Hope Mogale, Thapelo Sindane

Subjects: Computation and Language (cs.CL)

Keywords: current foundation ASR, Southern Bantu languages, Southern Bantu, foundation ASR models, produce zero-shot WER

Abstract:Southern Bantu languages are spoken by over 80 million people, yet current foundation ASR models still produce zero-shot WER above 100%, which limits practical use in education and public services. We addressed this gap with a tone conditioned curriculum framework for 6 Southern Bantu languages that combined hybrid difficulty scoring, gated adapters driven by tonal statistics and staged curriculum training. We trained on a community corpus and tested transfer to NCHLT to measure robustness beyond matched evaluation. Results revealed clear interactions between architecture and language, with W2V-BERT outperforming Whisper on Nguni languages by 3 to 4 WER points whilst Whisper performed better on Sotho-Tswana languages. W2V-BERT with tone conditioning reached 28.41% average WER across datasets and 23.79% on Xitsonga transfer. No single model suited all 6 languages, so deployment should pair model selection per language with validation across corpora.
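A WER above 100% may look impossible at first glance. Under the standard definition (word-level edit distance divided by the number of reference words), it happens whenever the hypothesis contains enough insertions. The sketch below illustrates that definition only; the function name `word_error_rate` and the placeholder tokens are illustrative, and the paper's own scoring and text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Two reference words, three spurious insertions: 3 errors / 2 words = 1.5 (150%).
print(word_error_rate("a b", "a b c d e"))
```

Because the denominator counts only reference words, a model that hallucinates long outputs for short utterances can exceed 100% without getting every word wrong.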

27. 【2606.31608】CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

Link: https://arxiv.org/abs/2606.31608

Authors: Ajmal M., Abin Roy, Afthab Salam Kanniyan, Jawadh Abdul Kabeer, Jerin James, Preslav Nakov, Zhuohan Xie

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, reasoning remains difficult, achieve strong results, Language Models

Comments: 21 pages, 12 figures

Abstract:Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations can appear clinically convincing even when the final diagnosis is incorrect. We introduce CLExEval, a human-in-the-loop framework for evaluating LLM clinical reasoning under progressive information masking. CLExEval combines 5,600 expert-physician annotations with 200 clinical reasoning traces derived from 40 rare diagnostic cases. Our analysis identifies three recurring failure patterns: (i) verbosity bias, where GPT-4o-mini's diagnostic accuracy drops from 95.0% to 32.5% under information scarcity; (ii) a hidden knowledge paradox, where a specialist model reaches 92.5% maximum diagnostic potential but fails to retrieve that knowledge reliably in verbose contexts; and (iii) a 68.6% reasoning-to-output mismatch, where correct diagnoses appear in reasoning traces but are not reflected in final answers. We further evaluate the LLM-as-a-Judge paradigm on a human-verified failure set (n = 142). GPT-4o-mini approved 47.9% of clinically incorrect outputs, while HuatuoGPT-o1 approved all validly scored failures and showed a positive self-preference bias. These results suggest that standalone automated clinical evaluations can substantially overestimate clinical reliability without expert-grounded validation.

28. 【2606.31602】Robust Text Watermarking for Large Language Models via Dual Semantic Embeddings

Link: https://arxiv.org/abs/2606.31602

Authors: Jonas Schäfer, Cezary Pilaszewicz, Gerhard Wunder

Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: presents Dual-Embedding Watermarking, large language models, semantic watermarking scheme, work presents Dual-Embedding, Dual-Embedding Watermarking

Comments: Preprint. 22 pages, 9 tables, 1 figure

Abstract:This work presents Dual-Embedding Watermarking (DEW), a semantic watermarking scheme for large language models (LLMs) that leverages contextual and token-level embeddings to enhance robustness against paraphrasing and translation. DEW utilizes a signal-processing methodology, applying algebraic vector-space operations to token and context embeddings to derive a watermark signal that degrades gracefully under semantic shifts. The method obfuscates the watermark by projecting embedding vectors through pseudo-random matrices seeded with a secret key. Relevant distributions derived from the underlying algebra are evaluated and employed for statistical testing and benchmarking of DEW. Experimental results across multiple LLMs indicate that DEW improves post-paraphrase detection while maintaining competitive text quality, and remains detectable after translation, even when prior semantic watermarks degrade significantly. These findings position DEW as a practical and robust solution for safeguarding LLM-generated text and addressing critical issues in responsible AI deployment.

29. 【2606.31551】AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Link: https://arxiv.org/abs/2606.31551

Authors: Zhaojian Yu, Penghao Yin, Shuzheng Gao, Shilin He, Kai Cai, Xiao-Ping Zhang

Subjects: Computation and Language (cs.CL)

Keywords: highly human-intensive process, frontier language model, frontier language, remains a highly, human-intensive process

Abstract:Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that autonomous post-training is not just a coding problem: it requires the agent to repeatedly plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state across many hours of interaction. We present AutoTrainess, an LM agent that exposes these operations as a repository of agent-computer interfaces for planning, data preparation, training, evaluation, and logging. Rather than leaving the agent to operate in a raw CLI environment with an underspecified action space, AutoTrainess externalizes prior human experience as explicit workflows, rules, and execution constraints that guide the agent toward effective and reliable training behavior. On PostTrainBench, AutoTrainess consistently outperforms CLI-only baselines, achieving 26.94 average score with GPT-5.4 (Codex) versus 23.21 for CLI-only. It also generalizes across models and harnesses, improving DeepSeek-V4-Flash (OpenCode) from 12.13 to 19.58.

30. 【2606.31543】Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

Link: https://arxiv.org/abs/2606.31543

Authors: Johan Land

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: internally coherent reasoning, Large language models, coherent reasoning traces, abstract reasoning tasks, candidate reasoning traces

Comments: 37 pages, 4 figures; source code available at https://github.com/beetree/ARC-AGI

Abstract:Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I present a solver for ARC-AGI-2, a few-shot visual reasoning benchmark, built around two principles: (i) treating reasoning modalities as search operators, generating diverse candidates independently across text, image, and code channels, and (ii) context-preserving holistic judging, in which a judge model jointly compares all candidate reasoning traces within a single long-context prompt. Unlike self-consistency or majority voting, this approach reliably recovers correct minority hypotheses on tasks where the modal answer is wrong. On the ARC Prize semi-private evaluation set, the solver achieves 72.9 percent at USD 38.99 per task - the highest score on the verified leaderboard at the time of writing, exceeding the best standalone frontier models, GPT-5.2 Pro at 54.2 percent and Gemini 3 Pro at 54.0 percent, by +18.7 percentage points. On the public evaluation set, it achieves 76.1 percent at USD 19.69 per task. I release the full source code and document extensive negative results, including the finding that prescriptive prompting templates and iterative refinement systematically reduce hypothesis diversity and degrade performance.

31. 【2606.31522】FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

Link: https://arxiv.org/abs/2606.31522

Authors: Muhammad Usman Safder, Ayesha Gull, Rania Elbadry, Fan Zhang, Yankai Chen, Xueqing Peng, Xue (Steve) Liu, Preslav Nakov, Zhuohan Xie

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, avoid speculative bets, Mandate Salience Decay

Comments: 29 pages, includes figures and tables; formalizes Mandate Salience Decay and introduces FinPersona-Bench

Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision throughout deployment. In practice, however, as market context accumulates over long horizons, these mandates gradually lose their behavioral influence, a phenomenon we formalize as Mandate Salience Decay (MSD). To measure MSD objectively, we introduce FinPersona-Bench, a simulation benchmark in which a synthetic market decouples observable price from hidden fundamental value, enabling falsifiable evaluation across three failure modes: trading without signal in calm markets, panic-selling during crashes, and ignoring fundamental value during speculative bubbles. Evaluating 18 leading frontier and open-source LLMs, each assigned one of three behavioral profiles ranging from strict capital preservation to aggressive growth, shows that MSD compounds over time and is model-dependent. In crash scenarios, the behavioral gap between static agents and those receiving periodic mandate re-grounding grows 4.4x from the first to the final quarter of the simulation. The effects of mandate re-grounding are not uniformly positive: it consistently helps conservative agents in low-signal markets but actively worsens behavior for aggressive agents in the same setting. These findings suggest that reliable long-horizon deployment requires selective, mandate-aware re-grounding based on agent profile and market regime.

32. 【2606.31519】RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference

Link: https://arxiv.org/abs/2606.31519

Authors: Wenhao Li, Jinhao Dong, Hailin Zhang, Wenhang Shi, Wei Lu, Xiaoyong Du

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Long-context Large Language, Large Language Model, Language Model inference, Long-context Large, Large Language

Comments: Accepted by ICML 26

Abstract:Long-context Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static fixed-budget (Top-k) retrieval or rely on proxy scores that are computationally expensive and biased. To address these limitations, we propose RaBitQCache, a novel sparse attention framework that utilizes randomized rotated binary quantization and high-throughput binary-INT4 arithmetic to efficiently estimate attention weights. Our proxy score serves as an unbiased estimator with a proven error bound, enabling adaptive Top-p retrieval that dynamically adjusts the token budget based on actual attention sparsity. We further implement a hardware-aware system with asynchronous pipelining and lazy updates to mask overhead. Evaluations demonstrate that RaBitQCache significantly accelerates inference and reduces memory I/O while preserving generation quality compared to state-of-the-art baselines. Code is available at this https URL.
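The contrast between fixed-budget Top-k and adaptive Top-p retrieval can be made concrete with a small selection routine. This is a sketch of generic Top-p selection over already-estimated attention weights, assuming non-negative weights with a positive sum. It is not the RaBitQCache implementation, and the function name `top_p_indices` is hypothetical. How the paper's quantized proxy score produces those weights is outside its scope.

```python
def top_p_indices(weights, p=0.9):
    """Return the smallest set of token indices whose normalized weights sum to at least p.

    weights: non-negative (estimated) attention weights, one per cached token.
    """
    total = sum(weights)
    # Visit tokens from largest to smallest weight.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += weights[i] / total
        if mass >= p:
            break
    return selected


# Peaked attention: one token already covers 90% of the mass.
print(top_p_indices([0.97, 0.01, 0.01, 0.01]))   # selects a single token
# Flat attention: every token is needed to reach 90%.
print(top_p_indices([0.25, 0.25, 0.25, 0.25]))   # selects all four tokens
```

A Top-k rule with a fixed k would retrieve the same number of tokens in both cases, which is the rigidity the abstract attributes to static budgets.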

33. 【2606.31511】Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

Link: https://arxiv.org/abs/2606.31511

Authors: Mehmet Iscan

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: small frozen code, retraining is infeasible, deployment settings, settings where retraining, routinely asked

Comments: 39 pages, 5 figures, 14 tables

Abstract:In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a generated program is a conjecture and a test-execution violation is an oracle-relative, executable counterexample, so feedback's value should be attributed not to re-exposure to failing code but to whether the conjecture is opened to external, executable criticism. As the third stage of a falsification-centered measurement program, this study builds a placebo-controlled instrument that decomposes the feedback packet against a blind-resampling baseline at matched output-generation budget and against content-free, shape-matched placebos. The contribution is not a new repair algorithm but a reflexive methodology (packet decomposition, placebo mirroring, matched-budget discordant-pair tests, fresh-generation confirmation, executable audits) that makes both the model's program conjecture and the researcher's "feedback content works" claim falsifiable. Across six HumanEval+/MBPP+ cells with three 0.5B-1.5B frozen models, 290 dead task-cell units (no best-of-8 candidate passing the public tier) were evaluated; the main run produced 7,000 fresh generations and a preregistered follow-up 1,400 more. Blind resampling exceeded bare-code retry by +18 net unlocks (25/7, Holm p=0.0021). Code-plus-facts recovered +18 over bare code (21/3, p=0.00042) and +15 over a generic-bullet placebo (p=0.0041). An instruction-only effect was not distinguishable (+3, p=0.36). Code-plus-facts and blind resampling tied at 26 unlocks each (not equivalence). Six external-controller follow-ups tied a content-free shape placebo. In this regime, falsification helped not as vocabulary or self-critique, but as comparison with external, executable counterexamples.
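The paired counts in this abstract (for example 25/7) read as discordant pairs: units unlocked by one condition but not the other. One common way to test such counts is the exact McNemar (sign) test, sketched below. Whether the paper uses exactly this test is an assumption, the function name `exact_mcnemar_p` is illustrative, and the Holm multiple-comparison adjustment mentioned in the abstract is not applied here.

```python
from math import comb


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact p-value for discordant-pair counts b and c.

    Under the null hypothesis, each discordant pair favors either condition
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Lower-tail probability P(X <= k), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Unadjusted p-value for 25 versus 7 discordant pairs.
print(exact_mcnemar_p(25, 7))
```

Concordant units (both conditions succeed, or both fail) do not enter the statistic, which is why the test needs only the two discordant counts.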

34. 【2606.31508】Building an ASR Solution for Training and Assessing Children's Reading

Link: https://arxiv.org/abs/2606.31508

Authors: Yacouba Diarra, Nouhoum Souleymane Coulibaly, Mamadou Dembele, Aymane Dembele, Michael Leventhal

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Automatic speech recognition, African languages, reproducible literacy assessment, reading remains underdeveloped, Automatic speech

Comments: 5 pages, 2 figures

Abstract:Automatic speech recognition for children's reading remains underdeveloped for most African languages, including Bambara, despite its potential value for reproducible literacy assessment. We present an open-source system for assessing children's reading in Bambara, developed through an end-to-end process linking field data collection, benchmark construction, model adaptation, a reading application, and classroom validation. A mobile collection and assessment app was used to collect 55 hours of raw reading speech from 60 children, from which we construct a public benchmark for Bambara child-reading assessment. Fine-tuning experiments compare Soloni, a Bambara-adapted Fast-Conformer ASR framework with TDT and CTC decoders, with QuartzNet, a compact convolutional ASR architecture. The best Soloni model reduces WER from 0.42 to 0.22 and CER from 0.15 to 0.08, substantially outperforming QuartzNet on the isolated benchmark. The experiments further show that repeated readings of the same texts provide architecture-dependent benefits: they substantially improve QuartzNet but add only marginal gains for Soloni, while SpecAugment regulates training without exceeding the best unaugmented configuration. Disaggregated analysis identifies children under 10 as the main source of residual errors, motivating targeted collection from younger readers. Ten classroom trials supported continued use of the application.
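
WER and CER, the two metrics reported above, are edit distances normalised by reference length. The following is a generic sketch of the standard definitions rather than the authors' evaluation script; the text normalisation (plain whitespace splitting) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.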

35. 【2606.31484】Fork-Think with Confidence

Link: https://arxiv.org/abs/2606.31484

Authors: Zena Al-Khalili, Rafi Hakim, Dietrich Klakow, Ji-Ung Lee

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: enjoyed great success, boosting LLM performance, enjoyed great, great success, success for boosting

Comments:

Abstract:Parallel thinking has enjoyed great success in boosting LLM performance on reasoning tasks without the need for any re-training. However, existing methods follow a think-first-then-decide paradigm, i.e., they first sample multiple reasoning paths, which inevitably leads to overgeneration, then prune or stop unnecessary paths to compensate. In contrast, decide-first-then-think, i.e., first identifying points that are likely to lead to desirable generations, has been underexplored so far. Following this paradigm, we propose Fork-think with confidence, which first identifies forking points using model confidence in a single seeding path, then triggers thinking, sampling multiple continuations and aggregating them for the final response. Our experiments across three models and three reasoning benchmarks show that Fork-think reduces token consumption by up to 30% and run-time by up to 57%, while performing comparably to or better than parallel thinking. Our analysis reveals that Fork-think is able to identify forking points that are meaningful with respect to the downstream task and that sampling at later positions can lead to substantially better generations. Finally, we demonstrate how combining Fork-think with existing mechanisms such as early stopping and weighted voting can further boost the performance and perform comparably to existing state-of-the-art methods, without requiring any warm-up or offline training. Our results establish pre-determined forking as a promising research direction for efficient LLM reasoning.
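
The decide-first-then-think idea can be illustrated with a small sketch. The abstract does not give the selection rule, so everything here is an assumption: forking happens where the seeding path's per-token confidence falls below a threshold, and continuations are aggregated by a plain majority vote. The function names, the threshold and the fork budget are all hypothetical.

```python
from collections import Counter


def find_fork_points(token_confidences, threshold=0.5, max_forks=3):
    """Pick positions in a single seeding path where confidence is low.

    Assumption: low confidence marks a useful forking point; the paper's
    actual criterion may differ.
    """
    candidates = [(conf, pos) for pos, conf in enumerate(token_confidences)
                  if conf < threshold]
    candidates.sort()  # least confident positions first
    return sorted(pos for _, pos in candidates[:max_forks])


def majority_vote(answers):
    """Aggregate the final answers of the sampled continuations."""
    return Counter(answers).most_common(1)[0][0]


# Toy seeding path: one confidence value per generated token.
confidences = [0.9, 0.3, 0.8, 0.1, 0.45]
print(find_fork_points(confidences, threshold=0.5, max_forks=2))
print(majority_vote(["42", "41", "42"]))
```

Sampling would start only at the returned positions, so tokens before the first fork are generated once instead of once per parallel path.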

36. 【2606.31464】Team MKC at CLPsych 2026: Capturing and Characterizing Mental Health Changes through Social Media Timeline Dynamics

Link: https://arxiv.org/abs/2606.31464

Authors: Kyomin Hwang, Hyeonjin Kim, Hyunho Lee, Nojun Kwak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, including Artificial Intelligence, Language Models, Artificial Intelligence, Large Language

Comments:

Abstract:Recent advances in Large Language Models (LLMs) have motivated their adoption across a wide range of domains, including Artificial Intelligence (AI) for mental health. Given the growing prevalence of mental health disorders worldwide and the limited accessibility of professional care, there is an increasing demand for scalable computational approaches that can assist in early detection and continuous monitoring of psychological well-being. In this area, ongoing efforts have focused on curating domain-specific datasets and leveraging them to develop LLMs capable of supporting holistic mental health analysis. In line with this direction, we propose an LLM-based pipeline for comprehensive mental health analysis over sequentially ordered user posts, as part of the CLPsych shared task. Our pipeline offers a unified framework that jointly enables post-level assessment and user-level temporal modeling.

37. 【2606.31446】Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

Link: https://arxiv.org/abs/2606.31446

Authors: Stefan Larson, Attila Nagy, Sam Desai, Cyrus Desai, Nicole C. Lima, Yixin Yuan, Siddharth Betala, Kaushal K. Prajapati, Jamiu T. Suleiman, Sharad Duwal, Kevin Leach

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: benchmarking document classifiers, RVL-CDIP, test-train overlap, popular dataset, document classifiers

Comments: DocEng 2026

Abstract:RVL-CDIP is a popular dataset for benchmarking document classifiers. However, the dataset contains ample amounts of label errors as well as non-trivial amounts of test-train overlap, both of which may impact model performance metrics. In this paper, we address these two problems by (1) finding and fixing label errors, and (2) detecting and addressing test-train overlap. We produce several variations of RVL-CDIP with label error and test-train overlap fixes, and benchmark document classification performance on these new RVL-CDIP variations. Our rigorous analysis of RVL-CDIP finds that the corpus contains 12% label error and approximately 35% test-train duplication. Remediation sees improvements in classification accuracy when errors are removed, but sees decreases in accuracy when duplicates are removed. We additionally evaluate models on RVL-CDIP-N, an out-of-distribution benchmark, finding that training on error-corrected data substantially improves OOD generalization, with supervised models gaining an average of 8.1 percentage points in accuracy and improvements as large as 14 percentage points.
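
Test-train overlap of the kind measured above can be estimated, in its simplest form, by hashing a normalised view of every document and intersecting the two splits. The abstract does not describe the authors' detection method, and RVL-CDIP consists of document images, so this is only a sketch under a strong assumption: each document is represented by extracted text, and only exact matches after normalisation count as duplicates. All names are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalised version of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_rate(train_docs, test_docs):
    """Return the share of test documents that also appear in train."""
    train_hashes = {fingerprint(doc) for doc in train_docs}
    duplicated = [doc for doc in test_docs if fingerprint(doc) in train_hashes]
    return len(duplicated) / len(test_docs), duplicated


# Toy documents, not taken from the dataset.
train = ["Invoice  No. 12", "Memo"]
test = ["invoice no. 12", "Letter"]
rate, duplicates = overlap_rate(train, test)
print(rate, duplicates)
```

Exact hashing misses near-duplicates such as rescans of the same page, so a real audit would need a fuzzier similarity measure.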

38. 【2606.31435】CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

Link: https://arxiv.org/abs/2606.31435

Authors: Yuchen Huang, Xiang Li, Zhenqing Ling, Sijia Li, Qianli Shen, Daoyuan Chen, Yi R. Fung, Yaliang Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: involves executing multi-step, evolving text states, refinement involves executing, executing multi-step recipes, processing operators determine

Comments: 29 pages, 20 figures. Corresponding authors: Daoyuan Chen and Yi R. Fung

Abstract:Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or entangle it with code and tool execution, it remains unclear whether LLMs can directly and faithfully execute these compositional, order-sensitive data refinement recipes. To fill this gap, we introduce CDR-Bench, a comprehensive benchmark featuring 3,462 high-quality tasks spanning four real-world data refinement domains and 29 distinct operators. Our benchmark evaluates models across atomic, order-agnostic, and order-sensitive settings, leveraging deterministic reference outputs to enable exact evaluation. Experiments on 10+ state-of-the-art LLMs reveal consistent failure patterns: performance degrades sharply in compositional settings, and order-sensitive recipe success collapses. These findings underline that current LLMs lack the procedural faithfulness required for reliable compositional data refinement.
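
To see why execution order matters for such recipes, consider a toy pipeline in which the same two operators give different results depending on their order. The operators below are hypothetical illustrations and are not among the benchmark's 29 operators; the deterministic reference output is what makes exact-match scoring possible.

```python
from functools import reduce


def lowercase(text: str) -> str:
    return text.lower()


def drop_short_words(text: str) -> str:
    """Remove words of one or two characters."""
    return " ".join(word for word in text.split() if len(word) > 2)


def keep_first_three_words(text: str) -> str:
    return " ".join(text.split()[:3])


def run_recipe(text: str, operators) -> str:
    """Apply the operators left to right to the evolving text state."""
    return reduce(lambda state, operator: operator(state), operators, text)


source = "An old Very Noisy Web Page"
filter_then_truncate = run_recipe(source, [drop_short_words, keep_first_three_words])
truncate_then_filter = run_recipe(source, [keep_first_three_words, drop_short_words])

# Exact evaluation: a model output counts only if it equals the reference.
reference = filter_then_truncate
print(filter_then_truncate == reference, truncate_then_filter == reference)
```

By contrast, `lowercase` commutes with both other operators, so a recipe that only swaps its position is order-agnostic.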

39. 【2606.31432】Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

Link: https://arxiv.org/abs/2606.31432

Authors: Yining Huang

Categories: Computation and Language (cs.CL)

Keywords: heterogeneous knowledge domains, requires parameter-efficient adaptation, answering requires parameter-efficient, multiple-choice question answering, Medical multiple-choice question

Comments:

Abstract:Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a nursing-action item may require different low-rank updates, while some recall items should preserve the base model's representation with only mild adapter intervention. We propose BiRG-LoRA, a single-adapter rank-gated LoRA method for medical question answering. BiRG-LoRA keeps one LoRA module per target layer but makes its rank dimension input-conditioned: for each question, a biaxial gate combines hidden semantic evidence with specialty/profession priors, clinical-operation priors, and their interaction to select a sparse top-$k$ subset of rank atoms. A scalar injection coefficient further controls the strength of the selected adapter update. Under a matched Qwen3-8B CMB-source protocol, BiRG-LoRA achieves the highest four-benchmark macro-average accuracy among trainable PEFT baselines and matched routing controls: 69.31% averaged over CMB, CMExam, MedQA, and MedMCQA. It improves over MoELoRA by 0.89 percentage points while using 28.1% fewer trainable parameters; a paired, benchmark-stratified bootstrap over final predictions gives a 95% confidence interval of [0.42, 1.37] for this macro-average gain. Basic controls show that BiRG-LoRA also improves over vanilla LoRA r16 and active-rank-matched LoRA r4 by 0.83 macro points, and an evaluation-time weak-axis perturbation check suggests that performance is not brittle to moderate tag noise. The results support a bounded claim: clinically structured rank allocation improves cross-benchmark medical QA under a matched single-seed protocol, while training-seed variance remains future work.
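
The rank-gating mechanism described above can be sketched as a LoRA update in which only the top-k rank dimensions are active for a given input and a scalar scales the result. This is a simplified illustration, not BiRG-LoRA itself: the gate scores are passed in directly instead of being computed by the biaxial gate, and all function names and shapes are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def top_k_mask(scores, k):
    """Binary mask keeping the k highest-scoring rank atoms."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def rank_gated_lora_update(x, A, B, gate_scores, k, strength):
    """Compute strength * B (mask * (A x)) for one input vector.

    A has shape (rank, d_in) and B has shape (d_out, rank); gate_scores has
    one entry per rank atom (hypothetical stand-in for the learned gate).
    """
    z = matvec(A, x)                                  # project into rank space
    mask = top_k_mask(gate_scores, k)                 # sparse rank selection
    gated = [m * value for m, value in zip(mask, z)]
    return [strength * value for value in matvec(B, gated)]


A = [[1, 0], [0, 1], [1, 1]]   # rank 3, input dimension 2
B = [[1, 0, 0], [0, 1, 1]]     # output dimension 2, rank 3
update = rank_gated_lora_update([1, 2], A, B, gate_scores=[0.1, 0.9, 0.5],
                                k=2, strength=0.5)
print(update)
```

Setting `strength` close to zero leaves the base model's representation almost untouched, which corresponds to the mild adapter intervention the abstract mentions for recall items.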

40. 【2606.31411】Linguistic Bias Mitigation for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

Link: https://arxiv.org/abs/2606.31411

Authors: Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: generative speech technology, Rapid advancements, voice biometrics, advancements in generative, generative speech

Comments:

Abstract:Rapid advancements in generative speech technology have compromised the reliability of voice biometrics. While current spoofing detectors excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. We show that this can be due to linguistic bias. A reliance on linguistic cues observed in training data can then compromise robustness to cross-data. We propose a linguistic-invariant spoofing detection framework utilizing teacher-student adversarial learning. The linguistic-aware teacher model, pre-trained on linguistic content of an external dataset, guides the student detector via gradient reversal to minimize the linguistic information. To prevent the inadvertent removal of non-linguistic cues, we incorporate a Variational Information Bottleneck to enable suppression of principal cues. Across nine DF Arena datasets, our method achieves up to a 36.2% relative reduction in the EER compared to the baseline.

41. 【2606.31407】Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

Link: https://arxiv.org/abs/2606.31407

Authors: Ta Duc Huy, Trang Nguyen, Townim Chowdhury, Ankit Yadav, Minh-Son To, Zhibin Liao, Johan W. Verjans, Vu Minh Hieu Phan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: visually ambiguous inputs, produce confident answers, biased predictions, produce confident, visually ambiguous

Comments: Accepted at ECCV2026

Abstract:Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows that overconfident visual embeddings suppress output diversity under stochastic decoding, causing SE to underestimate uncertainty in such cases. Recent methods instead probe output diversity through input perturbations, including textual paraphrasing or joint text-image perturbations, and show improved performance. We study these approaches and reveal that the resulting variability is often dominated by textual changes rather than visual evidence, causing uncertainty estimates to reflect prompt sensitivity rather than visual ambiguity. We therefore propose Visual Semantic Entropy (VSE), which perturbs only the image to probe nearby visual variations while keeping the text query fixed. VSE measures uncertainty by clustering generated answers into semantic prototypes and computing the mass-weighted dispersion among them. Extensive evaluation across five modern vision-language models and five diverse VQA benchmarks demonstrates that VSE effectively captures visual ambiguity, establishing a new state-of-the-art for VLM uncertainty estimation.

42. 【2606.31371】Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?

Link: https://arxiv.org/abs/2606.31371

Authors: Zewen Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language model, learned strategy distribution, agent learned strategy, systematic evaluator biases, evaluator biases propagate

Comments: 7 pages, 2 tables

Abstract:When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribution - a phenomenon termed evaluator preference coupling. Prior work has documented this coupling and established a diagnostic framework (EPC) to measure it, but has not investigated whether calibration techniques can mitigate the effect. We present the first study of evaluator calibration as mitigation: applying probability calibration to the evaluator's pairwise judgments to reduce spurious preference propagation. In a controlled within-subjects experiment (N=5) comparing standard binary TTRL (win/loss) with confidence-calibrated TTRL (probability-weighted updates) using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, we find that calibration reduces the coupling coefficient gamma by 20-49% and Jensen-Shannon divergence by 45-67%. A symmetric-LR control confirms the effect is not due to reduced update asymmetry. We release the calibrated TTRL protocol and recommend it as a lightweight mitigation for LLM-as-judge deployment pipelines.
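
Jensen-Shannon divergence, one of the two quantities reported above, compares two probability distributions symmetrically. The sketch below uses the textbook definition with base-2 logarithms; which distributions the paper compares and which logarithm base it uses are not stated in the abstract, so the example inputs are purely illustrative.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical strategy distributions for two agents.
strategy_a = [0.7, 0.2, 0.1]
strategy_b = [0.4, 0.4, 0.2]
print(js_divergence(strategy_a, strategy_b))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), which makes percentage reductions straightforward to interpret.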

43. 【2606.31315】BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Link: https://arxiv.org/abs/2606.31315

Authors: Hao Zhang, Yiming Hu, Yong Wang, Mingqiao Mo, Xin Xiao, Xiangxiang Chu

Categories: Computation and Language (cs.CL)

Keywords: enabling lossless acceleration, generate candidate tokens, Speculative decoding accelerates, lightweight draft model, optimal block size

Comments: 16 pages

Abstract:Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, which are then verified by the target model, enabling lossless acceleration. Recently, diffusion-based speculative decoding further improves parallelism by generating multiple tokens per forward pass via block-level diffusion, achieving state-of-the-art (SOTA) performance. However, existing methods adopt a fixed inference block size and assume a uniform optimal decoding strategy across all inputs. In this paper, we show that this assumption is suboptimal, as the optimal block size varies across samples and plays a critical role in speculative decoding performance. Moreover, these values exhibit a clear local structure, concentrating around the training block size, which reduces the problem to a low-dimensional and structured decision space. Based on these insights, we propose BlockPilot, a sample-adaptive policy that predicts the optimal block size from the prefilling representation. Specifically, we formulate block size selection as a lightweight policy learning problem and propose an instance-adaptive decision mechanism that predicts the optimal block size based on the representation of the prefilling stage. The prediction is performed only once after prefilling, allowing for seamless integration. Extensive experiments demonstrate that our method is plug-and-play, introduces minimal overhead, and consistently improves efficiency, achieving an acceptance length of 5.92 and a 4.20$\times$ speedup on Qwen3-4B under temperature $T=1$.

44. 【2606.31310】LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment

Link: https://arxiv.org/abs/2606.31310

Authors: Hong-Yun Lin, Fu-An Chao, Bi-Cheng Yan, Berlin Chen

Categories: Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Spoken Language Assessment, Multimodal Large Language, Multimodal Large, Large Language Models, Fueled by increasing

Comments:

Abstract:Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the intrinsic ordinal structure of language acquisition. This paper works around the necessity of large-scale MLLMs by introducing Latent Ordinal Prototype Alignment (LOPA) for SLA, a prototype-based regularizer that enforces an ordinal geometric prior directly on the latent space. Coupled with Semantic-Anchored Layer Routing (SALR), which adaptively harvests multi-depth representations from a frozen Whisper encoder, our framework achieves an RMSE of 0.361. This performance rivals billion-parameter systems without the need for LLM-based fine-tuning. Further analysis reveals that SALR's synergy with LOPA offers interpretable, criterion-aligned preferences, thereby supporting an efficient and ordinal-aware modeling alternative to current scaling-centric models for SLA.

45. 【2606.31307】When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue

Link: https://arxiv.org/abs/2606.31307

Authors: Mohammad Alijanpour Shalmani, Alale Rezvani Boroujeni, Jiann Shiun Yuan

Categories: Computation and Language (cs.CL)

Keywords: Large language models, surface mismatched information, Large language, database calls fail, backend database calls

Comments: Accepted at SIGDIAL 2026

Abstract:Large language models used in task-oriented dialogue often produce fluent but unsafe responses when backend database calls fail, return empty results, or surface mismatched information, inventing venues, confirmations, or booking details not grounded in the database. We study a lightweight prompting-based recovery approach that improves robustness without retraining or additional model calls. We compare three response strategies, including a guided recovery prompt conditioned on structured database status, across six open-weight model families (DeepSeek-R1, Gemma-2, Llama-3, Mistral, Phi-3, and Qwen-2.5) and four database conditions: empty result, wrong-domain retrieval, API error, and clean retrieval. Using fault-injected benchmarks built on two structurally different datasets, MultiWOZ 2.2 (5 domains) and SGD (20 domains), we find that naive agents hallucinate on 30.5% of failure turns on MultiWOZ and 20.9% on SGD. Our Guided-Retry strategy reduces hallucination by 50% on MultiWOZ (30.5 to 15.3%) and by 42% on SGD (20.9 to 12.2%) without retraining. However, residual hallucination remains substantial (6-37% across models), with wrong-domain failures the hardest case. Results are consistent across both datasets and all six model families, and human annotation shows substantial agreement while supporting the validity of the automatic commitment-safety metric.

46. 【2606.31272】The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills

Link: https://arxiv.org/abs/2606.31272

Authors: Hongliang Liu, Yuhao Wu, Tung-Ling Li

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: agents increasingly acquire, tool declarations fetched, agents increasingly, increasingly acquire, acquire and execute

Comments:

Abstract:AI agents increasingly acquire and execute skills at runtime: bundles of prompt instructions, executable code, and tool declarations fetched from marketplaces and other agents. Governing them needs a stable notion of skill identity, yet cryptographic hashing is engineered to destroy the very similarity we need, as a one-character edit scrambles the digest. We present a compact, locality-sensitive fingerprint that embeds each component of a skill and projects it to bits with a multi-bank SimHash, giving a fixed 120-byte signature compared in constant time by Hamming distance. Our central claim is that keeping the fingerprint as a per-component triple (prompt, code, tools), rather than a single score, is what makes it useful: the triple recovers skill-family identity through paraphrase, renaming, refactoring, and controlled code translation when another component remains shared, while independent multilingual reimplementation is not recovered; it also localizes which component carries the reuse. We claim lineage, not behavioral equivalence: identity supplies the structural axis of a registry and leaves safety to behavioral verification. The fingerprint reaches an area under the ROC curve (AUC) of 0.974 (95% CI [0.956, 0.994]) over 4,950 pairwise comparisons while using 77x fewer bits than the embedding it approximates, with ranking preserved in expectation and finite-bit concentration; the per-component split turns one number into relationship classification, families, novelty, and a portable "SkillBOM" for a skill registry. On a 906-skill injection benchmark the fingerprint recognizes injected skills as tampered copies of a known base and localizes the change, but recognition is not trust: it remains, by design, an identity signal complementary to behavioral verification rather than a safety verdict.
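
The core mechanism, projecting an embedding to bits with SimHash and comparing signatures by Hamming distance, can be sketched in a few lines. This is a single-bank toy version with random Gaussian hyperplanes, not the paper's multi-bank construction; the embedding vectors, bit width and function names are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw one random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(embedding, hyperplanes):
    """Pack the sign of each hyperplane projection into an integer."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * e for p, e in zip(plane, embedding))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def fingerprint_skill(components, hyperplanes):
    """Keep one signature per component instead of a single score."""
    return {name: simhash(vec, hyperplanes) for name, vec in components.items()}


def component_distances(fp_a, fp_b):
    return {name: hamming(fp_a[name], fp_b[name]) for name in fp_a}


planes = make_hyperplanes(dim=4, n_bits=64)
# Hypothetical component embeddings for a base skill and an edited copy.
base = {"prompt": [0.2, 0.1, -0.4, 0.9], "code": [0.5, -0.3, 0.8, 0.1],
        "tools": [-0.7, 0.6, 0.2, 0.3]}
edited = dict(base, code=[-0.5, 0.3, -0.8, -0.1])  # only the code component changed
print(component_distances(fingerprint_skill(base, planes),
                          fingerprint_skill(edited, planes)))
```

Keeping the distances per component shows which part of a skill changed, which a single pooled score would hide.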

47. 【2606.31270】Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

Link: https://arxiv.org/abs/2606.31270

Authors: Xueqiao Sun, Xiaohan Wang, Ludwig Schmidt, Serena Yeung-Levy, Yuhui Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: leverage multimodal large, multimodal large language, attracted significant attention, large language models, Computer-use agents

Comments: Published in ECCV 2026

Abstract:Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these agents is collecting large-scale, high-quality trajectories. The standard approach generates synthetic data through a self-improving loop: an agent is placed in a verifiable environment and iteratively fine-tuned on its successful trajectories. Despite its effectiveness, this paradigm exploits only successful trajectories and discards the failed ones, even though failures carry rich information about a model's weaknesses. In this work, we explore a complementary failure-driven self-improvement loop, a data-centric paradigm that turns failed trajectories into agent improvements. Specifically, we employ an LLM to diagnose failure modes, propose inference-time solutions, and generate code patches -- lightly verified by humans -- that upgrade the agent. We validate this approach with the state-of-the-art OpenCUA-72B model on the OSWorld benchmark, improving the success rate from 42.3% to 48.9%, a gain of 6.6 percentage points, without any additional training cost and with only modest inference overhead. Our results demonstrate that failure-driven self-improvement is a viable complement to success-based pipelines, enabling more efficient agent improvement.

48. 【2606.31250】Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law

Link: https://arxiv.org/abs/2606.31250

Authors: Noah Scharrenberg, Chang Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, web-scale corpora generate, corpora generate output, Large language, safeguards focus narrowly

Comments:

Abstract:Large language models (LLMs) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standard: substantial similarity, which extends to stylistic choices, narrative structure, and creative elaboration. This mismatch between what current methods detect and what the law protects leaves a significant compliance gap. We introduce PSALM, an LLM-as-a-judge framework that operationalises EU copyright doctrine through ten evaluators assessing computational overlap, stylistic dimensions (writing style, narrative voice), content dimensions (character, plot, scene, world building), and statutory exceptions (parody, pastiche, quotation, scènes à faire). Applying PSALM to Llama 3.2 models fine-tuned on translated historical Dutch literary works, we find that: 1) instruction-tuned models exhibit non-trivial baseline stylistic similarity prior to corpus exposure; 2) fine-tuning induces systematic stylistic appropriation across all infringement-relevant dimensions, extending beyond verbatim memorisation to abstract narrative patterns; 3) Negative Preference Optimisation unlearning substantially reduces similarity but leaves detectable residual stylistic patterns. These findings indicate that safeguards targeting literal copying alone are insufficient to mitigate broader copyright risks. PSALM provides infrastructure for auditable, legally informed compliance evaluation, though the relationship between automated similarity scores and infringement determinations requires validation by legal experts. This work bridges qualitative legal standards and quantitative technical measurement, exposing fundamental tensions between generative AI and EU intellectual property law.

49. 【2606.31213】Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas?

Link: https://arxiv.org/abs/2606.31213

Authors: Jongchan Choi, Nari Yang, Sung Soo Park, Jaemin Cho, Han Seoyoung, Haerin Shin, Jun-Hyung Park

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, language models, large language, increasingly deployed, dilemmas

Comments: 23 pages. Preprint

Abstract:As large language models (LLMs) are increasingly deployed as moral advisors and agents, they need to address dilemmas between two competing values. However, existing research on LLMs with moral dilemmas overlooks a central aspect of human moral cognition: the ability to imagine alternatives that move beyond the given options. We introduce MoralAltDataset, a dataset of 307 moral dilemmas spanning narrative Advisor dilemmas and AI-facing Agent dilemmas, each augmented with compromise and reframed alternatives. We first examine whether humans and LLMs shift their judgments when such alternatives are introduced. Across 15 LLMs, we find that compromise alternatives are often preferred over either original option, substantially reshaping moral choice. We then evaluate the quality of LLM-generated alternatives against human-authored ones using pairwise preference and expert-based criteria. Results show that LLM-generated alternatives are often preferred and better satisfy fine-grained structural and ethical criteria, while revealing trade-offs between structural quality and practical feasibility.

50. 【2606.31186】Gated Multi-Graph Fusion via Graph Attention Networks for Alzheimer's Disease Detection

Link: https://arxiv.org/abs/2606.31186

Authors: Jinyu Li, Xiao Wei, Bin Wen, Kai Li, Yuqin Lin, Xiaobao Wang, Longbiao Wang, Jianwu Dang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Alzheimer Disease, vital non-invasive biomarker, systems overlook non-linear, overlook non-linear structural, non-linear structural disruptions

Comments: 5 pages, 1 figure, 2 tables; accepted at Interspeech 2026

Abstract:Spontaneous speech is a vital non-invasive biomarker for Alzheimer's Disease (AD), yet many systems overlook non-linear structural disruptions and clinical heterogeneity in pathological language. We propose a Multi-View Gated Graph Attention Network that transcribes audio via Automatic Speech Recognition (ASR) to construct semantic, dependency, and co-occurrence graphs, characterizing speech through a "content-structure-flow" framework. Notably, the co-occurrence graph leverages Pointwise Mutual Information (PMI) from a normative corpus to quantify narrative logic and linguistic deviation. To address symptomatic diversity, an adaptive gated fusion mechanism dynamically integrates these views. Evaluated on the ADReSSo dataset, our model achieves 90.00% accuracy. Ablation results confirm that the PMI-based graph and heterogeneity-aware gating are essential for robust classification across diverse clinical populations. Our source code is publicly available at this https URL.
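
Pointwise Mutual Information, which weights the co-occurrence graph above, measures how much more often two words appear together than independence would predict. The sketch below computes sentence-level PMI over a toy corpus; the co-occurrence window, the smoothing and the normative corpus used in the paper are not given in the abstract, so these choices are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pmi_table(sentences):
    """PMI for every word pair that co-occurs in at least one sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(pair)
                           for pair in combinations(sorted(words), 2))
    n = len(sentences)
    table = {}
    for pair, count in pair_counts.items():
        first, second = tuple(pair)
        p_joint = count / n
        p_independent = (word_counts[first] / n) * (word_counts[second] / n)
        table[pair] = log2(p_joint / p_independent)
    return table


# Toy corpus, invented for illustration only.
corpus = ["the boy takes cookies", "the girl laughs", "the boy falls"]
pmi = pmi_table(corpus)
print(pmi[frozenset({"boy", "cookies"})])
```

Pairs with high PMI in a normative corpus form strong edges, so a transcript that rarely follows those edges would register as a deviation.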

51. 【2606.31179】HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

Link: https://arxiv.org/abs/2606.31179

Authors: Qianchu Liu, Sheng Zhang, Guanghui Qin, Jeya Maria Jose Valanarasu, Maximilian Rokuss, Mingyu Lu, Timothy Ossowski, Juan Manuel Zambrano Chaves, Cliff Wong, Peniel Argaw, Yashna Hasija, Mu Wei, Wen-wai Yim, Qin Liu, Zilin Jing, Jason Entenmann, Naoto Usuyama, Tristan Naumann, Hoifung Poon

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world healthcare applications, rigorous and holistic, increasingly capable, holistic evaluation, evaluation is essential

Comments:

Abstract:As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite of 54 agentic healthcare tasks across 7 categories each with its unique environment. The benchmark suite spans diverse workflows throughout the patient journey and a broad range of modalities. Each task is designed to replicate an end-to-end clinical workflow: given minimal instructions, an agent must explore raw healthcare data, operate within a complex environment, and execute multi-step solutions that go beyond naive prompting. A final task success rate is reported to provide a single, interpretable metric for HealthAgentBench overall performance for each agent. Evaluating frontier agents on HealthAgentBench, we find that overall task success rate remains low, underscoring the difficulty of the suite. The strongest and the most cost effective agent, Codex GPT-5.5, achieves only approximately 42% success rate. Beyond aggregate performance, HealthAgentBench reveals nuanced strengths and weaknesses across task categories. Frontier agents show promise in automatically developing research modeling pipelines over EHR data, but medical imaging remains especially challenging, particularly for Claude Code models, while Codex GPT-5.5 shows emerging capability. Tasks that combine large search spaces with compositional reasoning requirements remain difficult for all current agents. Together, these results suggest that HealthAgentBench provides a challenging and realistic benchmark with substantial room for future progress. We release our benchmark at this https URL.

52. 【2606.31166】TAG-DLM: Diffusion Language Models for Text-Attributed Graph Learning

Link: https://arxiv.org/abs/2606.31166

Authors: Lingjie Chen,Yuanchen Bei,Haobo Xu,Yanjun Zhao,Yuzhong Chen,Hanghang Tong

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: natural language description, language model, Text-attributed graphs, diffusion language model, carries a natural

Comments:

Abstract:Text-attributed graphs (TAGs), where each node carries a natural language description, require models to jointly reason over text and graph topology. Existing approaches often handle the two modalities separately: graph neural networks operate on shallow text features, while hybrids of LLMs and graphs use the language model mainly as a text encoder and delegate structure learning to a separate graph module. We propose a method that unifies textual reasoning and graph message passing within a masked diffusion language model, a language model with bidirectional attention and generative decoding. For each graph instance, the method linearises a sampled local neighbourhood into a token sequence and injects graph structure through a topology attention mask, which realises message passing over the graph. Because the diffusion language model can both interpret and generate text, the method adapts to different tasks simply by changing the prompt, supporting node classification, link prediction, and cross-dataset transfer with no target-specific fine-tuning. Experiments show that the method outperforms graph neural networks, graph transformers, and LLM-based baselines on all three TAG benchmarks across two tasks, improving over the strongest baseline by up to 3.9 points.

53. 【2606.31163】ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries

Link: https://arxiv.org/abs/2606.31163

Authors: Abhishek Dey

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: regulated industries operate, language models deployed, Large language models, deployed in regulated, regulated industries

Comments:

Abstract:Large language models deployed in regulated industries operate under two constraints: compliance enforcement and cost efficiency. Personally identifiable information (PII) in user queries can reach model endpoints before the system determines whether that data should leave its jurisdictional boundary. Serving all queries through a single large model consumes full GPU capacity regardless of query complexity while offering no mechanism for geographic routing. Mixture-of-Experts architectures do not address this: routing occurs between expert layers within the model after data has already arrived at the endpoint, with all experts loaded in memory regardless of query complexity. We propose a classifier-gated routing architecture that enforces compliance by design. A trained encoder classifier sits before any decoder inference, evaluating each query for complexity and data sensitivity, then routing it to an appropriately sized dense model in the appropriate geographic location. PII-containing queries route to local endpoints before any LLM computation begins, making data residency violations structurally impossible. Simple queries reach small, fast models at a fraction of the cost. Our evaluation on 600 queries demonstrates 39% median latency reduction, 33-52% cost savings depending on query distribution, and generation throughput of 122-200 tokens/second versus 50-64 for the baseline. The encoder classifier achieves 99.2% accuracy with near-perfect PII recall at 7ms inference overhead, establishing pre-inference classification as a practical path to compliance-by-design LLM deployment.
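
The pre-inference routing idea in this abstract can be illustrated with a small sketch. It is not the paper's implementation: the regex-based `classify` stub, the word-count complexity heuristic, and the endpoint and model labels are all hypothetical stand-ins for the trained encoder classifier and the real deployment topology.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Hypothetical PII patterns; the paper describes a trained encoder classifier instead.
_PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]


@dataclass(frozen=True)
class Route:
    endpoint: str  # hypothetical label: "local" or "remote"
    model: str     # hypothetical label: "small-dense" or "large-dense"


def classify(query: str) -> Tuple[bool, str]:
    """Toy classifier: returns (contains_pii, complexity)."""
    contains_pii = any(p.search(query) for p in _PII_PATTERNS)
    # Assumption: more than 30 words counts as a complex query.
    complexity = "complex" if len(query.split()) > 30 else "simple"
    return contains_pii, complexity


def route(query: str) -> Route:
    """Choose endpoint and model tier before any LLM computation runs."""
    contains_pii, complexity = classify(query)
    endpoint = "local" if contains_pii else "remote"
    model = "large-dense" if complexity == "complex" else "small-dense"
    return Route(endpoint=endpoint, model=model)
```

For example, `route("My email is a@b.com, reset my password")` selects the local endpoint, because the sensitivity check runs before any model is chosen.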

54. 【2606.31148】PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding

Link: https://arxiv.org/abs/2606.31148

Authors: Duc Cao Dinh,Khai Le-Duc,Florent Draye,Chris Ngo,Terry Jingchen Zhang,Bernhard Schölkopf,Zhijing Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: localize target objects, Visual Grounding, aims to localize, natural language descriptions, localize target

Comments: Preprint

Abstract:3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions. Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high computational cost, especially in cluttered environments. We observe that many referential expressions rely on local spatial context and often correspond to restricted spatial regions rather than the full scene. Motivated by this insight, we propose PruneGround, an effective plug-and-play framework for 3DVG built upon three key components. First, we introduce Language-Guided Spatial Pruning (LGSP), which leverages a frozen Vision Language Model (VLM) to identify language-relevant regions, thereby reducing spatial computation and grounding candidates in the narrower search space. Second, we propose MultiView-Conditioned Description Reformulation (MCDR), which decomposes complex expressions into simplified target-anchor relations and augments missing spatial cues through multi-view reasoning. Finally, we propose LLM-Grounder, which repurposes a detection-pretrained spatial LLM into a language-conditioned grounding model by aligning point cloud and linguistic representations within the pruned region. Extensive experiments on the three most popular point cloud benchmarks demonstrate that our method achieves state-of-the-art results on all three ScanRefer settings and on 9 out of 10 Nr3D/Sr3D settings. Code and models are publicly available: this https URL

55. 【2606.31145】SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

Link: https://arxiv.org/abs/2606.31145

Authors: Amirhossein Abaskohi,Giuseppe Carenini,Peter West,Yuhang He

Categories: Computation and Language (cs.CL)

Keywords: Large language models, language models increasingly, models increasingly operate, size grows linearly, Large language

Comments:

Abstract:Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching prohibitively expensive without compression. Existing KV cache compression methods struggle to balance efficiency with faithful context preservation. Token eviction discards information, while semantic grouping fixes compression decisions at prefill time; neither can recover token-level detail from a compressed span once it becomes relevant during generation. As a solution, we propose SeKV, a resolution-adaptive semantic KV cache that organizes context into entropy-guided semantic spans and stores them across a GPU-CPU memory hierarchy without discarding information. Each span keeps a lightweight summary vector on GPU for coarse routing and a low-rank SVD basis on CPU for on-demand token-level reconstruction. A trained zoom-in mechanism selectively expands query-relevant spans during decoding, enabling precise retrieval without materializing the full KV cache on GPU. SeKV enables adaptive token-level reconstruction while keeping the base LLM fully frozen and adding fewer than 0.05% trainable parameters. Across four benchmarks, SeKV improves over the strongest semantic compression baseline by 5.9% on average while reducing GPU memory by 53.3% versus full KV caching at 128K context. Code is available on this https URL.

56. 【2606.31128】UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

Link: https://arxiv.org/abs/2606.31128

Authors: Chuanbo Zhu,Wuyou Zhou,Rongxiu Zhong,Shilei Zhang,Kun Qian,Yike Guo,Wei Xue

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: modify specific portions, aims to modify, modify specific, specific portions, utterance while preserving

Comments:

Abstract:Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion editing as separate tasks, limiting both editing granularity and flexibility. We propose UniSAE, a unified speech attribute editing framework which supports composable speaker, emotion and content editing from sub-phoneme to word level within a single architecture. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, enabling direct phoneme- and sub-phoneme-level editing. For higher-level modifications, an autoregressive content transformer predicts edited DPPG sequences for word-level content editing. The edited sequences are rendered into speech by a diffusion-based acoustic decoder, conditioned on disentangled speaker and emotion representations. Experimental results demonstrate that the proposed unified framework supports precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes within a single framework.

57. 【2606.31112】What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

Link: https://arxiv.org/abs/2606.31112

Authors: Hawau Olamide Toyin,Srinivasan Umesh,Hanan Aldarmaki

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: reported to underperform, ASR, atypical speech, speech, atypical

Comments: 5 pages, 2 figures, accepted at Interspeech 2026

Abstract:ASR systems have often been reported to underperform on atypical speech. A compounding factor is that atypical speech recognition admits two valid transcription references, depending on context and use case: verbatim (the speech actually produced, including repetitions and prolongations) and intended (the canonical form of the text with disfluencies removed). Most ASR evaluations conflate this duality into a single ground truth and reward systems that delete disfluencies, ignoring verbatim faithfulness. We benchmark 11 ASR models from encoder-decoder, CTC, and transducer families using both verbatim and intended references, with atypical stuttered speech as a case study. Our quantitative assessment underlines the disparity in model performance and rankings between the two transcript styles. Through this analysis, we highlight the importance of selecting a suitable transcription reference for valid model selection depending on the use case, particularly for atypical ASR.
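
To make the dual-reference point concrete, the sketch below scores one hypothesis against a verbatim and an intended reference using a standard word error rate (word-level edit distance divided by reference length). The sentences are made up for illustration, and the paper's exact scoring setup is not given in the abstract, so this is only an assumed baseline metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (invented) transcripts of one stuttered utterance.
verbatim = "i i i want want to go home"
intended = "i want to go home"
hypothesis = "i want to go home"  # a system that deletes the repetitions

wer_intended = word_error_rate(intended, hypothesis)
wer_verbatim = word_error_rate(verbatim, hypothesis)
```

Here the same output scores 0.0 against the intended reference but 0.375 against the verbatim one (three deleted words out of eight), which is the kind of ranking shift the abstract describes.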

58. 【2606.31087】When Reranking Hurts: Uncertainty-Based Gating for Few-Shot Reranking

Link: https://arxiv.org/abs/2606.31087

Authors: Orian Dabod,Amir Cohen,Gabriel Stanovsky

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: selection typically assumes, Few-shot selection typically, selection typically, typically assumes, Few-shot selection

Comments:

Abstract:Few-shot selection typically assumes that reranking retrieved examples always improves performance. We challenge this view by identifying that the expensive reranking step can in fact degrade performance. Instead, we propose Training-Free Gated Reranking, which decides whether to rerank the few-shot examples based on the model's uncertainty. Extensive experiments across 8 LLMs, covering 7 NLU datasets and 9 MT domain-language combinations, demonstrate that our approach reduces computational costs by 15%-80% while improving average performance by up to 2%. These findings indicate that higher computational cost does not guarantee better performance, and that reranking is most beneficial when targeted at high-uncertainty instances.
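
A gate of this kind might look like the sketch below. The abstract does not say how uncertainty is measured, so the use of Shannon entropy over the model's predicted label distribution, the `threshold` value, and the `reranker` callable are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def gated_rerank(
    examples: List[str],
    probs: Sequence[float],
    reranker: Callable[[List[str]], List[str]],
    threshold: float = 0.5,  # hypothetical cut-off, would need tuning
) -> List[str]:
    """Rerank only when the model looks uncertain with the retrieved order."""
    if entropy(probs) <= threshold:
        # Confident prediction: skip the expensive reranking step.
        return list(examples)
    return reranker(examples)


# Stand-in reranker that simply reverses the order.
reverse = lambda xs: xs[::-1]
confident = gated_rerank(["a", "b", "c"], [0.98, 0.01, 0.01], reverse)
uncertain = gated_rerank(["a", "b", "c"], [1 / 3, 1 / 3, 1 / 3], reverse)
```

With these inputs the confident case keeps the retrieved order and only the uncertain case pays for reranking.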

59. 【2606.31081】Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021

Link: https://arxiv.org/abs/2606.31081

Authors: Chengzhi Zhang,Liang Tian,Heting Chu

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: twenty-one major LIS, Information Science, research articles published, present study analyzed, machine learning

Comments:

Abstract:The present study analyzed over 26,000 research articles published between 1991 and 2021 in twenty-one major LIS (Library and Information Science) journals, using a machine learning (ML) approach to categorize the research methods used by LIS scholars. The study reports three main findings. Firstly, there has been a shift in research strategy from conceptual research (e.g., "Theoretical approach") to empirical research (e.g., "Interview") in LIS investigations over the past 31 years. Secondly, the research topics explored by LIS scholars during this period have moved from system-centered issues (e.g., "Information retrieval/models and algorithms") to user-centered topics (e.g., "Information services"). Thirdly, the study revealed dynamic relationships between the 18 research topics identified in the study and the 16 research methods commonly adopted in the LIS field. These relationships can be visualized by year and longitudinally via an interactive map created in this study.

60. 【2606.31074】Triospect: A Three-Dimensional Framework for Robust Statistical AI-Generated Text Detection Against Diverse Attacks

Link: https://arxiv.org/abs/2606.31074

Authors: Guangsheng Bao,Lihua Rong,Yanbin Zhao,Xiao Yu,Qiji Zhou,Yue Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Existing AI-generated text, manipulate textual characteristics, AI-generated text detectors, Existing AI-generated, textual characteristics

Comments: TACL final version, 12 pages, 9 figures, and 9 tables

Abstract:Existing AI-generated text detectors are vulnerable to attacks that manipulate textual characteristics. In this study, we propose a novel Triospect Detection Framework by using additional perspectives of content (core ideas) and expression (stylistic elements) within a given text. Experiments on two benchmarks involving 17 attacks, 12 domains, and 17 source models demonstrate that Triospect is robust against these attacks. It improves the strong baseline by a significant margin of 22.3% (AUROC) and 13% (TPR01) on the Humanize-16K after-attack subset, and by 9.1% (AUROC) and 22% (TPR01) on the adversarial RAID. This framework marks a pioneering effort in statistical methods to enhance detection reliability against attacks. We release our data and code at this https URL.

61. 【2606.31069】Building a Multimodal Dataset of Academic Paper for Keyword Extraction

Link: https://arxiv.org/abs/2606.31069

Authors: Jingyu Zhang,Xinyi Yan,Yi Xiang,Yingyi Zhang,Chengzhi Zhang

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: typically relies solely, keyword extraction task, keyword extraction, task typically relies, extraction task typically

Comments:

Abstract:To date, keyword extraction has typically relied solely on textual data. Neglecting the visual and audio features available in the image and audio modalities reduces information richness and overlooks potential correlations, thereby constraining the model's ability to learn representations of the data and the accuracy of its predictions. Furthermore, multimodal datasets for keyword extraction are particularly scarce, which further hinders research on multimodal keyword extraction. Therefore, this study constructs a multimodal dataset of academic papers consisting of 1000 samples, with each sample containing paper text, images, audio, and keywords. Using unsupervised and supervised keyword extraction methods, experiments are conducted on the textual data from papers as well as on text extracted from images and audio. The aim is to investigate how keyword extraction performance differs across modalities and under the fusion of multimodal information. The experimental results indicate that text from different modalities exhibits distinct characteristics in the model. Concatenating paper text, image text, and audio text can effectively enhance keyword extraction performance on academic papers.

62. 【2606.31058】Exploring the relationship between team institutional composition and novelty in academic papers based on fine-grained knowledge entities

Link: https://arxiv.org/abs/2606.31058

Authors: Ziling Chen,Chengzhi Zhang,Heng Zhang,Yi Zhao,Chen Yang,Yang Yang

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: important factor influencing, novelty, academic, important factor, factor influencing

Comments:

Abstract:The composition of author teams is an important factor influencing the novelty of academic papers. However, existing studies have paid limited attention to the role of institutional composition, and most novelty measures remain at a general level, making it difficult to explain the specific sources and types of novelty in papers. Taking the field of natural language processing as an example, this study investigates the relationship between team institutional composition and the fine-grained novelty of academic papers. Author teams are classified into three types: academic institutions, industrial institutions, and mixed academic and industrial institutions. Four types of fine-grained knowledge entities are extracted from full-text papers, including methods, datasets, tools, and metrics. The novelty of papers is then measured based on entity combinations, and pairwise combinations of different entity types are further analyzed to examine their contributions to novel papers. The results show that, in the field of natural language processing, collaboration between industrial and academic institutions is more likely to produce novel papers than purely industrial collaboration. From the perspective of fine-grained knowledge entities, mixed academic and industrial teams pay more attention to the novelty of method-metric combinations, whereas industrial teams pay more attention to the novelty of method-tool combinations. This study reveals the relationship between institutional team composition and paper novelty through fine-grained novelty measurement, providing useful evidence for improving paper quality and promoting industry-academia-research collaboration.

63. 【2606.31055】Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

Link: https://arxiv.org/abs/2606.31055

Authors: Ashish Hallur,Thomas Thebaud,Georgi Tinchev,Venkatesh Ravichandran,Laureano Moro-Velazquez

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: lacks interpretable speech-native, interpretable speech-native measures, advancing rapidly, agents are advancing, speech-native measures

Comments:

Abstract:Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.
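
The percentile-based protocol in this abstract can be sketched as follows. The mid-rank percentile definition and the synthetic reference values are assumptions; the abstract specifies only that a metric is compared with a matched human reference stratum and flagged when it falls outside the 5th-95th percentile range.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def percentile_rank(reference: Sequence[float], value: float) -> float:
    """Mid-rank percentile (0-100) of `value` within the reference sample."""
    if not reference:
        raise ValueError("reference sample must not be empty")
    ordered = sorted(reference)
    below = bisect_left(ordered, value)
    equal = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * equal) / len(ordered)


def out_of_regime(
    reference: Sequence[float],
    value: float,
    low: float = 5.0,
    high: float = 95.0,
) -> bool:
    """Flag a value that falls outside the [low, high] percentile band."""
    rank = percentile_rank(reference, value)
    return rank < low or rank > high


# Synthetic stand-in for one matched reference stratum (e.g. speech rate).
stratum = [float(x) for x in range(1, 101)]
typical_flag = out_of_regime(stratum, 50.0)
extreme_flag = out_of_regime(stratum, 200.0)
```

A two-sided 5th-95th band flags about 10% of values drawn from the reference distribution itself, which matches the nominal rate the abstract mentions.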

64. 【2606.31054】ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

Link: https://arxiv.org/abs/2606.31054

Authors: Zhiyuan Yao,Zheren Fu,Zhixiao Zheng,Jiajun Li,Yi Tu,Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Multimodal Large Language, Large Language Models, Large Language, generating content inconsistent, Multimodal Large

Comments: Accepted by ECCV 2026

Abstract:Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive degradation of text-to-image cross-attention during generation, leading to specific failure patterns like unfocused or biased attention. Existing mitigation strategies are largely outcome-driven and do not explicitly target this failure mode. To address this problem, we propose ADAPT (Attention Dynamics Alignment with Preference Tuning), an attention-based framework that intervenes directly on text-to-image cross-attention dynamics. ADAPT makes three key contributions: a cross-attention visual anchor refined from early decoding to provide stable spatial grounding, an attention-supervised inference mechanism that detects and corrects attention drift online, and a Visual Attention Guidance DPO that aligns preferences toward visually grounded responses. Experiments show that each component of ADAPT contributes to hallucination reduction, and the full framework achieves new best results across multiple hallucination benchmarks, reducing hallucination rates by 40%-60% across mainstream backbones while preserving general multimodal capabilities. Our work provides an attention-based perspective on mitigating hallucinations by exploring the model's internal text-to-image cross-attention behaviors. Code is available at this https URL

65. 【2606.31041】A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases

Link: https://arxiv.org/abs/2606.31041

Authors: Ha Jeong Kim,Saksonita Khoeurn,Ye Ji Yoon

Categories: Computation and Language (cs.CL)

Keywords: databases remains significantly, real-world enterprise databases, enterprise databases remains, databases remains, remains significantly

Comments: Submitted to FITAT 2026 for peer review

Abstract:Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names, heterogeneous SQL dialects, and complex analytical workloads requiring nested aggregations, temporal reasoning, and multi-table joins. We present a semantic-layer-mediated NL2SQL agent that decouples semantic intent from physical SQL execution. Rather than generating SQL directly over raw schemas, the agent reasons over a curated semantic layer through a compact intermediate representation called the Semantic Model Query (SMQ). A deterministic compiler translates each SMQ into dialect-specific SQL, providing verified building blocks that the agent composes into the final query. The system employs a constrained think-act loop, supports SQLite, BigQuery, and Snowflake backends, and is integrated into an end-to-end evaluation framework. Using Gemini 3 Pro, the system achieves 94.15% execution accuracy on the 547-task Spider2-snow benchmark, ranking third on the official leaderboard and substantially outperforming schema-only approaches. We describe the system architecture, SMQ representation, agent workflow, evaluation results, and discuss semantic-layer quality and the trade-off between improved grounding and overfitting.

66. 【2606.31039】Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

Link: https://arxiv.org/abs/2606.31039

Authors: Xudong Shen,Li Yuan,Ye Chen,Xin Wu,Yi Cai,Zhiyong Wu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, strong semantic capabilities, manipulative linguistic patterns, fallacies remains underexplored

Comments: Accepted to ACL 2026 Main. 33 pages (9 pages main text)

Abstract:Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such as logical fallacies remains underexplored. Prior work has primarily examined whether LLMs can identify or classify fallacies, leaving their robustness against fallacious persuasion insufficiently studied. To address this gap, we introduce LoFa (Logical Fallacy), a comprehensive benchmark for evaluating LLM robustness against fallacies. LoFa is constructed through a multi-agent pipeline that pairs factual questions with fallacious arguments, and is accompanied by a multi-round debate framework for assessing model resilience under sustained adversarial persuasion. To disentangle fallacy robustness from a model's inherent knowledge limitations, we further propose Logical Fallacy Resistance at k (LFR@k), a metric that quantifies resistance to fallacious attacks. Experiments show that LLMs exhibit varying levels of robustness across different fallacy types, revealing distinct vulnerability profiles among models.

67. 【2606.31033】CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations

Link: https://arxiv.org/abs/2606.31033

Authors: Kazuaki Furumai,Shuichiro Haruta,Kazunori Matsumoto,Daisuke Kamisaka

Categories: Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Generation, CORTEX, method for Retrieval-Augmented, long-form RAG outputs, Generation

Comments:

Abstract:In this paper, we propose CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation (RAG). In long-form RAG outputs, hallucinations often arise in localized spans rather than throughout an entire response. CORTEX therefore identifies ungrounded content at the token level, enabling fine-grained localization of hallucinations. The key intuition behind CORTEX is that tokens grounded in retrieved documents should be more strongly influenced by those documents than hallucinated tokens. To capture this document-induced effect, CORTEX compares internal representations of a large language model (LLM) under two conditions: with and without the retrieved documents. Instead of relying solely on each token's immediate sensitivity to the retrieved documents, CORTEX also leverages the propagation of document-grounded information through preceding tokens, reducing false positives for tokens whose evidence has already been absorbed into the context. Finally, CORTEX applies a post-processing smoothing step that models the tendency of hallucination labels to persist over contiguous spans, reducing local noise and encouraging span-consistent predictions. Experiments on two RAG benchmarks and three LLMs show that CORTEX substantially improves token-level hallucination detection, with each component consistently contributing to performance gains.

68. 【2606.31002】Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

Link: https://arxiv.org/abs/2606.31002

Authors: Ke Zhang,Patricio Gallardo Candela,Sudhir Murthy,Yi Xie,Zhi Wang,Maziar Raissi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Theorem-proving benchmarks evaluate, benchmarks evaluate proof, evaluate proof search, Theorem-proving benchmarks, evaluate proof

Comments: 25 pages, 5 figures

Abstract:Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean declaration may type-check while omitting hypotheses, changing domains, or expressing a vacuous claim. We study faithful statement formalization as both an evaluation problem and a bottleneck-attribution problem. On a 400-entry graduate-level benchmark spanning real analysis, complex analysis, topology, and algebra, our protocol combines Lean compilation, cross-model semantic judging, and human expert calibration. The resulting picture is different from compile-rate evaluation: a full tool-augmented agent reaches 89.5% compilation but only 60.5% consensus faithfulness, exposing a 29.0-point compile-pass but consensus-unfaithful gap. Targeted human audits support the metric as a conservative decision boundary: across available case-level audits, 96.0% of consensus-positive outputs are human-confirmed faithful, while 82.4% of compile-pass consensus-negative outputs are human-confirmed semantic failures. Under this metric, existing one-shot formalizer models and prover-oriented Lean models remain low, suggesting that formal validity, proof-oriented Lean competence, and faithful statement generation should be reported separately. We then use a full $2^3$ factorial design to decompose three recurring interventions in formalization pipelines: parametric expert drafting, Mathlib/context search, and Lean elaboration feedback. Elaboration feedback is the largest validity intervention, but it also exposes a larger compile-pass semantic-failure bucket; search mainly improves grounding and selectivity; and fine-tuned drafting is largely substitutable in this tool stack once feedback and grounding are available.
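
The claim that compilation is only a validity check can be illustrated with a toy Lean 4 pair. Both declarations below are invented for illustration and are not taken from the paper's benchmark; they use only core `Nat` lemmas, with no Mathlib.

```lean
-- Faithful: the hypothesis is satisfiable and the conclusion matches the intent
-- "every positive natural number is at least 1".
theorem pos_ge_one (n : Nat) (h : 0 < n) : 1 ≤ n :=
  Nat.succ_le_of_lt h

-- Type-checks, but the hypothesis `n < 0` can never hold for a natural number,
-- so the statement is vacuous and says nothing about any actual `n`.
theorem vacuous_claim (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)
```

Both declarations are accepted by Lean, yet only the first expresses the intended mathematics, which is why a separate faithfulness judgment is needed on top of the compile check.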

69. 【2606.30989】Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

Link: https://arxiv.org/abs/2606.30989

Authors: Naihao Deng,Yilun Zhu,Joan Nwatu,Clayton Scott,Rada Mihalcea

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: offensive statements, toxic and offensive, Warning, Abstract, statements

Comments:

Abstract:Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon. To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework. We further introduce Fair-GCG to systematically discover effective injection phrases. Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, improve reasoning-level fairness, reduce bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.

70. 【2606.30987】Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

Link: https://arxiv.org/abs/2606.30987

Authors: Christopher W. Karvetski,Sheldon S. Huang,Simas Kučinskas,Nadja Flechner,Jingyu Hu,Philip Tetlock,Ezra Karger

Categories: Computation and Language (cs.CL); General Economics (econ.GN)

Keywords: Decision-makers routinely rely, Decision-makers routinely, expert judgments accompanied, measure at scale, Explanation Quality Markers

Comments:

Abstract:Decision-makers routinely rely on expert judgments accompanied by written explanations, yet explanation quality is difficult to measure at scale. Forecasting tournaments offer a natural testing ground: probabilistic judgments are paired with natural-language rationales and scored against realized outcomes. We introduce Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by large language models (LLMs). In a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament, EQMs predict accuracy at both the forecast and forecaster levels, consistently outperforming pre-LLM text-analysis methods. More than 90% of statistically significant pattern-level EQM-accuracy correlations match our directional hypotheses. The signal is asymmetric: EQMs identify likely underperformers more reliably than they distinguish the very best forecasters. Benchmarked against traditional indicators of forecasting skill, EQMs are the strongest predictor at the forecast level and competitive at the forecaster level, though weaker than prior accuracy. Human ratings of rationale quality are less consistently correlated with accuracy and place disproportionate weight on rationale length. Results transfer to an independent forecasting study. EQMs provide a scalable, interpretable method for extracting judgment-relevant information from written explanations.

71. 【2606.30973】From Propositional to Perceptual Asymmetry: Extending Frictive Policy Optimization to Asymmetric Partial Information Dialogue

Link: https://arxiv.org/abs/2606.30973

Authors: Yifan Zhu, Kyeongmin Rim, James Pustejovsky

Subjects: Computation and Language (cs.CL)

Keywords: Frictive Policy Optimization, Frictive Policy, Policy Optimization, epistemic signal essential, common-ground construction

Comments: 11 pages. To appear in Proceedings of SIGDIAL 2026

Abstract:Frictive Policy Optimization (FPO; Pustejovsky et al., 2025) treats friction in collaborative dialogue -- misalignment, misunderstanding, repair -- as an epistemic signal essential to common-ground construction, rather than noise to be minimized. However, FPO and its implementations assume shared perceptual contexts, where friction arises from differently interpreted propositions over the same scene, which we define as propositional asymmetry. We extend FPO to perceptual asymmetry, where participants hold asymmetric partial information and the same referring expression yields different denotations depending on whose information state grounds the reference. We evaluate this through cross-corpora analysis and LLM probing on referentially asymmetric dialogue tasks, primarily the HCRC MapTask (Anderson et al., 1991). We find that FPO's friction functional is empirically valid only when evaluated from within each participant's information horizon: different landmark configurations produce qualitatively distinct grounding failure modes, with a small class of ambiguous configurations driving a disproportionate share of misunderstandings through trajectories that appear successful but silently diverge. The LLM probe confirms that having the "right perspective" matters more than having all perspectives: the informed single viewpoint outperforms omniscient access to both participants' contexts. We propose two annotation refinements: subtype decomposition of pending grounding states and accommodation-aware alignment classification.

72. 【2606.30957】Linguistic Distancing on Social Media: Indicators of Emotion Regulation Across Age Groups

Link: https://arxiv.org/abs/2606.30957

Authors: Daniela Teodorescu, Saif M. Mohammad, Alona Fyshe

Subjects: Computation and Language (cs.CL)

Keywords: Managing our emotional, emotional responses, Managing, emotion regulation, linguistic distancing

Comments: 13 pages, 3 figures, Computational Affective Science Workshop

Abstract:Managing our emotional responses to events is key to emotional well-being, a process referred to as emotion regulation in psychology. Previous work has established that the degree to which we distance events is a type of emotion regulation. When we psychologically distance from events, there can be markers in our language. These markers have been referred to as linguistic distancing. We build upon a previous metric to operationalize linguistic distancing, and explore how it changes across the lifespan. We explore this systematically by analyzing large amounts of social media text, a venue where people express their emotions. By investigating how distancing varies across age groups, we can better understand how emotion regulation varies with age and provide initial benchmarks on social media data. We provide additional evidence further strengthening the hypothesis that linguistic distancing occurs in proportionally more instances with age. These findings align with past work in psychology, which indicates improved well-being with older age. Better understanding how linguistic distancing changes with age is important because it functions as a marker of well-being and can inform effective health interventions. We provide a foundation for further exploring emotion regulation through linguistic distancing in text data.

73. 【2606.30943】Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

Link: https://arxiv.org/abs/2606.30943

Authors: M. K. Arabov

Subjects: Computation and Language (cs.CL)

Keywords: Russian scientific translation, Russian scientific, scientific communication, major languages, Russian

Comments: Preprint

Abstract:Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which affects international collaboration and the progress of sustainability-related research. We present a benchmark for Arabic--Russian scientific translation. The benchmark includes a hybrid parallel corpus of about 27,000 sentence pairs, compiled from scientific abstracts and general-domain texts (religion, news, conversations). We fine-tune three multilingual language models -- mT5-base (580M parameters), NLLB-200-distilled-1.3B (1.3B), and Qwen2.5-7B-Instruct (7B) -- using LoRA with ranks 8, 16, 32, and 64. The Qwen2.5-7B model with QLoRA (rank 8) yields BLEU 23.15, chrF 43.89, BERTScore 0.906, and COMET 0.758. These are +4.36 BLEU and +0.051 COMET above the zero-shot baseline. Few-shot prompting with three examples does not improve performance, indicating that domain-specific fine-tuning is required. We release the models, the corpus, and the evaluation code. By lowering the language barrier for scientific texts, the work enables knowledge exchange between Arabic-speaking and Russian-speaking researchers. It contributes to sustainable partnerships (UN SDG 17) and innovation infrastructure (SDG 9), aligning with the conference's focus on technology-driven sustainable development.

74. 【2606.30914】Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text

Link: https://arxiv.org/abs/2606.30914

Authors: Tanvir Ahmed Sijan, S. M Golam Rifat, Nayeemul Islam, Md. Musfique Anwar

Subjects: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, noise largely unexplored, real-world Automatic Speech, systems are typically, largely unexplored

Comments: 17 pages, 8 figures

Abstract:Event detection (ED) systems are typically evaluated on clean, curated text, leaving their robustness to real-world noise largely unexplored, particularly for low-resource languages such as Bangla. We introduce a generalized Bangla news event ontology and a benchmark comprising 9,979 annotated sentences across 40 event subtypes, spanning clean news text, real-world Automatic Speech Recognition (ASR) transcripts, and orthographically corrupted text. We systematically evaluate fine-tuned encoder-only models (BanglaBERT and XLM-R) alongside instruction-tuned decoder-only large language models (Llama 3 and Gemma 3). Our results reveal a clear architectural trade-off: encoder models achieve higher performance on clean text but degrade substantially under noise, whereas decoder-only LLMs are markedly more robust, particularly when event triggers are corrupted. We further show that embedding annotation guidelines during instruction tuning establishes a higher performance baseline on noisy text but yields inconsistent reductions in performance degradation across noisy conditions. Finally, model scaling consistently improves the robustness of decoder-only LLMs, while combined training on clean and noisy data serves as an effective regularization strategy that disproportionately benefits encoder architectures, significantly narrowing the robustness gap.

75. 【2606.30887】Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

Link: https://arxiv.org/abs/2606.30887

Authors: Mizanur Rahman, Abeer Badawi, Elahe Rahimi, Laleh Seyyed-Kalantari, Frank Rudzicz, Enamul Hoque, Elham Dolatabadi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: Large language models, mental health support, language models show, models show promise, Large language

Comments:

Abstract:Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. We introduce a framework that formulates therapeutic response generation as a decision-refinement problem driven by multi-dimensional, human-aligned evaluation. In Stage I, we introduce TheraJudge, an open-source therapeutic evaluator trained via preference-based optimization on human-annotated data to produce reliable judgments across 7 psychological dimensions. In Stage II, we introduce TheraAgent, which operationalizes TheraJudge's evaluations through a coordinated refinement process with specialized Critic, Coach, and Therapist roles that translate evaluative signals into targeted response revisions. Empirically, TheraJudge achieves strong agreement with clinician ratings, with intraclass correlation coefficients (ICC = 0.87-0.95), surpassing supervised baselines and strong closed-source judges, particularly on critical dimensions such as Safety, Relevance, and Empathy. Acting on these evaluations, TheraAgent yields a +0.43 improvement in human-rated therapeutic quality (on a 5-point scale) under blind evaluation, with 96% clinician inter-rater reliability. Low-quality responses ($\leq 3$) improve by +2.45 points with a 94% recovery rate, demonstrating targeted correction of unsafe outputs. Overall, our results indicate that effective alignment of mental-health LLMs stems from acting on human-aligned evaluation, rather than relying solely on stronger generation. We release code at this https URL.

76. 【2606.30857】Multilingual Polarization Detection Using Transformer-Based Models with Class Weighting and Threshold Tuning

Link: https://arxiv.org/abs/2606.30857

Authors: Aaron Bundi Anampiu

Subjects: Computation and Language (cs.CL)

Keywords: multievent online polarization, detecting multilingual, paper describes, describes our submission, multievent online

Comments:

Abstract:This paper describes our submission to SemEval-2026 Task 9 on detecting multilingual, multicultural, and multievent online polarization. We address all three subtasks: binary polarization detection, polarization type classification, and manifestation identification for English and Swahili. Our approach leverages transformer-based models (RoBERTa-base for English, AfroXLMR-base for Swahili) with class-weighted loss functions to address severe label imbalance and per-label threshold tuning to optimize multi-label classification. On the test set, we achieve F1 macro scores of 0.7901 (English) and 0.7910 (Swahili) for Subtask 1, 0.4615 (English) and 0.4808 (Swahili) for Subtask 2 and 0.4791 (English) and 0.5830 (Swahili) for Subtask 3, which give competitive performance on the leaderboard, demonstrating the effectiveness of our methods for handling imbalanced multi-label polarization detection. Our error analysis reveals that models struggle with dehumanization detection and lack of empathy.
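
The two techniques named in the abstract, class weighting and per-label threshold tuning, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names, the threshold grid, and the toy probabilities are all assumptions made for the example. The weight computed here is the negative-to-positive ratio per label, the kind of value often passed as `pos_weight` to a binary cross-entropy loss.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one binary label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)

def tune_thresholds(probs, labels, grid=None):
    """Pick, for each label column, the grid threshold with the best F1 (assumed grid)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best = []
    for j in range(probs.shape[1]):
        scores = [f1_binary(labels[:, j], (probs[:, j] >= t).astype(int)) for t in grid]
        best.append(float(grid[int(np.argmax(scores))]))
    return np.array(best)

def positive_class_weights(labels):
    """Negative/positive count ratio per label, so rare labels get larger weights."""
    pos = labels.sum(axis=0)
    neg = labels.shape[0] - pos
    return neg / np.maximum(pos, 1)

def macro_f1(probs, labels, thresholds):
    preds = (probs >= thresholds).astype(int)
    return float(np.mean([f1_binary(labels[:, j], preds[:, j]) for j in range(labels.shape[1])]))

# Toy validation data (hypothetical): label 1 is systematically under-confident.
val_probs = np.array([[0.90, 0.31], [0.80, 0.21], [0.22, 0.27], [0.10, 0.04]])
val_labels = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])

tuned = tune_thresholds(val_probs, val_labels)
default_f1 = macro_f1(val_probs, val_labels, np.array([0.5, 0.5]))
tuned_f1 = macro_f1(val_probs, val_labels, tuned)
weights = positive_class_weights(val_labels)
```

In practice the thresholds would be tuned on a held-out validation split and then frozen for the test set, since tuning on the evaluation data itself inflates the score.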

77. 【2606.30852】When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

Link: https://arxiv.org/abs/2606.30852

Authors: Zhe Dong (University of Maine at Presque Isle), Fang Qin (Stanford University), Manish Shah (Independent Researcher)

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reasoning models spend, computation across instances, spend different amounts, remains unclear, reasoning language models

Comments: 17 pages, 5 figures

Abstract:Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with LearnStop, a hidden-state-free checkpoint stopper for reasoning language models. At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking-marker density. Across 18 task-model settings spanning GSM8K, MATH-500, MMLU-Pro, AIME-90, GPQA, Qwen3, and DeepSeek-R1 distillations, the answer is task-dependent. On free-form math, learned multi-feature stopping improves the fixed-budget frontier and often beats scalar exits: on GSM8K with Qwen3-32B, the empirical frontier reaches a post-hoc peak adapt gain of +0.157, validation-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is +0.028. On multiple-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger. We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure. We further provide validation-selected operating points, paired bootstrap tests, finite-grid lost-correct risk calibration, cost accounting under KV-fork, prefix-cache, and black-box regimes, H100 serving profiles, checkpoint-schedule sweeps, transfer analyses, and robustness checks. The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem.

78. 【2606.30851】Test-Time Verification for Text-to-SQL via Outcome Reward Models

Link: https://arxiv.org/abs/2606.30851

Authors: Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: Improving the reliability, large language models, structured reasoning tasks, Outcome Reward Models, Majority Voting

Comments: Accepted to the SURGeLLM Workshop at ACL 2026, San Diego, US

Abstract:Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs. In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored. We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation. We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families. ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries. Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code, datasets, and models are publicly available.
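
To make the contrast between the selection strategies concrete, here is a small sketch of majority voting over execution results versus Best-of-N selection with a learned scorer. It is illustrative only and is not GradeSQL: the `execute` and `score` callables, the candidate queries, and the scores are hypothetical stand-ins for a database executor and a trained ORM.

```python
from collections import Counter

def majority_vote(candidates, execute):
    """Return the first candidate whose execution result is the most frequent one."""
    results = [execute(sql) for sql in candidates]
    winner, _ = Counter(results).most_common(1)[0]
    return candidates[results.index(winner)]

def best_of_n_with_orm(candidates, question, score):
    """Return the candidate with the highest verifier score for this question."""
    return max(candidates, key=lambda sql: score(question, sql))

# Hypothetical candidates sampled for one question.
question = "List the names of active users."
candidates = [
    "SELECT name FROM users",
    "SELECT name FROM users ORDER BY name",
    "SELECT name FROM users WHERE active = 1",
]

# Stand-ins for a real executor and a trained ORM (assumed values).
fake_results = {
    candidates[0]: ("ann", "bob"),
    candidates[1]: ("ann", "bob"),
    candidates[2]: ("ann",),
}
fake_scores = {candidates[0]: 0.20, candidates[1]: 0.15, candidates[2]: 0.90}

voted = majority_vote(candidates, lambda sql: fake_results[sql])
verified = best_of_n_with_orm(candidates, question, lambda q, sql: fake_scores[sql])
```

In this toy case two candidates agree on a result that ignores the filter, so output frequency picks one of them, while the scorer can still prefer the minority candidate.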

79. 【2606.30824】Information Terra: A Narrative-Anchored Semantic-First Projection of Document Embeddings

Link: https://arxiv.org/abs/2606.30824

Authors: Brian Keith-Norambuena, Fausto German, Chris North

Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: introduce Information Terra, longitude thematic deviation, Information Terra, narrative-anchored semantic-first projection, latitude encodes narrative

Comments: 5 pages, 6 figures, accepted in IEEE VIS 2026 as a short paper

Abstract:We introduce Information Terra, a narrative-anchored semantic-first projection that places a document corpus on an Earth-like globe whose poles are two user-chosen endpoint documents and whose prime meridian is the great-circle geodesic between them on the embedding hypersphere -- so latitude encodes narrative progress and longitude thematic deviation. Land features are recovered from document density via kernel density estimation and labeled by theme. A narrative trail built from the underlying narrative coherence graph, and constrained to be monotone in geodesic progress, provides a readable storyline. The projection's axes are semantically grounded in the user's chosen narrative endpoints, and the globe metaphor affords rotation and antipodal reading. We demonstrate the method on a 540-article Cuban Protests corpus, showing a storyline from Obama's 2016 visit to the 2021 International Aid during the protests.

80. 【2606.30815】When transformers learn "impossible" languages, what do they learn?

Link: https://arxiv.org/abs/2606.30815

Authors: Ram Janarthan, Coleman Haley, Sharon Goldwater

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Recent work suggests, Recent work, show a bias, Recent, human languages

Comments: CoNLL 2026 (Best Paper Award). 14 pages, 3 figures

Abstract:Recent work suggests that transformer language models show a bias towards human languages over unnatural ("impossible") languages argued to be unacquirable by humans. However, this literature has largely based these claims on differences in sample efficiency and test-set perplexity, rather than on direct evaluations of the linguistic capacities that could plausibly explain non-attestation in human languages. We evaluate two theoretically motivated linking hypotheses: impossibility arising from deficiencies in grammatical sensitivity or generative production. Using GPT-2 style models trained on perturbed "impossible" variants of English, we measure sensitivity to grammaticality using BLiMP minimal pairs, finding that model performance exhibits only gradual degradation, mediated by the language's information locality. In contrast, these models exhibited pronounced failures in generation, producing substantially fewer high-quality sentences at longer lengths. Together, these results suggest generative deficiency and transmission failures as a plausible linking hypothesis between language model behaviour and non-attestation of impossible languages.

81. 【2606.30814】When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

Link: https://arxiv.org/abs/2606.30814

Authors: Zhichao Yang, Caiqi Zhang, Ruihan Yang, Chengzu Li, Nigel Collier, Deqing Yang

Subjects: Computation and Language (cs.CL)

Keywords: Expected Calibration Error, Brier Score, Error and Brier, Calibration, model confidence aligns

Comments:

Abstract:Calibration evaluates whether a model's confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration Error and Brier Score. We begin by showing, both theoretically and empirically, that such comparisons are confounded by differences in model accuracy. For fairer cross-model comparison, we then propose ACE, an accuracy-controlled evaluation framework with three complementary views: Instance-Aligned, Distribution-Aligned, and Candidate-Aligned calibration. Across multiple benchmarks, model families, and confidence elicitation methods, we use ACE to study two practically important comparison axes, small versus large models and thinking versus non-thinking models. We find that many previously reported calibration advantages under raw global metrics weaken substantially after accuracy control. We also find that ranking reversal is frequent: models favored by raw metrics often cease to be favored once accuracy is controlled. Our results show that raw global calibration metrics are not robust for cross-model comparison, and that fair calibration comparison requires accuracy-aware evaluation.
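
The two global metrics named in the abstract are easy to compute, and a toy case shows one way accuracy can leak into them. The sketch below uses the standard definitions of the Brier score and equal-width-bin ECE; the two "models" are hypothetical, and the example illustrates the confound for the Brier score only, not the paper's ACE framework.

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared gap between confidence and the 0/1 correctness indicator."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE with equal-width confidence bins: weighted mean of |accuracy - confidence|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

# Two hypothetical models, both perfectly calibrated but with different accuracy.
strong_conf, strong_correct = [0.9] * 10, [1] * 9 + [0] * 1   # 90% accurate
weak_conf, weak_correct = [0.6] * 10, [1] * 6 + [0] * 4       # 60% accurate

ece_strong = expected_calibration_error(strong_conf, strong_correct)
ece_weak = expected_calibration_error(weak_conf, weak_correct)
brier_strong = brier_score(strong_conf, strong_correct)
brier_weak = brier_score(weak_conf, weak_correct)
```

Both toy models have an ECE of essentially zero, yet the less accurate one gets a clearly worse Brier score (about 0.24 versus 0.09), because a calibrated constant confidence p yields a Brier score of p(1 - p).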

82. 【2606.30801】Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

Link: https://arxiv.org/abs/2606.30801

Authors: Alessandro Morosini, Sarah H. Cen, Andrew Ilyas, Hedi Driss, Aleksander Mądry, Chara Podimata

Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: Personalization algorithms determine, encounter on online, Personalization algorithms, algorithms determine, Personalization

Comments: 43 pages, 10 figures

Abstract:Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box access to the algorithms, while personalization depends on users' attributes, behavior, and evolving interaction histories. Existing auditing methods face a tradeoff: studies with real users capture realistic behavior but are costly and hard to control, whereas sock-puppet audits scale more easily but often rely on scripted behavior that limits realism. Beyond this, both approaches struggle to decouple user attributes from user behavior, limiting our ability to causally understand personalization. To address this gap, we introduce a framework for black-box audits of personalization algorithms using generative AI agents as behavioral engines for synthetic accounts. Each agent is instantiated with a fixed persona, grounded in demographic and political survey data, and interacts with a platform's content by reasoning about it and choosing actions. Because behavior is fixed within each persona while platform-visible signals such as age, gender, or location can be experimentally perturbed, our design enables counterfactual auditing of how platforms respond to user attributes. As a case study, we deploy 1,120 agents on X shortly after the 2024 U.S. election, spanning 14 personas and three counterfactual conditions, collecting over 200,000 content exposures. We find that X's algorithmic feed amplifies toxic, polarizing, political, and right-leaning content relative to the chronological feed, with amplification varying sharply by user ideology. Counterfactual analyses show that demographic signals affect content delivery in persona-dependent ways: pooled effects are largely null, while subgroup-level effects vary in direction and magnitude. Our work establishes GenAI-based agents as a new tool for algorithmic auditing.

83. 【2606.30790】Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

Link: https://arxiv.org/abs/2606.30790

Authors: Avisha Das, Mihir Parmar, Mohana Ramnath, Pulkit Verma

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Romanized Code Mixing, Code Mixing, English in Roman, bilingual speakers fluidly, speakers fluidly blend

Comments:

Abstract:Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs) perform strongly on monolingual and native-script benchmarks, their ability to follow instructions and reason over RCM-based content remains largely unexplored. To this end, we introduce the Indi-RomCoM benchmark for facilitating systematic evaluation on Indic Romanized Code-Mixed instructions. Our benchmark spans seven instruction-following tasks, four widely spoken Indic languages, and three controlled code-mixing intensity levels. We extensively evaluate a suite of LLMs covering proprietary, open-weight, and Indic-focused models under zero- and few-shot settings. LLMs consistently underperform on RCM instructions, with performance degrading as code-mixing density increases. Furthermore, reasoning tasks suffer less degradation than detection tasks (e.g., Toxicity) because the generated explanations offer necessary context. We believe Indi-RomCoM helps the community in developing inclusive multilingual systems.

84. 【2606.30788】Revocable Learned State via Process Sidecars

Link: https://arxiv.org/abs/2606.30788

Authors: John Sweeney

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: mathrm, refuse outputs tied, public skill phase, private memory phase, Language models

Comments: 23 pages, 2 figures, 6 tables

Abstract:Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not the same problem as subtracting the memory update: the later safety optimizer has transported the memory direction. We introduce process sidecars, a two-coefficient edit family $\hat{\theta}(\lambda,\gamma)=\theta_{\mathrm{AMS}}-\lambda\Delta_{\mathrm{M}}-\gamma\hat{R}_{\mathrm{S}\leftarrow\mathrm{M}}$, with $\hat{R}_{\mathrm{S}\leftarrow\mathrm{M}}=\hat{J}_{\mathrm{S},\varepsilon}(\Delta_{\mathrm{M}})-\Delta_{\mathrm{M}}$, where $\hat{J}_{\mathrm{S},\varepsilon}$ is a centered secant through the realized future AdamW safety-training process. The implementation uses $\varepsilon=1$ at the natural memory-edit scale; it reuses $\theta_{\mathrm{AMS}}$ as the positive endpoint and computes one additional safety trace at $\theta_{\mathrm{A}}-\Delta_{\mathrm{M}}$. We prove two things. First, the exact sidecar, using the true transported direction $R_{\mathrm{S}\leftarrow\mathrm{M}}$ rather than the secant estimate, at $(\lambda,\gamma)=(1,1)$ recovers the counterfactual safety-only oracle $\theta_{\mathrm{AS}}$ up to second order; the proof treats AdamW as an augmented-state map over parameters, first moments, and second moments. Second, this process information is necessary: whenever future safety training bends the memory direction, every scalar task-arithmetic edit leaves first-order counterfactual error, while the process-sidecar edit is second-order accurate. Across three models, the validation-selected 2D edit improves held-out refusal closure over naive task arithmetic in all trials, and over the $\gamma=\lambda$ process-JVP subfamily, the diagonal slice of the cached 2D grid, in all paired trials.
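
The edit family in the abstract can be written out numerically. The sketch below is one reading of the formula, not the author's implementation: it assumes the centered secant takes the usual form (S(θ_A + εΔ_M) - S(θ_A - εΔ_M)) / (2ε) with ε = 1, where S stands for the safety-training map and S(θ_A + Δ_M) is θ_AMS. The affine `toy_safety_phase` is a hypothetical stand-in for real AdamW training, chosen because the secant is exact for affine maps.

```python
import numpy as np

def sidecar_edit(theta_ams, delta_m, theta_minus_s, lam=1.0, gamma=1.0):
    """theta_hat(lam, gamma) = theta_AMS - lam * Delta_M - gamma * R_hat.

    theta_minus_s is the result of running the safety phase from theta_A - Delta_M.
    Assumed centered secant with eps = 1: J_hat = (theta_AMS - theta_minus_s) / 2.
    """
    j_hat = (theta_ams - theta_minus_s) / 2.0
    r_hat = j_hat - delta_m              # R_hat = J_hat(Delta_M) - Delta_M
    return theta_ams - lam * delta_m - gamma * r_hat

# Hypothetical affine "safety phase" that bends the memory direction.
A = np.array([[1.0, 0.5], [0.0, 1.0]])
b = np.array([0.1, -0.2])

def toy_safety_phase(theta):
    return A @ theta + b

theta_a = np.array([1.0, 2.0])           # after the public skill phase
delta_m = np.array([0.3, -0.1])          # private memory update

theta_ams = toy_safety_phase(theta_a + delta_m)      # skill, memory, then safety
theta_minus_s = toy_safety_phase(theta_a - delta_m)  # the one extra safety trace
theta_as = toy_safety_phase(theta_a)                 # counterfactual: no memory phase

edited = sidecar_edit(theta_ams, delta_m, theta_minus_s, lam=1.0, gamma=1.0)
naive = theta_ams - delta_m                          # plain task-arithmetic subtraction
```

With this affine toy map the (1, 1) edit lands exactly on the safety-only parameters, while subtracting the memory update alone does not, because the safety map has changed the direction of Δ_M. For a real, nonlinear training process the abstract claims agreement only up to second order.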

85. 【2606.30775】A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

Link: https://arxiv.org/abs/2606.30775

Authors: Yangqiaoyu Zhou, Mohammad Alqudah, Kwei-Herng Lai, Aaron Halfaker, Yingqi Xiong, Yaar Harari

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: route user queries, natural language skill, agents route user, route user, natural language

Comments: 12 pages, 4 figures

Abstract:Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term skill collision. As agents scale to dozens of skills, manually tuning descriptions to maintain routing accuracy becomes a significant engineering bottleneck. We deploy an automated description optimization pipeline on a production enterprise group chat agent (9 skills, 372 regression cases). The pipeline produces descriptions averaging 79.2% F1, matching manually tuned descriptions at 79.4% F1 (average per-skill difference -0.20%, within the 0.78% multi-seed noise floor), while reducing per-skill engineering effort from 120 minutes to 3.8 minutes (a 32-fold speedup). We then examine which pipeline components actually drive this match. Systematic ablation on both the production system and ToolBench (16k tools) reveals that a single LLM rewrite using any available false-positive and false-negative cases captures most of the available improvement. Other design choices we tested (iteration budget, feedback signal composition, dual editing of confused pairs, and training set size) each affect final F1 by less than 0.5%. Description optimization addresses skill collisions caused by overlapping descriptions but cannot resolve cases where two skills' intended scopes genuinely overlap. We identify a diagnostic (a large train-validation F1 gap) that flags the latter cases for architectural rather than text-level intervention.

86. 【2606.30704】From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators

Link: https://arxiv.org/abs/2606.30704

Authors: Gan Luo, Zihan Qin, Bin Dong, Wotao Yin

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, structural consistency needed, Large language, reliable deployment, wide range

Comments: 35 pages, 8 figures

Abstract:Large language models (LLMs) excel across a wide range of tasks, yet their instance-specific solutions often lack the structural consistency needed for reliable deployment. Workflows that encode recurring algorithmic patterns at the task level provide a principled framework, offering robustness across instance variations, interpretable traces for debugging, and reusability across problem instances. However, manually designing such workflows requires significant expertise and effort, limiting their broader application. While automatic workflow generation could address this bottleneck, existing methods either produce instance-specific solutions without learning task-level patterns, or cannot generalize beyond their training configurations. We present MetaFlow, which casts workflow generation as a meta-learning problem: given a task and an operator set, the model learns to compose solution strategies. MetaFlow trains in two stages: supervised fine-tuning on synthetic workflow data, followed by reinforcement learning with verifiable rewards (RLVR) that uses execution feedback across problem instances in the task to improve end-to-end success. The resulting model produces effective workflows for trained tasks and exhibits strong generalization to untrained tasks and novel operator sets. Across benchmarks in question answering, code generation, and mathematical reasoning, MetaFlow achieves performance comparable to state-of-the-art baselines on in-domain tasks with single inference, while demonstrating remarkable zero-shot generalization capabilities on out-of-domain tasks and operator sets.

87. 【2606.30696】ViTL: Temporal Logic-Guided Zero-Shot Natural Language Navigation via Vision-Language Models

Link: https://arxiv.org/abs/2606.30696

Authors: Kaier Liang, Hengde Dai, Cristian-Ioan Vasile

Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Enabling robots, tasks remains challenging, natural language commands, follow natural language, remains challenging

Comments:

Abstract:Enabling robots to follow natural language commands to complete zero-shot long-horizon tasks remains challenging. It requires extracting implicit temporal and logical constraints from natural language commands and executing multiple sub-tasks accordingly. Recent zero-shot object navigation methods use vision-language models (VLMs) to guide frontier-based exploration in unknown environments, but they are limited to single-target tasks. Real-world commands such as "Clean either the chair or the couch, then turn on the tv." require navigating to multiple targets in a temporally constrained order, which no existing zero-shot system can handle. We present ViTL, a framework that addresses this gap at two levels. At the task level, we use a large language model (LLM) to compile natural language commands into Linear Temporal Logic (LTL) formulas, which are then converted into Deterministic Finite Automata (DFA) that coordinate multi-channel value maps and trigger dynamic replanning when new objects are detected. At the navigation level, we introduce directional score: rather than producing a direction-agnostic value across the entire field of view, we label frontier directions on the observation image and extract per-direction scores from the VLM. Experiments on Habitat-Matterport 3D (HM3D) show that the full framework enables zero-shot long-horizon completion of natural language navigation tasks with temporal constraints, and that directional score improves single-target navigation accuracy and efficiency over the baseline.
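
As an illustration of the task-level idea, the example command from the abstract can be tracked with a small hand-written automaton. This is a sketch under assumptions, not ViTL's LTL-to-DFA compiler: the state names, the event names, and the choice to ignore out-of-order events are all hypothetical.

```python
# Hand-written DFA for: "Clean either the chair or the couch, then turn on the tv."
TRANSITIONS = {
    ("start", "clean_chair"): "cleaned",
    ("start", "clean_couch"): "cleaned",
    ("cleaned", "tv_on"): "done",
}
ACCEPTING = {"done"}

def run_dfa(events, state="start"):
    """Advance the automaton over observed events; unknown pairs leave the state unchanged."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state

def satisfied(events):
    """True when the event sequence completes the task in the required order."""
    return run_dfa(events) in ACCEPTING

in_order = satisfied(["clean_couch", "tv_on"])
out_of_order = satisfied(["tv_on", "clean_chair"])
pending_state = run_dfa(["tv_on", "clean_chair"])
```

The current state tells a planner which sub-goals are still relevant: in `start` either cleaning target advances the task, and only in `cleaned` does reaching the TV matter.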

88. 【2606.30668】Emergent Culture in Minimal LLM Systems

Link: https://arxiv.org/abs/2606.30668

Authors: Simon Jones, Sabine Hauert

Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO); Populations and Evolution (q-bio.PE)

Keywords: LLM agents operate, minimal prompting, LLM agents, agents operate, introducing evolutionary pressure

Comments: 9 pages, 6 figures. Accepted for publication at Alife 2026 conference

Abstract:What happens when LLM agents operate with no context outside a turn, minimal prompting, and simple tools? Inspired by swarm engineering, we give collectives of three agents the ability to send messages and manipulate a shared actively decaying text store, introducing evolutionary pressure. The agents spontaneously cooperate, develop storage management strategies, and generate complex evolving cultural artifacts, with no top-down engineering. Using tools from dynamical systems analysis, we show that these behaviours exhibit structured long-range coherence beyond the entropy horizon of the decaying store, consistent with emergent culture in the Sperberian sense.

89. 【2606.30646】ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection

Link: https://arxiv.org/abs/2606.30646

Authors: Chukwuemeka Ugwu, Oluwafemi Richard Oyeleke

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: working memory processes, memory processes underlying, processes underlying instrumental, underlying instrumental activities, Speech recruits

Comments:

Abstract:Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia detection systems depend on transcription, discard within-recording temporal structure, and are validated on a single English corpus with known recording artifacts. We propose an ASR-agnostic framework operating directly on Mel spectrograms. Our key contribution is extracting spectrotemporal displacement fields from consecutive spectrogram frames, capturing shifting spectral energy patterns as digital biomarkers of cognitive decline. These features are fused with CNN-ConvGRU acoustic embeddings via a learned cross-attention mechanism and aggregated using a Transformer encoder with learnable query pooling. A composite temporal loss enforces smoothness and contrastive coherence across segments. We train independent models on English DementiaBank, Slovak EWA-DB, and Spanish Ivanova corpora, using clinical elicitation protocols taxing IADL-relevant cognitive domains. The Slovak model achieves 83.9% accuracy, and Spanish achieves, while the English baseline yields 53.2%, confirming known artifacts. Cross-lingual ablation studies reveal distinct fusion regimes: removing cross-attention collapses Spanish performance to 53.7%, below unimodal models, while the Slovak audio encoder alone outperforms the full model, 93.7% vs. 83.9%, and all English configurations remain near chance. Thus, multimodal fusion's value is corpus-dependent: essential when signal is distributed across modalities, counterproductive when one dominates, and irrelevant when no signal exists. Auxiliary temporal losses converge to language-invariant values, indicating cross-lingual architectural stability.

Information Retrieval

1. 【2606.31984】GR2 Technical Report

Link: https://arxiv.org/abs/2606.31984

Authors: Yufei Li, Zaiwei Zhang, Mingfu Liang, Kavosh Asadi, Jay Xu, Jimmy Kim, Chongyang Bai, Jieyi Zhang, Hongye Xie, Prachi Agrawal, Dian Yu, Tianyi Chen, Jean-Pascal Billaud, Garret Buell, YK (Yongkang) Zhu, Sachin Patil, Brooke Bian, Zhou Fang, Kevin Huang, Shiva Sudanagunta, Yuzhen Huang, Emma Lu, Chris O'Brien, Yang Song, Lihong Li, Jacob Tao, Zhicheng Zhu, Chao Li, Gaoxiang Liu, Neil Wu, Zhongyin Hu, Li Han, Loki Chen, Ming Lei, Greg Rehm, Siyuan Song, Tianwei Zhang, Li Li, Ketan Singh, Yavuz Yetim, Ilyas Atishev, Satendra Gera, Ashkan Sadeghi, Rachel Yan, Nikko Mizutani, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Parish Aggarwal, Kaushik Rangadurai, Zhi Hua, Frank Shyu, Ruchit Sharma, Liyuan Li, Shike Mei, Wenlin Chen, Santanu Kolay, Ben Schulte, Deepak Chandra, Adam (Yang) Song, Sandeep Pandey, Xi Liu, Hamed Firooz, Luke Simon

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: grid display formats, step disproportionately shapes, systems serve billions, Large Language Models, disproportionately shapes user

Comments: 18 pages, 10 figures

Abstract: Industrial recommendation systems serve billions of users through a multi-stage funnel -- retrieval, early-stage ranking, and re-ranking -- where the final re-ranking step disproportionately shapes user engagement and downstream performance, particularly for carousel and grid display formats. Despite growing enthusiasm for Large Language Models (LLMs) in recommendation, three gaps hinder industrial adoption: (1) most efforts target retrieval and ranking, leaving re-ranking -- the stage closest to the final user experience -- largely underexplored; (2) LLMs are typically deployed zero-shot or via supervised fine-tuning, underutilizing the reasoning capabilities unlocked by reinforcement learning (RL) on verifiable rewards; (3) deployed catalogs index billions of items with non-semantic identifiers that lie outside any base-LLM vocabulary. We present GR2 (Generative Reasoning Re-Ranker), an end-to-end framework that combines (i) mid-training on semantic IDs produced by a tokenizer with ≥99% uniqueness, (ii) reasoning-trace distilled from a stronger teacher via targeted prompting and rejection sampling, and (iii) RL with verifiable rewards purpose-built for re-ranking. To make GR2 resource-viable, we further (iv) introduce a context compressor that amortizes training cost, On-Policy Distillation (OPD) as a scalable alternative to SFT -- which we find collapses at industrial scale -- and reasoning distillation for low-latency serving. GR2 delivers +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic. We further find that reward design is critical in re-ranking: LLMs often hack rewards by preserving the incoming order or exploiting position bias, motivating conditional verifiable rewards as essential industrial components.

2. 【2606.31857】An Open-Source Tool for Reproducible Freeway Network Extraction from OpenStreetMap

Link: https://arxiv.org/abs/2606.31857

Authors: Drew Miller, Cathy Wu

Categories: Information Retrieval (cs.IR)

Keywords: preparing road network, road network inputs, network inputs remains, model formulation, difficult to deploy

Comments:

Abstract:Freeway simulation is often difficult to deploy at scale not only because of model formulation, but because preparing road network inputs remains a manual, corridor-specific, and difficult-to-reproduce task. This paper presents an open-source tool that extracts freeway networks from OpenStreetMap (OSM) and converts them into a compact, station-referenced representation suitable for downstream freeway simulation. Unlike existing tools that primarily support arterial or general network conversion tasks, the proposed workflow is designed around the specific requirements of freeway traffic studies. The tool supports not only OSM data cleaning and conversion, but also the broader workflow required in practice: corridor-specific querying, visual inspection of extracted segments, extraction validation against OSM, and source-data validation against aerial imagery. A locally hosted frontend allows users to define corridor-specific queries, select endpoints visually, and inspect extracted segments. The extraction logic is designed to address several recurring challenges in freeway OSM data, including inconsistent route references, ambiguous path selection through interchanges, managed-lane interference, incomplete corridor capture from naive bounding-box queries, and inconsistent ramp classifications. The workflow was first tested on two prototype corridors, where the extract-first-then-validate approach proposed here required roughly one-third the analyst effort of manual ramp encoding from scratch. It was then deployed across 359.6 miles of freeway in Orange County, California, with total processing and validation averaging about 41 seconds per mile. This deployment also suggests that, in a well-mapped region, OSM is sufficiently accurate for many freeway traffic studies. Overall, the tool provides a more scalable and reproducible foundation for freeway network preparation.

3. 【2606.31693】ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

Link: https://arxiv.org/abs/2606.31693

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: intent-driven experiences orchestrated, wave of AI-native, AI-native applications, applications is moving, feed-based browsing

Comments:

4. 【2606.31517】Unsupervised Data-Efficient Cross-Modal Retrieval with Global-Neighborhood Alignment Hashing

Link: https://arxiv.org/abs/2606.31517

Authors: Runhao Li, Xiaoxu Ma, Zhenyu Weng, Yue Zhang, Guibo Luo, Huiping Zhuang, Zhiping Lin, Yap-Peng Tan

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Compared to supervised, unsupervised CMH reduces, unlabeled image-text pairs, image-text pairs, unsupervised CMH

Comments:

Abstract:Compared to supervised cross-modal hashing (CMH), unsupervised CMH reduces the reliance on manual labeling by learning binary codes from unlabeled image-text pairs. However, existing unsupervised CMH methods often rely on large-scale image-text pairs, which are costly to collect. To address this limitation, we propose Global-Neighborhood Alignment Hashing (GNAH), a novel approach that preserves the semantic structure of vision-language foundation models within a compact binary Hamming space using only a limited number of image-text pairs. Specifically, GNAH captures global structural information from the continuous latent space and transfers it into the binary Hamming space through a Prototype-Anchored Global Alignment module. In addition, GNAH extends conventional pairwise contrastive learning by modeling stochastic neighborhood relationships via a Contrastive Stochastic Neighborhood Alignment module, thereby alleviating overfitting to sparse pairwise correlations. Extensive experiments demonstrate that GNAH consistently outperforms existing unsupervised cross-modal retrieval methods under data-constrained settings, offering a practical solution for real-world CMH applications.

5. 【2606.31156】One Retrieval to Cover Them All: Co-occurrence-Aware Knowledge Base Reorganization for Session-Level RAG

Link: https://arxiv.org/abs/2606.31156

Authors: Shivam Ratnakar, Yixuan Zhu, Cecilia Cheng, Chaya Vijayakumar

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: systems retrieve documents, retrieve documents optimized, RAG systems retrieve, systems retrieve, retrieve documents

Comments: Accepted to the Towards Knowledgeable Foundation Models (KnowFM) Workshop at ACL 2026

Abstract:RAG systems retrieve documents optimized for answering one query at a time. Yet enterprise users arrive with sessions, that is, coherent episodes of related questions that span semantically distant parts of the knowledge base. We show that a single retrieval call over a standard knowledge base covers only 41% of a user's session-level information need. To close this gap, we reorganize the KB offline using co-occurrence-aware clustering and expand retrieval candidates through cluster neighborhoods at query time. On WixQA (6,221 enterprise support articles), our method raises single-query session coverage to 58% (+17% absolute; 95% CI: [14.1, 20.4]), reduces retrieval calls to 70% coverage by 34%, and compresses the KB to 20% of its original size, all consistently across four embedding models and six functional domains. We argue that session-level coverage, not single-query recall, should be the primary metric for enterprise RAG evaluation.
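As a rough illustration of the session-coverage idea in the abstract above, the following minimal Python sketch computes what fraction of a session's needed documents a single retrieval call covers, before and after expanding the candidates through cluster neighborhoods. The function names, the toy document IDs, and the cluster assignment are all hypothetical; the paper's actual clustering method and coverage definition may differ.

```python
from collections import defaultdict


def session_coverage(retrieved, needed):
    """Fraction of the session's needed documents found in the retrieved set."""
    needed = set(needed)
    if not needed:
        return 1.0
    return len(needed & set(retrieved)) / len(needed)


def expand_by_cluster(retrieved, doc_to_cluster):
    """Add every document that shares a cluster with a retrieved document."""
    members = defaultdict(set)
    for doc, cluster in doc_to_cluster.items():
        members[cluster].add(doc)
    expanded = set(retrieved)
    for doc in retrieved:
        cluster = doc_to_cluster.get(doc)
        if cluster is not None:
            expanded |= members[cluster]
    return expanded


# Toy knowledge base (hypothetical): clusters group documents that tend to
# be needed within the same session.
doc_to_cluster = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
needed = ["a", "b", "c"]   # documents the whole session needs
retrieved = ["a"]          # result of a single retrieval call

base = session_coverage(retrieved, needed)
expanded = session_coverage(expand_by_cluster(retrieved, doc_to_cluster), needed)
print(base, expanded)
```

In this toy case the expansion pulls in document "b" through the shared cluster, so coverage rises from one of three needed documents to two of three.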

6. 【2606.31081】Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021

Link: https://arxiv.org/abs/2606.31081

Authors: Chengzhi Zhang, Liang Tian, Heting Chu

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: twenty-one major LIS, Information Science, research articles published, present study analyzed, machine learning

Comments:

7. 【2606.31069】Building a Multimodal Dataset of Academic Paper for Keyword Extraction

Link: https://arxiv.org/abs/2606.31069

Authors: Jingyu Zhang, Xinyi Yan, Yi Xiang, Yingyi Zhang, Chengzhi Zhang

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: typically relies solely, keyword extraction task, keyword extraction, task typically relies, extraction task typically

Comments:

8. 【2606.31058】Exploring the relationship between team institutional composition and novelty in academic papers based on fine-grained knowledge entities

Link: https://arxiv.org/abs/2606.31058

Authors: Ziling Chen, Chengzhi Zhang, Heng Zhang, Yi Zhao, Chen Yang, Yang Yang

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: important factor influencing, novelty, academic, important factor, factor influencing

Comments:

9. 【2606.31031】GenPage: Towards End-to-End Generative Homepage Construction at Netflix

Link: https://arxiv.org/abs/2606.31031

Authors: Lequn Wang, Jiangwei Pan, Fengdi Che, Linas Baltrunas

Categories: Information Retrieval (cs.IR)

Keywords: Netflix homepage construction, approach to Netflix, traditional multi-stage recommender, multi-stage recommender stack, generative approach

Comments:

Abstract: We present GenPage, an end-to-end generative approach to Netflix homepage construction that replaces the traditional multi-stage recommender stack with a single transformer. GenPage treats the user and request context as a prompt, and autoregressively generates the entire structured, multi-row homepage as the response. We adapt the LLM training recipe: pretraining on production pages, followed by post-training via weighted binary classification (WBC) or reinforcement learning (RL). For industry-scale deployment, we introduce techniques addressing cold start, model freshness, business-rule enforcement, and serving efficiency. In online A/B tests against a mature, highly optimized production homepage recommender, the WBC variant of GenPage delivered a +0.24% lift on the core user engagement metric we use for launch decisions (p < 0.001), while reducing end-to-end serving latency by 20%. Offline, two findings stand out: enriching the prompt yields a larger improvement than scaling model capacity in our current regime, and RL post-training increases homepage diversity even though diversity is not part of the objective.

10. 【2606.30984】Towards Critical IR Theories and Practices

Link: https://arxiv.org/abs/2606.30984

Authors: Bhaskar Mitra

Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY)

Keywords: inform information retrieval, Belkin and Robertson, constitutes societal good, half a century, century ago

Comments:

Abstract:Belkin and Robertson urged us, half a century ago, to develop a theoretical foundation for understanding what constitutes societal good that can inform information retrieval (IR) research and serve as a basis for determining when we should limit our scientific inquiry in the face of demands that are contradictory to societal good. In this article, I argue that to achieve this, IR should embrace critical theories and practices in our work, and shift away from the dominant liberal frame through which much of the IR community today view societal concerns in context of our research. Unlike the liberal frame, the critical frame explicitly adopts nondomination as its stated goal which can clarify our conceptualization of societal good within the field, provide necessary theoretical underpinning that Belkin and Robertson urged the community to develop, and serve as a basis for critical appraisals of our progress in enacting desired societal change.

11. 【2606.30824】Information Terra: A Narrative-Anchored Semantic-First Projection of Document Embeddings

Link: https://arxiv.org/abs/2606.30824

Authors: Brian Keith-Norambuena, Fausto German, Chris North

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: introduce Information Terra, longitude thematic deviation, Information Terra, narrative-anchored semantic-first projection, latitude encodes narrative

Comments: 5 pages, 6 figures, accepted in IEEE VIS 2026 as a short paper

Computer Vision

1. 【2606.32040】FaceMoE: Mixture of Experts for Low-Resolution Face Recognition

Link: https://arxiv.org/abs/2606.32040

Authors: Kartik Narayan, Vishal M. Patel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenging task due, limited identity information, identity information resulting, remains a challenging, low contrast

Comments: ECCV 2026, Project Page: https://kartik-3004.github.io/FaceMoE/

Abstract: Low-resolution face recognition (LR-FR) remains a challenging task due to poor feature extraction and aggregation, as probe images often contain limited identity information resulting from extreme degradations such as blur, occlusion, and low contrast. Additionally, the domain gap between high-resolution (HR) gallery images and low-resolution (LR) probe images poses a significant challenge. A single feature encoder struggles to generalize effectively across both domains when fine-tuned on an LR dataset, and this issue is further magnified by catastrophic forgetting. To address these challenges, we propose FaceMoE, an effective adaptation of Mixture of Experts (MoE) transformer architecture for low-resolution face recognition. Specifically, we introduce multiple specialized feed-forward network (FFN) experts and incorporate a top-k router, which dynamically assigns tokens to appropriate experts. This design emergently promotes specialization across experts for different semantic regions of the face, which enables FaceMoE to perform resolution-aware feature extraction. Moreover, the top-k router facilitates sparse expert activation, enabling the model to preserve pretrained knowledge when fine-tuned on an LR dataset, while increasing model capacity without proportional computational overhead. FaceMoE is trained with a combined face recognition loss, router z-loss, and load balancing loss to ensure expert specialization and stable training. To the best of our knowledge, this is the first work leveraging MoE for LR-FR. Extensive experiments across eleven datasets, spanning HR, mixed-quality, and LR benchmarks, demonstrate that FaceMoE significantly outperforms state-of-the-art methods. Code: this https URL
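For readers unfamiliar with top-k routing, the sketch below shows the generic mechanism in plain Python: a router scores the experts for one token, only the k best experts run, and their outputs are mixed with renormalized weights. This is a textbook-style illustration and not FaceMoE's code. The toy experts, the renormalization of the top-k weights, and all function names are assumptions, and the router z-loss and load balancing loss mentioned in the abstract are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts; renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


# Four toy "experts" (hypothetical stand-ins for FFN blocks).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]
router_logits = [2.0, 2.0, -1.0, -3.0]   # assumed router scores for one token

routes = top_k_route(router_logits, k=2)
output = moe_forward([1.0, 3.0], router_logits, experts, k=2)
print(routes, output)
```

Because only k experts are evaluated per token, adding experts grows capacity without a proportional increase in per-token compute, which is the property the abstract relies on.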

2. 【2606.32039】GEAR: Guided End-to-End AutoRegression for Image Synthesis

Link: https://arxiv.org/abs/2606.32039

Authors: Bin Lin, Zheyuan Liu, Chenguo Lin, Sixiang Chen, Yunyang Ge, Yunlong Lin, Jianwei Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Li Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual generative models, Visual generative, typically trained, tokenizer, generative models

Comments:

Abstract:Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.

3. 【2606.32036】PointSplat: Compact Gaussian Splatting via Human-Centric Prediction

Link: https://arxiv.org/abs/2606.32036

Authors: Yujie Guo, Yudong Jin, Lingteng Qiu, Zehong Shen, Zhen Xu, Jing Zhang, Xianchao Shen, Hujun Bao, Sida Peng, Xiaowei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: live streaming systems, immersive live streaming, limited computational power, streaming systems, transmission bandwidth

Comments: Project Page: https://zju3dv.github.io/pointsplat

Abstract:Producing 3D human representations from input views on the fly is essential for immersive live streaming systems, where representation compactness is as critical as high fidelity given limited computational power and transmission bandwidth. Although recent feed-forward reconstruction methods achieve impressive quality through the view-centric prediction of 3D representations, they repeatedly encode the same subject content across multiple views, leading to significant inter-view redundancy. Our key insight is to perform predictions directly in 3D space, enabling the network to learn and produce a highly compact representation. To this end, we propose PointSplat, a novel human-centric approach that directly infers Gaussian primitives from an input point set. The proposed method first estimates a coarse geometric proxy and performs ray casting to prune redundant points and establish explicit 2D--3D correspondences. Subsequently, it employs a Point-Image Transformer to fuse appearance and geometry features, predicting Gaussian attributes in a single forward pass. This design restricts predictions to foreground regions of interest, substantially reducing the total number of Gaussians while improving novel-view rendering quality. Extensive experiments demonstrate that PointSplat achieves higher efficiency and quality while exhibiting strong robustness to variations in view count and image resolution across multiple datasets.

4. 【2606.32033】SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

Link: https://arxiv.org/abs/2606.32033

Authors: Or Hirschorn, Aaron Olender, Eli Alshan, Ianir Ideses, Lior Fritz, Sagie Benaim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pre-trained diffusion transformers, directly injecting spherical, framework for generating, diffusion transformers, images and videos

Comments:

Abstract:We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: this https URL

5. 【2606.32023】FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data

Link: https://arxiv.org/abs/2606.32023

Authors: Emilie Vautier, Clément Mallet, Cédric Vega

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: national-scale resource monitoring, National Forest Inventory, Forest, resource monitoring, Forest attributes

Comments:

Abstract: Forest attributes are essential for national-scale resource monitoring. Airborne LiDAR metrics are among the auxiliary variables most strongly correlated with forest attributes used in National Forest Inventory (NFI) estimates. However, producing wall-to-wall predictions remains challenging when LiDAR data are acquired under heterogeneous conditions. As national LiDAR programs expand across Europe, variability in sensors, flight parameters, seasons, and scan angles limits the robustness of existing models, which are often calibrated for local conditions. We present FLORA (Forest LiDAR Octree Regression with Auxiliary Data), a deep learning framework that predicts six forest attributes: dominant height, total volume, deciduous volume, coniferous volume, basal area, and stem density from heterogeneous LiDAR point clouds. FLORA combines an octree-based backbone with ecological and spatiotemporal auxiliary variables through a late-fusion gating mechanism. Models are trained and evaluated on 32,052 National Forest Inventory plots across mainland France using data from the French LiDAR HD program. A single model trained on both leaf-on and leaf-off acquisitions outperforms season-specific models and improves cross-season robustness. Auxiliary variables provide modest overall gains but contribute more strongly to species-specific volume prediction. FLORA achieves an rRMSE of about 12.3% (R² = 0.88) for dominant height and 39% (R² = 0.74) for total volume, providing a robust baseline for large-scale forest attribute estimation from heterogeneous national LiDAR programs.

6. 【2606.32020】Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

Link: https://arxiv.org/abs/2606.32020

Authors: Anh Nguyen, Ngan Nguyen, Duc Vu, Trung Dao, Viet Nguyen, Quan Dao, Kien Nguyen, Chi Tran, Phong Nguyen, Khoi Nguyen, Cuong Pham, Dimitris Metaxas, Vishal M. Patel, Anh Tran

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion models achieve, models achieve impressive, achieve impressive quality, distribution-based timestep distillation, one-step diffusion models

Comments: ECCV 2026

Abstract:Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility. These results show that heterogeneous large Teachers can be distilled into efficient, deployable backbones through a lightweight latent-space interface.

7. 【2606.32018】Automated Background Swapping for Robustness against Spurious Backgrounds

Link: https://arxiv.org/abs/2606.32018

Authors: Cesar Roder, Kajetan Schweighofer

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep Neural Networks, Neural Networks exhibit, Deep Neural, exhibit strong performance, based on Deep

Comments:

Abstract:Classifiers based on Deep Neural Networks exhibit strong performance across domains, yet can fail catastrophically if they rely on spurious correlations, i.e., features that are predictive of the target label in the training data but are not causally linked and thus fail to generalize. For the vision domain, many such spurious correlations manifest themselves within the background of the image, where only the foreground is predictive of the class label. In this paper, we introduce Automated Background Swapping (AutoBackSwap) to reduce the reliance of classifiers on such spurious backgrounds. AutoBackSwap uses a secondary network to disentangle the foreground and background, followed by infilling to synthesize complete backgrounds, and finally combines different foregrounds and inpainted backgrounds to augment the training data. We find that patch-wise labeling of just a few hundred samples suffices to train the secondary network and automatically augment the full training dataset on challenging image classification tasks. In contrast to many previous methods, AutoBackSwap proves very effective even if there is not a single sample in the training data breaking the spurious correlation. Across a range of image classification tasks with spurious backgrounds, AutoBackSwap consistently outperforms prior methods.

8. 【2606.32012】CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

Link: https://arxiv.org/abs/2606.32012

Authors: Sanghyuk Chun, William Yang, Amaya Dharmasiri, Olga Russakovsky

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dunning-Kruger effect, long-standing challenge, metacognition is notoriously, notoriously difficult, Uncertainty estimation

Comments: 33 pages, 13.3MB

Abstract:Uncertainty estimation has been a long-standing challenge in AI models; it amounts to "knowing what you don't know," and metacognition is notoriously difficult even for humans (cf. the Dunning-Kruger effect). Although it is still far from solved even in simpler classification systems, tackling it in multimodal large language models (MLLMs) is becoming increasingly important. Within MLLMs, uncertainty can stem from any of the diverse sources as well as from their relationships, and further can stem from the unbounded answers in the open-ended setting. To tackle the issues, we propose CoMet, an MLLM uncertainty estimation method by decomposing uncertainty into a context-specific term and a multiplicity-specific term. The former captures ambiguity induced by the given context (e.g., task or prompt), while the latter captures how many plausible answers determined by the context remain compatible with the given input. We train a lightweight post-hoc uncertainty module to estimate these quantities, which enables efficient uncertainty estimation without autoregressive answer generation or repeated sampling. Experiments on various open-ended multimodal benchmarks, hallucination detection, and multiple-choice visual question answering benchmarks show that CoMet consistently improves uncertainty estimation over existing baselines while remaining efficient in practice. Code is available at this https URL

9. 【2606.31986】CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts

Link: https://arxiv.org/abs/2606.31986

Authors: Lianyu Hu, Shengqian Qin, Zeqin Liao, Qing Guo, Liang Wan, Wei Feng, Yang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating explicit intermediate, enabled multi-modal large, explicit intermediate reasoning, multi-modal large language, large language models

Notes: Accepted by ECCV2026. Code is available at [this https URL](https://github.com/hulianyuyy/CoLT)

Abstract:Chain-of-thought (CoT) reasoning has enabled multi-modal large language models (MLLMs) to tackle complex visual reasoning tasks by generating explicit intermediate reasoning steps in natural language. However, this text-based reasoning paradigm is inherently slow at inference time with even thousands of tokens and fundamentally constrained by the expressiveness of natural language. In this paper, we propose CoLT (Chain of Latent Thoughts), a novel framework that teaches multi-modal models to reason through a chain of latent thought representations instead of verbose text tokens, which can perform thinking with as few as 3 steps. Naively forcing the model to think with latent states easily produces meaningless semantics and makes training unstable. To effectively regulate the latent reasoning process, we introduce a lightweight external decoder that provides step-level supervision for each latent reasoning step in two complementary directions: a forward mode that decodes latent thoughts into the textual reasoning of the next step, and a backward mode that aligns decoder hidden states with the model's latent thoughts given preceding textual context. We further incorporate internal supervision that encourages coherent step-by-step latent transitions. The decoder and internal supervision are removed during inference to maintain high efficiency of latent reasoning. Extensive experiments on eight benchmarks demonstrate that CoLT not only outperforms existing latent reasoning methods such as CODI and SIM-CoT, but also surpasses latent visual reasoning approaches that rely on auxiliary images with costly annotation requirements. Compared to text CoT methods, CoLT can notably reduce the inference time by 10.1$\times$ and text decoding time by 22.6$\times$. Code is released at this https URL.

10. 【2606.31982】ERA: Entropy-Guided Visual Token Pruning with Rectified Attention for Efficient MLLMs

Link: https://arxiv.org/abs/2606.31982

Authors: Yuhao Wang, Mu Qiao, Haiwen Diao, Yunzhi Zhuge, Pingping Zhang, Xindong Zhang, Lei Zhang, Huchuan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Notes: 17 pages, 7 figures

Abstract:Multimodal Large Language Models (MLLMs) incur prohibitive inference costs due to long visual token sequences. Training-free visual token reduction provides an efficient solution. However, existing methods distort attention distributions, giving rise to a phenomenon we term Attention Logit Collapse. To address this issue, we propose ERA, an Entropy-guided visual token pruning framework with Rectified Attention for efficient MLLMs. Specifically, ERA comprises three crucial components: Dual-view Entropy Pruning (DEP), Bias-aware Token Recycling (BTR), and Logit-preserving Attention Rectification (LAR). First, DEP identifies representative anchor tokens by jointly modeling visual diversity and head-wise saliency. BTR then recycles pruned tokens into their corresponding anchors while estimating a cluster-level logit bias. Building upon this, LAR injects the estimated bias into attention logits, effectively rectifying the collapse induced by token reduction. Together, these components preserve visual evidence even under aggressive compression, enabling robust performance across single-image, multi-image, and video settings on a wide range of MLLMs. Beyond delivering practical acceleration, ERA establishes logit-preserving visual token pruning as a principled framework for efficient MLLMs, unifying theoretical foundation, algorithmic design, and practical deployment. The code is at this https URL.

11. 【2606.31981】LUNA: Learning Universal 3D Human Animation Beyond Skinning

Link: https://arxiv.org/abs/2606.31981

Authors: Peng Li, Rawal Khirodkar, Junxuan Li, Yuan Dong, Chen Cao, Yuan Liu, Wenhan Luo, Yike Guo, Shunsuke Saito

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Linear Blend Skinning, Blend Skinning, Linear Blend, Creating photorealistic, depends on Linear

Notes: ECCV 2026, Project page: [this https URL](https://penghtyx.github.io/LUNA/)

Abstract:Creating photorealistic, animatable 3D human avatars from monocular images still largely depends on Linear Blend Skinning (LBS) and parametric body models, which constrain expressivity and often introduce artifacts due to imperfect fitting. We propose LUNA, an LBS-free universal neural animation model that directly maps multiple 2D controls like images, keypoints, sketches, and unseen characters into 3D Gaussian deformations, bypassing explicit body fitting. At its core, a transformer-based motion regressor disentangles global rigid motion from fine-grained local dynamics to capture both coherent movement and subtle non-rigid effects. To resolve the inherent ambiguity of 2D-to-3D lifting while scaling beyond fitted datasets, we introduce hybrid supervision that distills soft structural priors from an LBS teacher and a loss that supports training on both limited fitted data and large in-the-wild unlabeled videos. Extensive experiments show LUNA achieves competitive visual fidelity compared to LBS-based approaches, while delivering realistic human motion and zero-shot cross-identity generalization across diverse driving modalities. To the best of our knowledge, LUNA is the first end-to-end 3D animatable model that supports implicit 2D driving.

12. 【2606.31979】Planar-SfM: Camera Pose Estimation via Homography Graph Embeddings

Link: https://arxiv.org/abs/2606.31979

Authors: Gabi Pragier, Matan Karklinsky, David Ungarish, Avi Ben-Cohen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Structure from Motion, standard epipolar geometry-based, systems traditionally struggle, epipolar geometry-based methods, systems traditionally

Abstract:Structure from Motion (SfM) systems traditionally struggle with planar scenes, where standard epipolar geometry-based methods become degenerate. Rather than viewing planar surfaces as a limitation, we propose a unified framework that leverages them as a source of geometric constraints. Our key insight is that each planar surface visible across multiple views provides an independent estimate of relative camera poses through homography decomposition. By aggregating estimates from multiple planes, or even from a single dominant plane, we achieve robust pose recovery in scenarios where traditional methods fail. We introduce a novel graph-based approach that constructs a pose-graph from homography estimates and employs spectral embedding to identify and filter unreliable edges. Our method maps homography-based pose estimates onto the real line based on their geometric and visual consistency, enabling efficient extraction of a maximally consistent spanning tree for pose recovery. This approach naturally handles both highly planar scenes, such as indoor sports arenas, and general 3D environments. We demonstrate superior performance on basketball court imagery where existing methods struggle, while matching or exceeding state-of-the-art results on unconstrained outdoor scenes from the IMC Phototourism benchmark.
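
As a rough illustration of the final step mentioned in this abstract (extracting a maximally consistent spanning tree from a pose graph), the sketch below keeps the highest-scoring edges that do not form a cycle. The per-edge consistency scores, the function name `max_consistency_spanning_tree`, and the choice of Kruskal's algorithm are all assumptions made for illustration; the abstract does not say how the authors score edges or build the tree, and the spectral embedding step is not reproduced here.

```python
def max_consistency_spanning_tree(num_views, edges):
    """Greedily keep the most consistent edges that do not close a cycle.

    edges: list of (score, i, j), where a higher score means a more trusted
    relative pose between views i and j. The scores are hypothetical inputs.
    """
    parent = list(range(num_views))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for score, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((i, j, score))
    return tree


# Toy pose graph over four views; scores are made up for the example.
edges = [(0.9, 0, 1), (0.2, 0, 2), (0.8, 1, 2), (0.7, 2, 3), (0.1, 1, 3)]
tree = max_consistency_spanning_tree(4, edges)
```

In this toy graph the two low-scoring edges are left out, so absolute poses would be chained only along the three most trusted relative poses.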

13. 【2606.31966】MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

Link: https://arxiv.org/abs/2606.31966

Authors: Qingyun Liu, Jiwen Zhang, Jingyi Hu, Siyuan Wang, Zhongyu Wei

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent multimodal large, environments remains underexplored, visually grounded environments, grounded environments remains, multimodal large language

Notes: Project website: [this https URL](https://q-i-n-g.github.io/MECoBench-Website/)

14. 【2606.31959】AnyBokeh: Physics-Guided Any-to-Any Bokeh Editing with Optical Fingerprint Transfer

Link: https://arxiv.org/abs/2606.31959

Authors: Xinyu Hou, Xiaoming Li, Zongsheng Yue, Chen Change Loy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image remains challenging, single image remains, tool in photography, remains challenging, post-capture bokeh editing

Abstract:Depth-of-field control is a fundamental tool in photography, yet post-capture bokeh editing from a single image remains challenging. A practical editor should handle images captured under arbitrary focus and aperture settings. Existing methods typically assume an all-in-focus input, or first recover an all-in-focus image before rendering new bokeh. Such pipelines can discard useful blur cues from the source image and propagate reconstruction artifacts into the final edit. We introduce AnyBokeh, a physics-guided framework for any-to-any bokeh editing. Instead of treating source blur merely as a degradation to be removed, AnyBokeh estimates the source blur state with a signed circle-of-confusion map and a disparity map. By modeling the linear relation between signed circle of confusion and disparity difference, AnyBokeh estimates a source-specific optical fingerprint and transfers the source optical characteristics to the desired focus and aperture setting. A generative editor conditioned on both source and target circle-of-confusion maps then performs relative blur synthesis, enabling spatially adaptive deblurring, preservation, and defocus rendering. To support physically supervised learning, we further construct a high-fidelity synthetic dataset with accurate depth, focus distance, and full EXIF metadata. Experiments on real-world benchmarks show that AnyBokeh achieves faithful and controllable editing across any-to-any bokeh editing, all-in-focus-to-bokeh rendering, and defocus deblurring, while avoiding all-in-focus reconstruction and test-time bokeh-level calibration commonly required by existing approaches. The code and dataset will be available at this https URL.
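
The linear relation mentioned in this abstract can be illustrated with a small sketch. Under a thin-lens model, the signed circle of confusion (CoC) is proportional to the difference between a pixel's disparity and the disparity of the focal plane, and it scales linearly with aperture diameter. The code below fits that single slope by least squares and reuses it for a new focus setting. Treating the slope as the "optical fingerprint", as well as the names `fit_coc_slope`, `retarget_coc`, and `aperture_scale`, are assumptions for illustration only; the abstract does not describe the authors' actual estimator.

```python
def fit_coc_slope(signed_coc, disparity, focus_disparity):
    """Least-squares slope K for signed_coc ~= K * (disparity - focus_disparity)."""
    diffs = [d - focus_disparity for d in disparity]
    denom = sum(x * x for x in diffs)
    if denom == 0:
        raise ValueError("all samples lie on the focal plane; slope is undetermined")
    return sum(x * c for x, c in zip(diffs, signed_coc)) / denom


def retarget_coc(disparity, slope, new_focus_disparity, aperture_scale=1.0):
    """Predict signed CoC for a new focal plane; aperture_scale rescales the slope."""
    return [aperture_scale * slope * (d - new_focus_disparity) for d in disparity]


# Synthetic samples (made up): focus at disparity 0.5 with a true slope of 2.0.
disparity = [0.0, 0.25, 0.5, 0.75, 1.0]
source_coc = [2.0 * (d - 0.5) for d in disparity]

slope = fit_coc_slope(source_coc, disparity, focus_disparity=0.5)
target_coc = retarget_coc(disparity, slope, new_focus_disparity=0.25, aperture_scale=2.0)
```

The sign of each value indicates which side of the focal plane a sample lies on, which is what lets one map describe both regions to sharpen and regions to blur.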

15. 【2606.31956】DEMUN: Fast and accurate discovery of music notation in very large collections

Link: https://arxiv.org/abs/2606.31956

Authors: Vojtěch Dvořák, Filip Bím, Jiří Mayer, Martina Dvořáková, Markéta Herzanová Vlková, Pavel Pecina, Petr Žabička, Jan Hajič jr

Categories: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: written musical heritage, memory institutions, heritage is preserved, preserved and digitised, digitised at memory

Abstract:Much of written musical heritage is preserved and digitised at memory institutions: libraries, museums, and archives. Owing to their collection structures, sheet music tends to be concentrated in large subsets that are defined as collections of music, with corresponding metadata that makes the music findable. However, when studying musical life as opposed to individual works, relevant documents often lie outside of these specialised collections: in textbooks, newspapers, other periodicals, pamphlets, and other documents with extensive circulation. But these documents are typically not catalogued as musical documents, and though there may be a lot of such documents overall, in large library collections, they are still extremely sparse. Manual discovery is thus unfeasible. Automated discovery requires an extremely low false positive rate in order to be useful, and must also operate quickly. We present DEMUN: a two-stage lightweight detector of music notation with a false positive rate of 0.015%. In the test scenario, 4 million images of a national-scale library were processed, out of which 1,500 pages with music notation were discovered, suggesting the entire collection may contain up to 20,000-30,000 unmarked documents of musical life.
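
To see why the false positive rate matters so much when positives are this sparse, the short calculation below turns the two figures quoted in the abstract into an expected number of false alarms. It assumes, as a worst case, that every processed image is a negative; the 1% comparison rate is a hypothetical value chosen only for contrast, and the abstract does not say how the 1,500 discovered pages were verified.

```python
from fractions import Fraction

false_positive_rate = Fraction(15, 100_000)  # 0.015% from the abstract
images_processed = 4_000_000                 # test-scenario size from the abstract

# Upper bound on expected false alarms if every processed image were a negative.
expected_false_alarms = false_positive_rate * images_processed

# Same calculation for a hypothetical detector with a 1% false positive rate.
expected_at_one_percent = Fraction(1, 100) * images_processed

print(expected_false_alarms, expected_at_one_percent)
```

At the quoted rate the false alarms stay below the 1,500 pages reported as found, whereas at the hypothetical 1% rate they would outnumber those pages by more than an order of magnitude.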

16. 【2606.31946】World Narrative Model for Highly Controllable Video Generation: A Paradigm Shift from Pixel Sampling to Physical World Orchestration

Link: https://arxiv.org/abs/2606.31946

Authors: Ye Chen, Xuanhong Chen, Yupeng Zhu, Liming Tan, Zhewen Wan, Yuxuan Xiong, Tielong Wang, Jinfan Liu, Wuze Zhang, Xiongzhen Zhang, Feifei Li, Xianglin Luo, Zhehan Zhao, Zhifan Zhang, Laisheng Kou, Zhujing Liang, Yugang Chen, Muchun Chen, Xu Miao, Yijing Zhang, Xiaojie Sheng, Qiang Hu, Jialiang Chen, Weimin Zhang, Wenjun Zhang, Bingbing Ni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: distribution sampling problem, industrial grade video, bypassing the explicit, instance level, lack of controllability

Abstract:The fundamental obstacle to industrial-grade video generation is the lack of controllability: existing models treat video as a pixel distribution sampling problem, bypassing the explicit, instance-level 4D (3D + T) physical world. Consequently, content creators cannot specify geometry, motion, camera parameters, or lighting in a deterministic, quantitative way, leading to the infamous "gacha" loop that makes professional content creation prohibitively inefficient and expensive. To address this, we introduce the World Narrative Model (WNM), a paradigm that decouples what to render -- the structured physical narrative -- from how to render -- the pixel generation process. WNM replaces end-to-end black-box sampling with orchestrated 4D pre-visualization for media generation. Collaborative agents translate sparse multimodal inputs, including text, reference videos, and sketches, into a fully editable world representation with scene geometry, object layouts, character/animal skeleton motion, trajectories, camera motion, and lighting at quantitative, physically meaningful granularity. This representation acts as a deterministic structural blueprint that drives existing video foundation models, either frozen or lightly adapted, to render final footage, turning the base model into a faithful neural shader. Built on this engine, our human-AI platform supports automatic world generation and pre-visualization aligned with professional filmmaking pipelines, while director consoles enable seamless human refinement. Experiments show that WNM greatly reduces probabilistic "gacha" calls and produces videos whose layout, motion, and cinematography closely follow creator intent. The framework is open and modular, allowing each component, such as world representation, control agents, and adapters, to be independently improved. Project website: this https URL.

17. 【2606.31938】FlexViT: A Flexible FPGA-based Accelerator for Edge Vision Transformers

Link: https://arxiv.org/abs/2606.31938

Authors: Hubert Dymarkowski, Xingjian Fu, Rappy Saha, Jude Haris, José Cano

Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Keywords: Deploying Vision Transformer, Deploying Vision, Vision Transformer, platforms remains challenging, remains challenging due

Notes: Accepted to 36th International Conference on Field-Programmable Logic and Applications (FPL) 2026

Abstract:Deploying Vision Transformer (ViT) models on edge platforms remains challenging due to their high computational demands and the architectural heterogeneity of modern hybrid ViT models, which incorporate both fully connected and convolutional layers. This heterogeneity leads to significant variation in tensor shapes, requiring flexible and efficient FPGA-based acceleration. In this paper, we present FlexViT, a reconfigurable FPGA accelerator for efficient ViT inference on resource-constrained edge devices. Built on the SECDA-TFLite framework, FlexViT employs a hardware-software co-design approach that maps both fully connected and convolutional layers onto a unified high-throughput INT8 GEMM engine using a runtime im2col transformation. To efficiently support diverse layer configurations, we propose a dual-mode dataflow that dynamically switches between input and weight reuse by reconfiguring the compute array at runtime. We further introduce a depth-first tiling strategy that completes accumulation in a single pass, eliminating off-chip partial-sum transfers and reducing memory bandwidth requirements. We implement FlexViT on a PYNQ-Z2 FPGA and evaluate it across a representative set of ViT models. FlexViT achieves up to 2.74x speedup on accelerator-executed layers, translating into up to 1.40x end-to-end speedup compared to CPU-only execution. The code is available at: this https URL

18. 【2606.31933】No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

Link: https://arxiv.org/abs/2606.31933

Authors: Haojian Huang, Harold Haodong Chen, Meng Luo, Junjia Du, Shanqing Xu, Ziheng Chen, Yanxiang Huang, Yinchuan Li, Ying-Cong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: evaluating video hallucination, large video models, controlled conditions, rigorous and controlled, video

Notes: ECCV 2026

Abstract:We introduce VidPair-Halluc, a new benchmark for evaluating video hallucination in large video models (LVMs) under rigorous and controlled conditions. Unlike previous benchmarks that primarily rely on text-based perturbations or adversarial questions while neglecting the consistency of visual backgrounds, VidPair-Halluc features video pairs with highly similar backgrounds but distinctly different foreground semantics, enabling precise attribution of model errors to genuine hallucination rather than background variation. The benchmark is constructed through PairFlow, a pipeline that leverages recent advances in text-to-image and video generation to systematically compose stories, generate coherent video clips, and assemble them into adversarial pairs. Covering both spatial and temporal reasoning across ten semantic aspects, VidPair-Halluc comprises 1K high-quality adversarial video pairs and 11K spatio-temporal QA pairs with control over background and foreground variations. Evaluations on mainstream LVMs show persistent difficulty with robust fine-grained video understanding in adversarial settings. Code and data are available at this https URL.

19. 【2606.31924】InstanceControl: Controllable Complex Image Generation without Instance Labeling

Link: https://arxiv.org/abs/2606.31924

Authors: Xiaoyu Liu, Huan Wang, Fan Li, Zhixin Wang, Jiaqi Xu, Ming Liu, Wangmeng Zuo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: guide image generation, Controllable image generation, guide image, depth maps, image generation

Abstract:Controllable image generation methods, such as ControlNet, have demonstrated a remarkable capacity to introduce visual conditions (e.g., depth maps) to guide image generation. However, these methods often struggle with complex multi-instance scenes, frequently leading to attribute confusion among instances. While recent approaches attempt to mitigate this via manual instance labeling, such requirements are labor-intensive. In this paper, we propose InstanceControl, a novel multi-instance controllable generation method that eliminates the need for instance labeling. We identify the primary bottleneck in existing methods as the inability to accurately associate instance descriptions with their corresponding regions within visual conditions. To address this, we leverage the Vision-Language Model (VLM) to establish instance-level correspondences between text prompts and visual conditions. Specifically, the VLM automatically parses instance descriptions from the text prompts and simultaneously predicts instance masks based on the visual conditions. Furthermore, since the predicted masks may contain noise, we introduce an adaptive mask refinement strategy that dynamically refines these instance masks during the generation process. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, achieving superior fidelity and precise instance-level control.

20. 【2606.31919】MVP-Nav: Multi-layer Value Map Planner Navigator

Link: https://arxiv.org/abs/2606.31919

Authors: Wenyuan Xie, Shaokai Wu, Yijin Zhou, Yanbiao Ji, Guodong Zhang, Bayram Bayramli, Qiuchang Li, Xunchu Zhou, Yue Ding, Hongtao Lu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Zero-shot Object Goal, Object Goal Navigation, Object Goal, information introduces severe, severe physical uncertainty

Abstract:Zero-shot Object Goal Navigation (ZSON) with RGB-only perception poses a fundamental challenge for embodied agents, as the absence of explicit depth information introduces severe physical uncertainty and semantic-physical misalignment. Existing approaches either rely on high-level semantic reasoning without geometric grounding or learn end-to-end policies that lack explicit physical constraints, often resulting in semantically plausible but physically unsafe behaviors. In this paper, we propose MVP-Nav, a physical-aware RGB-only navigation framework that aligns perception, planning, and control with the real 3D world. MVP-Nav reconstructs explicit physical occupancy from monocular observations by leveraging 3D foundation models to project 2D semantic instances into 3D oriented bounding boxes, forming a global spatial semantic representation. To unify high-level semantic reasoning and low-level physical constraints, we introduce a Multi-layer Value Map (MVM) that integrates semantic priorities and reconstructed geometry into a shared cost space, enabling physically grounded geometric planning. Extensive experiments on zero-shot object navigation benchmarks demonstrate that MVP-Nav significantly outperforms existing depth-free methods, achieving state-of-the-art performance and validating that structured physical priors can effectively compensate for the absence of active depth sensors.

21. 【2606.31918】DriveWeaver: Point-Conditioned Video Inpainting for Controllable Vehicle Insertion in Autonomous Driving Simulation

Link: https://arxiv.org/abs/2606.31918

Authors: Junzhe Jiang, Zipei Ma, Zijie Pan, Li Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: involves inserting foreground, simulation involves inserting, autonomous driving simulation, autonomous driving, driving simulation involves

Notes: Accepted at ECCV 2026, Project Page: [this https URL](https://github.com/LogosRoboticsGroup/DriveWeaver)

Abstract:A pivotal step in autonomous driving simulation involves inserting foreground vehicles with predefined trajectories into simulated scenes. This process enhances scene diversity and facilitates the creation of various corner cases for testing and improving autonomous driving models. However, existing methods often rely on pre-reconstructed 3D assets, which frequently lead to lighting inconsistencies between the inserted foreground and the background. Moreover, the reliance on limited, manually-curated 3D assets hinders large-scale deployment. To address these challenges, we propose DriveWeaver, a novel framework for controllable vehicle insertion in autonomous driving simulation. Specifically, for a masked target insertion area, DriveWeaver performs video inpainting conditioned on vehicle point clouds to generate high-quality, temporally consistent vehicles. This video-inpainting-based approach ensures seamless blending between the foreground and background, while the readily available point cloud conditions enable superior generalization. To support long-term generation, we further design a global-to-local hierarchical inpainting strategy, ensuring the consistent identity and appearance of the inserted vehicles. Meanwhile, we extract explicit 3D Gaussian representations of the inserted vehicles through an urban reconstruction pipeline to enable real-time rendering for autonomous driving simulation. Extensive experiments across diverse datasets demonstrate that our method outperforms existing baselines in visual realism and geometric consistency, providing a robust tool for scalable autonomous driving scene augmentation.

22. 【2606.31903】Attend, Transform, or Silence: Operator-Level Visual Skipping for Efficient Multimodal LLM Inference

Link: https://arxiv.org/abs/2606.31903

Authors: Zhaoyang Luo, Runmin Dong, Miao Yang, Fan Wei, Yushan Lai, Bin Luo, Haohuan Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, increasingly process long, process long visual-token, Multimodal large, increasingly process

Abstract:Multimodal large language models (MLLMs) increasingly process long visual-token sequences, increasing the overall inference computation. Existing acceleration methods usually remove visual tokens or skip visual-token updates in entire layers, but these coarse strategies may discard fine-grained evidence or suppress useful operators together with redundant ones. In this paper, we study visual-token computation from an answer-observable perspective and find that late visual-token updates can remain large while having little effect on answer-token representations. Motivated by this answer-silent redundancy, we decompose each Transformer layer into attention and FFN operators and show that useful visual computation is often operator-dominant and layer-dependent. We propose an operator-level visual-token skipping framework that preserves the full visual-token sequence while selectively bypassing redundant attention, FFN, or both. Experiments across three MLLM architectures and 10 VQA benchmarks show that our method achieves strong efficiency-accuracy trade-offs, reducing 33.7% TFLOPs on Qwen3-VL while retaining 99.5% of the vanilla model performance.

23. 【2606.31895】RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception

Link: https://arxiv.org/abs/2606.31895

Authors: Shaozu Ding, Linan Song, Marco De Vincenzi, Dajiang Suo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: roadside cooperative perception, increasingly been integrated, cameras to expand, expand coverage, coverage and mitigate

Notes: Accepted to ECCV 2026. Including supplementary material

Abstract:LiDAR has increasingly been integrated into traffic cameras to expand coverage and mitigate occlusion in roadside cooperative perception. However, how unimodal and camera-LiDAR fusion architectures behave under variations in LiDAR point sparsity induced by sensor configurations and scene-dependent sensing conditions remains underexplored. We introduce RESOLVE, a large-scale real-world benchmark dataset featuring multi-resolution roadside LiDAR and synchronized camera-LiDAR sensing for systematic evaluation of unimodal and fusion-based architectures in roadside 3D detection and tracking. RESOLVE contains over 100k images and 26k point cloud frames with 220k manually annotated bounding boxes, captured at a real-world urban intersection across diverse lighting and weather conditions and spanning 10 classes of traffic participants. In particular, RESOLVE enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed. This allows fair cross-architecture comparisons under point cloud distribution shifts resulting from resolution variations, sensing distance, and training-inference resolution mismatches. Results from extensive benchmark experiments reveal insights into how multimodal fusion can compensate for LiDAR point sparsity, offering clues for designing cost-efficient roadside multimodal perception. The dataset and benchmark codes are available at this https URL.

24. 【2606.31876】Harnessing Textual Refusal Directions for Multimodal Safety

Link: https://arxiv.org/abs/2606.31876

Authors: Moreno D'Incà, Massimiliano Mancini, Nicu Sebe

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Language Models, Language Models, Large Language, perform post-training alignment, perform post-training

Notes: Preprint

Abstract:To improve safety in Large Language Models (LLMs) we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs) as they require unsafe multimodal data, harder to collect than their unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness is conditioned by layer selection, steering strength, and cross-modal alignment, with the latter causing safe multimodal inputs to be spuriously steered toward refusal. Building on this, we introduce Modality-Agnostic Refusal Steering (MARS), a light-weight training-free approach that injects multimodal safety without the need for multimodal safety data. MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer, operating at the first generated token. Evaluated on five SOTA MLLMs across safety, utility, and video jailbreak benchmarks, MARS achieves consistent safety gains while preserving utility. These results reveal that safety-relevant structure is shared across modalities and that textual refusal directions are a powerful and underexplored foundation for multimodal alignment.

25. 【2606.31875】SENSE-VAD: Sentient and Semantic Video Anomaly Detection for Autonomous Driving

Link: https://arxiv.org/abs/2606.31875

Authors: Nghia T. Nguyen, Lokman Bekit, Yasin Yilmaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: socially complex situations, situations whose danger, danger is constituted, constituted by inter-agent, inter-agent relationships

Abstract:Autonomous vehicles (AVs) must navigate not only motion-based hazards but also socially complex situations whose danger is constituted by inter-agent relationships rather than movement statistics alone. A child running away from a guardian, a person being carried by another, or a pursuer chasing a pedestrian across a sidewalk are all anomalous in social context, yet none produces an obvious motion signal that current anomaly detectors are equipped to flag. We introduce SENSE-VAD, the first synthetic video anomaly detection benchmark for autonomous driving explicitly designed around socially complex anomalies. Using the CARLA simulator and Unreal Engine (UE), we generate distinct anomaly scenarios across multiple categories: individual behaviors, group behaviors, person-object interactions, cyclist interactions, and vehicle agents, each annotated with per-frame binary labels. A key design principle is the separation of social anomaly from motion-based or appearance-based anomaly: many scenarios involve motion of objects that appears unremarkable in isolation but is anomalous in relational context. We additionally provide real-world normal and anomalous videos as a sim-to-real transfer probe. We evaluate state-of-the-art video anomaly detection baselines and demonstrate that socially complex anomalies constitute a distinct and currently unsolved challenge. Our dataset, annotations, and generation code are publicly available.

26. 【2606.31839】Towards Voxel Spacing Consistency for Medical Image Segmentation

Link: https://arxiv.org/abs/2606.31839

Authors: Xin You, Runze Yang, Minghui Zhang, Hanxiao Zhang, Han Li, Yi Yu, Jie Yang, Nassir Navab, Yun Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Volumetric medical image, intraoperative guidance, preoperative diagnosis, diagnosis and intraoperative, Volumetric medical

Notes: 12 pages, 6 figures

Abstract:Volumetric medical image segmentation is essential for both preoperative diagnosis and intraoperative guidance. While recent years have witnessed rapid progress in segmentation architectures, comparatively little attention is paid to the physical voxel spacing of anatomical data. Indeed, volumetric image resampling is a ubiquitous preprocessing step before segmentation, yet its interaction with downstream segmentation has not been systematically exploited. In this work, we study the correlation between image resampling and segmentation, and propose Consispace, a semantic-aware resampling framework that achieves consistent voxel spacing in the axial direction while preserving anatomical and semantic consistency. Consispace introduces an ODE-based anatomical constraint to model inter-slice dynamics with a continuous interpolator, enabling faithful reconstruction under complex anatomical transitions beyond discrete interpolation. To further couple resampling with segmentation objectives, we leverage dense features from a pretrained vision model to build intra-slice semantic correlation maps and inject class-wise semantic consistency via feature reweighting during resampling. Both intra-slice and inter-slice constraints are integrated into an implicit neural network, supporting arbitrary-scale resampling. Extensive experiments on multiple datasets demonstrate that Consispace achieves superior reconstruction quality and perceptual fidelity, produces smoother inter-slice anatomy, and improves downstream segmentation performance when used as a preprocessing step.

27. 【2606.31834】Real-Time Source-Free Object Detection

Link: https://arxiv.org/abs/2606.31834

Authors: Sairam VCR,Varun Gopal,Poornima Jain,Vineeth N Balasubramanian,Muhammad Haris Khan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: source-free object detection, existing source-free object, Real-world detectors, autonomous driving, robotics must handle

Comments: Accepted to ECCV 2026

Abstract:Real-world detectors for autonomous driving, surveillance, and robotics must handle domain-shifts under strict latency and memory constraints, yet existing source-free object detection (SFOD) methods rely on heavyweight architectures that prioritize accuracy alone. We show this trade-off is unnecessary: building on YOLOv10, an NMS-free dual-head detector, we achieve state-of-the-art adaptation accuracy while being faster and more compact. We observe that directly applying vanilla mean-teacher self-training to dual-head detectors leads to suboptimal adaptation performance due to two key factors. First, simple pseudo-label generation strategies, such as using a single head or directly combining high-confidence predictions from both heads, yield suboptimal supervision under domain-shift. We propose DHF (Dual-Head Pseudo-Label Fusion) which selectively admits one-to-one (O2O) and one-to-many (O2M) head predictions, preserving precision and recovering missed objects. Second, we observe domain-shift collapses multi-scale feature discriminability. We propose the use of our MARD (Multi-scale Adaptive Representation Diversification) loss which mitigates this by enforcing detection-aware variance and covariance constraints on multi-scale feature maps. Both modules are training-time only, leaving inference unchanged. Across domain-shift benchmarks, our method, RT-SFOD, yields 1.4 to 3.5% mAP gains, 1.3× higher throughput, with ~2× fewer parameters than prior state-of-the-art SFOD methods, thus advancing the Pareto frontier of the speed-accuracy-model size trade-off. We report main results with YOLOv10, and demonstrate generalizability with additional YOLO- and DETR-based dual-head detectors. Code is available here: this https URL

28. 【2606.31830】PriorEye: Geospatial Visual Priors for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2606.31830

Authors: Kyuhwan Yeon,Benjamin Ramtoula,Daniele De Martini

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: human drivers employ, anticipatory foresight human, foresight human drivers, instantaneous sensor observations, methods rely solely

Comments: Accepted to ECCV 2026

Abstract:Most end-to-end autonomous driving methods rely solely on instantaneous sensor observations, limiting them to reactive behavior without the anticipatory foresight human drivers employ through prior experience. We introduce geospatial visual priors, street-level visual context anchored to the intended driving route, providing visual-spatial foresight independent of real-time sensors. We propose a memory augmentation module featuring a dual-memory architecture and an adaptive memory gate, which can be easily integrated into existing end-to-end approaches. This design pairs a contextual memory for retrieved priors with a persistent fallback memory, and dynamically regulates the influence of memories based on current state compatibility. Evaluated on the NAVSIM-v2 benchmark, our approach consistently improves performance across diverse end-to-end baselines. Furthermore, because these priors are independent of onboard sensors, our method inherently improves robustness against sensor corruption, while the dual-memory design ensures safe fallback when the retrieved priors themselves become unreliable. Our project page is available at this https URL.

29. 【2606.31825】Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning

Link: https://arxiv.org/abs/2606.31825

Authors: Junha Jung,Minbyul Jeong,Suhyeon Lim,Sungwook Jung,Jaehoon Yun,Taeyun Roh,Mujeen Sung,Jaewoo Kang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain predominantly outcome-centric, large language models, shown great promise, existing post-training pipelines, post-training pipelines remain

Comments:

Abstract:Recent multimodal large language models have shown great promise in clinical image reasoning, but existing post-training pipelines remain predominantly outcome-centric, relying on final answer correctness or sequence-level preferences. This suffers from sparse credit assignment, making it difficult to optimize the reasoning process essential for clinical applications. Our analysis reveals that cascading errors from early-stage reasoning failures are a leading cause of incorrect predictions in medical visual question answering (VQA) benchmarks. Motivated by this, we propose Medical Reasoning-aware Policy Optimization (MRPO), an RL algorithm that incorporates step-wise process rewards. When the final answer is incorrect, MRPO assigns exponentially larger penalties to tokens in earlier invalid reasoning steps, breaking failure cascades without compromising successful paths. Across three multimodal LLM backbones, MRPO consistently outperforms standard GRPO and a recent RL baseline, and on Qwen3-VL-8B-Instruct even surpasses substantially larger medical MLLMs such as HuatuoGPT-Vision-34B by 2.79 points. Moreover, MRPO reduces early-stage reasoning failures from 64.0% to 13.0%, showing that targeted mitigation of cascading failures improves both reasoning quality and final answer accuracy. Our code is available at this https URL
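The abstract above says only that penalties grow exponentially for earlier invalid reasoning steps when the final answer is wrong; it does not give the formula. The following is a minimal sketch of one way such a weighting could look. The function name `step_penalties`, the `base` and `growth` parameters, and the specific form `base * growth ** (last - i)` are illustrative assumptions, not MRPO's actual definition.

```python
def step_penalties(step_valid, answer_correct, base=1.0, growth=2.0):
    """Return one penalty per reasoning step (hypothetical weighting).

    step_valid: list of booleans, True if the step was judged valid.
    answer_correct: True if the final answer was correct.
    """
    # Successful trajectories are left untouched.
    if answer_correct:
        return [0.0] * len(step_valid)
    last = len(step_valid) - 1
    # Invalid steps are penalized; the earlier the step, the larger the exponent.
    return [
        0.0 if ok else base * growth ** (last - i)
        for i, ok in enumerate(step_valid)
    ]


penalties = step_penalties([False, True, False], answer_correct=False)
```

In this sketch the first step, being both invalid and earliest, receives the largest penalty, while valid steps and correct trajectories receive none.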

30. 【2606.31824】Absorption-Feature-Guided Distance-Decoupled Estimation and Band Selection for LWIR Hyperspectral Passive Ranging

Link: https://arxiv.org/abs/2606.31824

Authors: Shuo Liu,Chen Fan,Zhihe Chen,Xiaolin Huang,Lilian Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Long-wave infrared, distance-dependent atmospheric absorption, providing a physical, observations contain distance-dependent, physical basis

Comments: 18 pages, 9 figures

Abstract:Long-wave infrared (LWIR) hyperspectral observations contain distance-dependent atmospheric absorption signatures, providing a physical basis for long-range passive ranging. However, in natural scenes, these signatures are nonlinearly coupled with target temperature, material emissivity, and path radiance, making distance inversion from observed radiance ill posed. Existing methods typically rely on full-band measurements and pixel-wise joint optimization, which is computationally expensive and does not explicitly exploit sharp atmospheric absorption structures. This paper proposes an Absorption-Guided Distance-Decoupled Estimation and Refinement (ADER) framework for LWIR hyperspectral passive ranging. ADER represents emissivity with B-spline control points under a smoothness prior, suppressing overfitting to atmospheric absorption structures and enabling distance-decoupled estimation. It further uses ozone-absorption cues to classify pixels into emission-dominant and reflection-dominant groups. For emission-dominant pixels, ADER compensates path radiance and transmittance and estimates distance by one-dimensional absorption-residual minimization. For reflection-dominant pixels, ADER refines the initial estimate using downwelling-radiance compensation based on the complete radiative model. To reduce spectral redundancy, ADER also introduces a greedy band selection strategy based on multi-scene effective Fisher information for the distance parameter. Experiments on real scenes show that ADER recovers LiDAR-consistent spatial distance structures under both full-band and 20-band settings, improves ranging accuracy in the evaluated regions, and achieves approximately two orders of magnitude speedup over a public full-band hyperspectral ranging method.

31. 【2606.31814】Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior

Link: https://arxiv.org/abs/2606.31814

Authors: Jiahui Fu,Zehao Huang,Han Li,Naiyan Wang,Si Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: onboard sensor observations, topology reasoning aims, Lane topology reasoning, lane graph, Lane

Comments: ECCV 2026

Abstract:Lane topology reasoning aims to construct a lane graph from onboard sensor observations. Existing methods follow a detection and association paradigm that treats each lane instance independently, leading to geometric inconsistency at connected endpoints and incomplete graphs due to visual occlusions. To address these issues, we propose TopoGPT, a generative framework that learns the geometry prior from typical lane graph structures through autoregressive sequence modeling. Specifically, we construct a large-scale map dataset comprising 3.3M scenes. For each lane graph, a lane tokenizer serializes it into discrete tokens, while a scene context encoder converts it into a rasterized image and extracts global features as scene tokens. We pre-train an autoregressive lane sequence transformer via scene-conditioned next-token prediction, endowing the model with the geometry prior over lane graph structures. Building upon this prior, a perception adapter aligns BEV features from multi-view images with the pre-trained scene condition, transferring the learned geometry prior to sensor-based lane graph prediction. On the OpenLane-V2 benchmark, TopoGPT outperforms existing methods by an average of +6.4 on lane-level and +11.6 on point-level metrics, and produces geometrically consistent and structurally complete lane graphs.

32. 【2606.31811】MuSViT: A Foundation Vision Model for Sheet Music Representation

Link: https://arxiv.org/abs/2606.31811

Authors: Carlos Penarrubia,Antonio Rios-Vila,Eliseo Fuentes-Martinez,Juan C. Martinez-Sevilla,Francisco J. Castellanos,María Alfaro-Contreras,Jorge Calvo-Zaragoza

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: providing rich, processing by providing, transfer across diverse, music, Sheet music

Comments: Accepted at European Conference on Computer Vision (ECCV'26)

Abstract:Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.

33. 【2606.31785】Self-Supervised Temporal Regularization for Landmark-Based Cardiac Segmentation with Automatic AHA Regional Mapping

Link: https://arxiv.org/abs/2606.31785

Authors: David Montalvo-García,Nicolás Gaggion,María J. Ledesma-Carbayo,Enzo Ferrante

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: population-level analysis capabilities, Graph-based cardiac segmentation, image sequences exhibit, Graph-based cardiac, reliable clinical measurements

Comments: Accepted at MICCAI 2026

Abstract:Graph-based cardiac segmentation with implicit anatomical correspondences provides topological guarantees and population-level analysis capabilities, but models trained on independent frames of image sequences exhibit temporal discontinuities that affect reliable clinical measurements, particularly in cardiac ultrasound. In this work, we introduce self-supervised temporal regularization as a post-training refinement stage that exploits the temporal coherence in image sequences to enforce consistent cardiac segmentation and motion estimation over time, without requiring per-frame annotations. By penalizing velocity and acceleration discontinuities across consecutive frames, our method achieves temporally consistent segmentations while maintaining the learned anatomical correspondences. We further leverage these correspondences to automatically map landmarks to the AHA 17-segment clinical standard, enabling standardized regional assessment and detection of pathological myocardial motion patterns. Validation on CAMUS dataset demonstrates the clinical utility of combining temporal consistency with automatic regional mapping. The code is publicly available at this https URL
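The temporal regularizer described above can be illustrated with finite differences over landmark positions. This is a rough sketch only: the function name `temporal_smoothness_penalty`, the weights `w_vel` and `w_acc`, and the reading of the penalty as squared first- and second-order differences are assumptions, since the abstract does not state the exact loss.

```python
import numpy as np


def temporal_smoothness_penalty(landmarks, w_vel=1.0, w_acc=1.0):
    """Penalize frame-to-frame landmark motion (hypothetical formulation).

    landmarks: array of shape (T, N, 2) holding N 2D landmarks over T frames.
    """
    landmarks = np.asarray(landmarks, dtype=float)
    if landmarks.shape[0] < 3:
        raise ValueError("need at least 3 frames for second-order differences")
    # First-order difference approximates per-frame velocity: shape (T-1, N, 2).
    velocity = np.diff(landmarks, n=1, axis=0)
    # Second-order difference approximates acceleration: shape (T-2, N, 2).
    acceleration = np.diff(landmarks, n=2, axis=0)
    vel_term = np.mean(np.sum(velocity ** 2, axis=-1))
    acc_term = np.mean(np.sum(acceleration ** 2, axis=-1))
    return w_vel * vel_term + w_acc * acc_term
```

No per-frame labels appear in this term, which is what would allow it to be applied to unannotated sequences; a landmark moving at constant velocity incurs no acceleration penalty.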

34. 【2606.31781】SpikeLogBERT: Energy-Efficient Log Parsing Using Spiking Transformer Networks

Link: https://arxiv.org/abs/2606.31781

Authors: Thuan Bui,Duong Do,Tung Vu,Duc-Tho Mai,Cong-Kha Pham

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: transforming raw system, structured event templates, automated log analysis, raw system logs, system monitoring

Comments:

35. 【2606.31777】Mesh BDF: Barycentric Dominance Field for 3D Native Mesh Generation

Link: https://arxiv.org/abs/2606.31777

Authors: Gaochao Song,Haohan Weng,Luo Zhang,Zibo Zhao,Shenghua Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently achieved remarkable, achieved remarkable progress, discrete data structures, Barycentric Dominance Field, largely due

Comments: 15 pages, 6 figures

Abstract:Autoregressive (AR) modeling has recently achieved remarkable progress in native 3D mesh generation, largely due to its natural ability to handle variable-length, discrete data structures. However, the inherent constraints of the AR paradigm severely restrict the generated meshes, leading to limited face counts, bounded vertex resolutions, and difficulties in supporting textures. To overcome these bottlenecks, we propose the Barycentric Dominance Field (BDF), a continuous representation defined on triangular mesh surfaces that elegantly encodes vertex topological connectivity. BDF bridges the fundamental gap between discrete mesh topology and continuous diffusion-based generative modeling by transforming connectivity into a continuous surface signal. As an intrinsic mesh property, BDF shares strong similarities with texture maps, enabling its seamless integration into existing 3D diffusion pipelines without requiring architectural modifications. Extensive experiments demonstrate that BDF empowers diffusion models to generate native meshes with significantly higher quality, greater scalability, and stronger robustness compared to state-of-the-art autoregressive methods.

36. 【2606.31764】NURBS Splatting: A Unified Differentiable Rendering Framework for Vector Graphics

Link: https://arxiv.org/abs/2606.31764

Authors: Jingye Qiu,Shizhe Zhou

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely underexplored, splines remains largely, largely underexplored, graphics and design, differentiable vector renderers

Comments: Accepted to ECCV 2026

Abstract:Differentiable rendering of planar rational splines remains largely underexplored, despite their widespread use in vector graphics and design. Existing differentiable vector renderers primarily focus on Bézier curves and rely on analytic rasterization, which can suffer from gradient instability and limited flexibility. We propose NURBS Splatting, a unified framework that represents planar rational curves as continuous Gaussian fields. By sampling Gaussians along the curve parameter domain and inside closed regions, rendering is reformulated as a smooth accumulation process with stable gradients. Our method naturally supports long splines, rational weights, non-uniform knots, and closed-region filling. We demonstrate its effectiveness in calligraphy reconstruction, vectorization frameworks, and long-spline image abstraction, showing improved stability and reconstruction quality over existing approaches.

37. 【2606.31760】Estimating Velocity of Spheres from Rolling-Shutter Image(s)

Link: https://arxiv.org/abs/2606.31760

Authors: Wenjie Xue,Jun Yang,Jingmin Wang,Limin Shang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cameras introduce characteristic, Rolling-shutter cameras introduce, introduce characteristic distortions, imaging fast moving, fast moving objects

Comments:

Abstract:Rolling-shutter cameras introduce characteristic distortions when imaging fast moving objects, and these effects are typically treated as artifacts to be corrected. In this work, we instead leverage rolling-shutter distortions as a valuable source of temporal information to estimate the 3D translational and angular velocities of rapidly moving spherical objects from a single rolling-shutter frame. We design a robust and easily detectable spherical pattern and propose a correspondence-free formulation that recovers motion by enforcing geometric consistency in a back-projection framework. By exploiting the geometry of the sphere, translational and rotational motions are decoupled and estimated through a two-stage optimization process, enabling reliable velocity recovery even for textureless objects. Extensive experiments on both synthetic and real datasets demonstrate accurate and robust estimation of motion parameters under challenging high-speed conditions.

38. 【2606.31745】JL1-CCQA: Extending the JL1-CD Benchmark with Change Captioning and Question Answering

Link: https://arxiv.org/abs/2606.31745

Authors: Ziyuan Liu,Ruifei Zhu,Ouqiao Ma,Yuantao Gu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pixel-level binary segmentation, traditionally focuses, focuses on pixel-level, sensing change detection, Remote sensing

Comments: 10 pages, 8 figures

Abstract:Remote sensing change detection (CD) traditionally focuses on pixel-level binary segmentation, which identifies where changes occur but neither what nor why. To bridge this semantic gap, we introduce JL1-CCQA, a multi-task benchmark that extends the JL1-CD dataset with two complementary annotation layers: change captioning (CC) and change question answering (QA). Built upon 5,000 bi-temporal image pairs acquired by the Jilin-1 satellite at 0.5-0.75m ground sample distance, the benchmark comprises: (i) JL1-CC, providing 17,021 quality-verified captions that describe diverse land-cover transformations; and (ii) JL1-QA, offering 20,060 question-answer pairs across eight question types, enabling fine-grained, interactive interrogation of surface changes. All annotations are produced via a three-stage pipeline consisting of multi-modal large language model (LLM) generation, vision-grounded LLM judging, and human expert verification. We hope that JL1-CCQA, as a benchmark unifying binary change masks, change captions, and change-oriented QA over the same image set, will serve as a valuable resource for the community to advance multi-task change understanding in remote sensing. The dataset is available at this https URL.

39. 【2606.31736】Rhythm-Structured Predictive Learning for Remote Photoplethysmography

Link: https://arxiv.org/abs/2606.31736

Authors: Ba-Thinh Nguyen,Huu-Dung Nguyen,Thi-Duyen Ngo,Thanh-Ha Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: skin color variations, induced skin color, Remote photoplethysmography, subtle pulse induced, pulse induced skin

Comments:

Abstract:Remote photoplethysmography (rPPG) estimates physiological signals from facial videos by analyzing subtle pulse-induced skin color variations. Despite recent progress, existing self-supervised rPPG methods mainly reconstruct masked pixels or low-level visual representations, which can bias the model toward facial appearance rather than latent physiological dynamics. Moreover, most recent Mamba-based approaches scan facial video tokens only in chronological order, limiting their ability to exploit the cyclic structure of pulse signals. To address these limitations, we propose RhythmJEPA, a rhythm-structured joint-embedding predictive learning framework for rPPG. Instead of reconstructing RGB frames, RhythmJEPA predicts latent teacher representations from masked facial videos, thereby encouraging physiology-aware representation learning in the embedding space. To explicitly model pulse-related temporal structure, we introduce a Cyclic Rhythm-State Planner (CRSP), which estimates frame-wise latent physiological states and decodes the most plausible cyclic state path via dynamic programming with a constrained transition grammar. Guided by the decoded states, we further design a Dual Order Mamba Encoder (DOM), which combines conventional chronological scanning with state-ordered scanning to capture both local temporal continuity and long-range rhythm-consistent dependencies. Finally, a lightweight Spatial Pulse Mixer (SPM) extracts compact pulse-sensitive facial tokens with a favorable balance between complexity and performance. Experiments on PURE, UBFC-rPPG, and MMPD show competitive performance over representative rPPG methods. The codes are available at this https URL.
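Decoding a cyclic state path by dynamic programming, as the abstract describes for CRSP, resembles Viterbi decoding with a restricted transition set. The sketch below is an illustration under assumptions: the function name `decode_cyclic_path` is hypothetical, and the transition grammar used here (each state may only stay or advance to the next state in the cycle) is a guess, since the abstract does not specify the paper's grammar.

```python
import numpy as np


def decode_cyclic_path(log_scores):
    """Viterbi-style decoding over K cyclic states (assumed grammar).

    log_scores: array of shape (T, K) with per-frame log-scores for each state.
    Allowed transitions (assumption): s -> s (stay) or s -> (s + 1) % K (advance).
    """
    log_scores = np.asarray(log_scores, dtype=float)
    T, K = log_scores.shape
    best = np.full((T, K), -np.inf)   # best cumulative score ending in each state
    back = np.zeros((T, K), dtype=int)  # backpointers to the previous state
    best[0] = log_scores[0]
    for t in range(1, T):
        for s in range(K):
            # State s can be reached by staying in s or advancing from s - 1.
            candidates = (s, (s - 1) % K)
            prev = max(candidates, key=lambda q: best[t - 1, q])
            best[t, s] = best[t - 1, prev] + log_scores[t, s]
            back[t, s] = prev
    # Trace the best path backwards from the highest-scoring final state.
    path = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because every decoded transition obeys the grammar, a noisy per-frame estimate cannot produce a path that jumps backwards through the cycle.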

40. 【2606.31734】MemLearner: Learning to Query Context memory for Video World Models

Link: https://arxiv.org/abs/2606.31734

Authors: Jiwen Yu,Jianxiong Gao,Jianhong Bai,Yiran Qin,Kaiyi Huang,Quande Liu,Xintao Wang,Pengfei Wan,Kun Gai,Xihui Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predict future world, future world states, world states based, Video World Models, World Models

Comments: ECCV 2026, Project Page: [this https URL](https://yujiwen.github.io/memlearner/)

Abstract:Video World Models are interactive video generation models that predict future world states based on user actions and history video frames. A critical challenge in video world models is the lack of memory, causing inconsistent generated scenes over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize in scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method using query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy leveraging both annotated rendered and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in terms of scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.

41. 【2606.31732】UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

Link: https://arxiv.org/abs/2606.31732

Authors: Yaozhi Zheng,Yilei Jiang,Manyuan Zhang,Yuxuan Wan,Kaituo Feng,Tianshuo Peng,Bo Zhang,Xiangyu Yue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, standard Multimodal Large, Large Language Models, transforms scientific plots, Multimodal Large

Comments:

Abstract:Visual-to-Code generation, which transforms scientific plots, vector graphics, and webpages into executable scripts, demands a level of pixel-precise alignment that standard Multimodal Large Language Models (MLLMs) fail to achieve through Supervised Fine-Tuning (SFT) alone. While Reinforcement Learning (RL) offers a theoretical pathway to bridge this gap, its application is hindered by two fundamental obstacles: (1) Reward Coarseness, where semantic metrics like CLIP scores fail to penalize fine-grained element deviations, and (2) Exploration Stagnation, where the sparse, heterogeneous code search space prevents the policy from bootstrapping valid trajectories. To overcome these limitations, we introduce UniCoder, a unified RL framework that integrates two novel mechanisms. First, we propose Symbolic Attribute Alignment, which employs a lightweight auxiliary LLM to parse generated code into discrete visual attributes (e.g., hex colors, coordinate limits), enabling dense, element-wise reward computation. Second, to escape local optima, we devise Reference-Guided Code Optimization, a strategy that dynamically injects ground-truth trajectories into low-performing rollout groups, transforming blind exploration into guided policy improvement. Extensive experiments on ChartMimic, UniSVG, Design2Code and ScreenBench benchmarks demonstrate that our 8B-parameter model not only surpasses all open-source baselines but also achieves state-of-the-art performance comparable to proprietary models, establishing a new paradigm for generalized visual-to-code synthesis.

42. 【2606.31715】Semantic-Aware Multiple Access via Spatial Redundancy Exploitation for Uplink-Dominant 6G Use Cases

Link: https://arxiv.org/abs/2606.31715

Authors: Hamidreza Mazandarani,Masoud Shokrnezhad,Tarik Taleb,Onur Günlü

Categories: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited wireless resources, high-volume visual data, Emerging uplink-dominant, cooperative vehicular streaming, require efficient transmission

Comments:

Abstract:Emerging uplink-dominant 6G use cases, such as cooperative vehicular streaming, require efficient transmission of high-volume visual data over limited wireless resources. While semantic communications can reduce traffic by prioritizing task-relevant content, most existing approaches treat users independently and therefore overlook spatial redundancy among nearby devices' observations. This paper proposes a semantic-aware multiple access scheme that exploits overlapping fields of view among vehicular users to reduce redundant uplink transmissions. We formulate a joint perception and transmission control problem in which users decide which image patches to transmit, when to transmit them, and over which channel, subject to communication constraints. To address the resulting complexity, we introduce a practical two-phase approach. First, nearby vehicles share selected observation patches over Vehicle-to-Vehicle (V2V) links to calculate inter-user spatial redundancy. Second, users transmit only semantically important, non-redundant patches to the base station, where observations can be reconstructed using the received patches and complementary views from neighboring vehicles. Simulation results in a dense urban vehicular scenario demonstrate that our approach improves the proportion of users who achieve high-fidelity reconstruction, highlighting the potential of semantic-aware multiple access for sustainable and resource-efficient 6G uplink systems.

43. 【2606.31704】WIDER-FAIR: An Annotated Version of the WIDER-FACE Dataset for Fairness Evaluation

Link: https://arxiv.org/abs/2606.31704

Authors: Maxime Moussi,Benoît Ronval,Siegfried Nijssen,Félicien Schiltz

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: real-world applications raises, applications raises important, important fairness concerns, raises important fairness, showcase performance disparities

Comments:

Abstract:The deployment of face detection models in real-world applications raises important fairness concerns, as these systems may showcase performance disparities across demographic groups. A key obstacle to studying and mitigating such biases is the lack of face detection datasets with sensitive feature annotations. To address this gap, we introduce WIDER-FAIR, a new dataset built on the widely used WIDER-FACE benchmark, manually annotated with the perceived ethnicity and sex of each face. The dataset contains 16,256 images annotated across four ethnic groups: Asian, Black, Indian, and White, and two sex categories. We assess the quality and coherence of the annotations using face embeddings, a K-Nearest Neighbors classifier, and a t-SNE visualization, all of which support the consistency of the labeling process. As a demonstration of the dataset's potential, we train a YOLOv5 model and perform ablation studies on each sensitive feature. Among other findings, our experiments show that detection performance is notably lower for faces of Black individuals, and that excluding this group from training increases fairness disparity more than excluding any other ethnic group. These observations illustrate the value of demographically annotated datasets for understanding and evaluating bias in face detection models.

44. 【2606.31703】Phantom: A Unified Face-Swap Deepfake Protection Framework with Latent and Spatial Constraints

Link: https://arxiv.org/abs/2606.31703

Authors: Jungkon Kim,Cheolseung Jung,Jong-Min Choi,Juseong Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unauthorized identity manipulation, enabling unauthorized identity, identity manipulation, Face-swapping deepfakes pose, pose an escalating

Comments: Accepted to CVPR 2026 (Findings)

Abstract:Face-swapping deepfakes pose an escalating threat to personal privacy by enabling unauthorized identity manipulation. While adversarial approaches have demonstrated success against black-box face recognition (FR) models, their applicability to face-swapping scenarios remains underexplored. In particular, reliance on fixed or random targets yields ambiguous latent guidance, and the lack of explicit spatial constraints causes perturbations to spill into identity-irrelevant regions. These issues are further exacerbated by identity-style disentanglement, which suppresses adversarial signals during deepfake generation. In this paper, we present Phantom, a unified face-swap deepfake protection framework that jointly constrains perturbations in latent and spatial domains. Phantom adaptively synthesizes identity-shifted yet attribute-preserving targets to guide identity-aware latent optimization, and applies masked perturbations confined to semantically relevant facial regions. Extensive experiments on state-of-the-art face-swapping deepfakes demonstrate that Phantom improves protection success rates in dodging scenarios by 27.8%, 25.6%, and 16.6% on UniFace, INSwapper, and SimSwap, respectively, while also enhancing visual quality. Furthermore, Phantom generalizes to impersonation scenario, yielding up to 10.2% higher protection while improving perceptual fidelity. These results underscore the effectiveness of jointly leveraging latent and spatial constraints for robust and coherent facial privacy protection.

45. 【2606.31699】Look But Don't Touch with Sparse Autoencoders for Unlearning in Diffusion Models

Link: https://arxiv.org/abs/2606.31699

Authors: Enrico Cassano,Riccardo Renzulli,Rayyan Ahmed,Marco Grangetto,Stephan Alaniz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: controllable intervention points, recently been proposed, proposed as interpretable, serve as controllable, Sparse autoencoders

Comments:

Abstract:Sparse autoencoders (SAEs) have recently been proposed as interpretable tools for concept-level manipulation, under the assumption that isolated features can serve as controllable intervention points. In this work, we systematically evaluate this assumption in the context of object erasure and steering in diffusion models. We show that while SAEs reliably detect and localize semantic concepts within diffusion model activations, direct intervention in their latent space frequently induces out-of-distribution activations, resulting in severe visual artifacts. To disentangle detection from intervention, we use SAE activations purely as semantic detectors to identify image regions containing the target object, and replace those patch embeddings with the ones that do not contain it. This detection-based replacement preserves the diffusion model's activation statistics and produces significantly cleaner erasure results than latent steering. Our findings reveal a fundamental gap between concept detection and concept intervention in diffusion models: monosemantic or sparse features are not inherently suitable as control knobs for steering. These results position SAEs as powerful interpretability tools for analyzing generative models, but highlight important limitations when used for direct manipulation, such as unlearning.

46. 【2606.31695】Intrinsically Stable Spiking Neural Networks: Overcoming the Performance Barrier in the Absence of Batch Normalization

Link: https://arxiv.org/abs/2606.31695

Authors: Ruichen Ma, Xiaoyang Zhang, Jian Bai, Guanchao Qiao, Liwei Meng, Ning Ning, Yang Liu, Shaogang Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spiking neural networks, deep spiking neural, neural networks, Intrinsically Stable SNN, spiking neural

Comments: ECCV 2026 Accepted

Abstract:The performance of deep spiking neural networks (SNNs) often relies on batch normalization (BN). However, the advanced dynamic BN variants used in state-of-the-art models introduce runtime multiplications, which weaken the hardware-efficiency motivation of SNNs. To address this tension, we identify catastrophic firing-rate decay as a primary cause of severe performance degradation in normalization-free SNNs. Guided by this insight, this work proposes the Intrinsically Stable SNN (IS-SNN) architecture, which removes activation-normalization layers by enforcing signal homeostasis through topology-aware weight standardization and modified residual connections. By folding the standardization operations into static weights offline, IS-SNN removes the runtime statistics tracking and multiplications introduced by activation normalization, restoring an accumulation-oriented inference datapath. Comprehensive experiments show that IS-SNN achieves performance competitive with or superior to computationally expensive dynamic BN techniques across VGG, ResNet, and Transformer-based models. Notably, it achieves a competitive accuracy of 68.05% on ImageNet and overcomes the severe depth limitations of prior BN-free attempts. Together with a 96.4% reduction in FPGA lookup table resource consumption for neuron implementations, these results support IS-SNN as a practical framework for building accurate and hardware-friendly deep neuromorphic systems.

47. 【2606.31694】RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

Link: https://arxiv.org/abs/2606.31694

Authors: Jingbo He, Michael Färber, Roberto Calandra

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: manipulating open-world objects, robots manipulating open-world, open-world objects, manipulating open-world, representations must generalize

Comments:

48. 【2606.31688】Semantic Occupancy Prediction with Dual Range-Voxel Representation

Link: https://arxiv.org/abs/2606.31688

Authors: Sitao Chen, Zhuangwei Zhuang, Hui Luo, Lizhao Liu, Qingyao Wu, Mingkui Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving systems, semantic occupancy prediction, comprehensive scene representation, driving systems, occupancy prediction

Comments:

Abstract:LiDAR-based 3D semantic occupancy prediction, which aims to provide accurate and comprehensive scene representation, is crucial for autonomous driving systems. As point clouds suffer from sparsity and incompleteness, leading to insufficient semantic learning and difficult occupancy perception, existing methods often stack multi-sweep point clouds to obtain dense spatial information. However, such a naive strategy also results in efficiency (e.g., additional computational burden) and robustness (e.g., pose transformation noise) concerns, which hinder their practical applications. In this work, we propose a Dual Range-Voxel Representation (DRVR) that leverages the range-view context and voxel-view geometry of single-sweep point clouds for 3D semantic occupancy prediction, eliminating the concerns associated with the multi-sweeps. Specifically, we use the range-view encoder to extract the compact context of the scene. To fully exploit the spatial information, we design a geometry-aware voxel-view encoder that extracts multi-scale voxel-view features separately and combines them for better geometric occupancy prediction. Moreover, we propose a range-voxel fusion module to cooperate range- and voxel-view features via voxel-to-range and range-to-voxel fusions. Extensive experiments on nuScenes-Occupancy, SemanticKITTI and SemanticPOSS show the superiority of our method. Especially on nuScenes-Occupancy, our single-sweep DRVR achieves 5.4% improvement in mIoU and 2.1x acceleration compared to the multi-sweep method.

49. 【2606.31683】Histogram-constrained Image Generation

Link: https://arxiv.org/abs/2606.31683

Authors: Haoming Liu, Yuanhe Guo, Yijia Cao, Shenji Wan, Hongyi Wen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: complex data distributions, enabling high-fidelity sampling, generative modeling, enabling high-fidelity, dominant paradigm

Comments: Accepted to ECCV 2026; 31 pages, 16 figures

Abstract:Diffusion models have emerged as a dominant paradigm in generative modeling, enabling high-fidelity sampling from complex data distributions. Despite impressive capabilities, controlling diffusion models to produce outputs aligned with user intent remains an open challenge, especially when balancing global coherence with local precision. Existing control mechanisms vary in the granularity of their conditioning signals. For example, textual prompts guide generation globally through high-level semantics, while ControlNet-like approaches secure precise local structure via dense conditions. In this work, we introduce Histogram-constrained Image Generation (HIG), a novel control mechanism that falls into the middle ground of control granularity. Our framework enforces user-specified distributional constraints (e.g., color histograms or latent token distributions) during the generation process with exact precision. We model such control as an optimal transport (OT) problem and apply explicit guidance transformations during sampling, thereby driving the diffusion trajectory to align with the desired histogram. We demonstrate the versatility of HIG across diverse applications, including constrained generation via color/latent histograms and high-capacity information embedding through histogram-level encoding. Our findings underscore the promise of distributional control, a flexible and interpretable control scheme that is fully compatible with existing control mechanisms, diversifying the hybrid strategies for controllable image generation. Our project page is available at: this https URL.

50. 【2606.31680】ShellMaker: Language-Guided Exterior Completion under Structural Constraints

Link: https://arxiv.org/abs/2606.31680

Authors: Ruiqi Xu, Daniel Aliaga

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, generated interiors remains, interiors remains largely, synthesizing coherent building, synthesizing coherent

Comments: Accepted to ECCV 2026

Abstract:Despite advances in indoor scene generation, synthesizing coherent building exteriors consistent with generated interiors remains largely unexplored. Existing methods can generate floor plans and wall layouts but typically stop at a structural shell, lacking stylistically consistent facades and roofs. Completing these exteriors is challenging because the footprint, wall geometry, and opening semantics must remain fixed, constraints that unconstrained generative models often violate. We introduce ShellMaker, a language-guided exterior completion framework that operates under these structural constraints. Given a building scaffold and a text style prompt, ShellMaker generates a complete exterior mesh with PBR materials by combining parametric roof generation, LLM-based part-aware prompt refinement, joint wall-roof material retrieval, and geometry-aware assembly. Operating on a format-agnostic scaffold representation, ShellMaker generalizes to indoor generators, CityGML, and CAD inputs, while maintaining structural consistency and improving architectural coherence over retrieval and unconstrained generative baselines. The project page is available at this https URL.

51. 【2606.31679】Practical High-Fidelity Novel-View Synthesis of Mounted Lepidoptera

Link: https://arxiv.org/abs/2606.31679

Authors: Kristof Overdulve, Lode Jorissen, Nick Michiels

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural history collections, Mounted butterflies, history collections, striking objects, objects in natural

Comments:

Abstract:Mounted butterflies are among the most striking objects in natural history collections. However, their beauty is notoriously hard to digitize in 3D: they are small and fragile, with microscopic hairs and vein structures. Capturing them in sufficient detail, therefore, requires a macro lens, which has a very limited Depth of Field (DoF). Moreover, a camera body cannot be maneuvered beneath a pinned specimen to photograph its ventral surface (the underside of the wings). We introduce an end-to-end pipeline that resolves these challenges to turn such specimens into photo-realistic 3D models viewable from every direction. It combines three ingredients: handheld focus stacking for all-in-focus macro capture without a tripod, a non-contact first-surface mirror system that exposes the ventral surface without touching the specimen, and a segmentation-free, mirror-aware 3D Gaussian Splatting extension. We validate the reconstructions on four diverse specimens.

52. 【2606.31676】REDI: Corpus Aware Patch Ranking for DINOv3 Token Reduction

Link: https://arxiv.org/abs/2606.31676

Authors: Chanjong Im, Sebastian Diem, Thomas Mandl

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformers seek favorable, Vision Transformers seek, seek favorable tradeoffs, token reduction methods, Transformers seek

Comments: 10 pages, 2 figures, 3 tables

Abstract:Most token reduction methods for Vision Transformers seek favorable tradeoffs between accuracy and efficiency by pruning, merging, or pooling patch tokens. REDI (Relevance for DINOv3 Token Reduction) studies this question through a controlled supervised reference: how should a fixed token budget be allocated across patches for image classification? REDI quantizes final block DINOv3 patch representations into a visual vocabulary and derives class conditioned corpus scores using supervised TF-IDF over visual words. For each validation image, the ground truth class selects a row of the TF-IDF table, and four transformed views produce a TF-IDF map aligned to a reference center crop. A separate dense pass on the same crop provides an attention map. After independent min max normalization, their elementwise product defines the REDI score. A fixed keep, merge, and compress operator then uses score rank to assign patch roles and score magnitude to weight merging and compression. With precomputed REDI scores, a frozen DINOv3 ViT-B/16 backbone, and the same linear classifier used for dense evaluation, the operator reduces the sequence length from 201 to 107 tokens, a 46.8% sequence reduction. The REDI variant based on incoming attention mass achieves 84.706% Top-1 accuracy on ImageNet-1K, compared with 83.514% for the dense baseline, 82.634% for incoming attention mass alone, and 81.796% for supervised TF-IDF alone. The same corpus term also improves reduced classification for three alternative attention formulations relative to their attention only counterparts. Together, these controlled comparisons indicate that class specific corpus statistics and image specific attention provide complementary signals for patch ranking in this setting.
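The scoring step described in this abstract (independent min-max normalization of a TF-IDF map and an attention map, followed by an elementwise product and a rank over patches) can be sketched roughly as below. This is only an illustration under stated assumptions: the function names `min_max`, `redi_scores`, and `rank_patches` are hypothetical, the four-patch values are made up, and the keep/merge/compress operator itself is not reproduced. The last line simply rechecks the reported sequence reduction from 201 to 107 tokens.

```python
def min_max(values):
    """Rescale a list of floats to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def redi_scores(tfidf_map, attention_map):
    """Elementwise product of the two independently normalized maps."""
    t = min_max(tfidf_map)
    a = min_max(attention_map)
    return [x * y for x, y in zip(t, a)]


def rank_patches(scores):
    """Patch indices sorted from highest to lowest score (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example with four patches; the numbers are invented for illustration.
tfidf = [0.2, 0.8, 0.5, 0.2]
attn = [0.10, 0.30, 0.40, 0.05]
scores = redi_scores(tfidf, attn)
order = rank_patches(scores)

# Sequence reduction reported in the abstract: 201 tokens down to 107.
reduction = (201 - 107) / 201
```

Because the two maps are multiplied, a patch needs both a high corpus score and high attention to rank near the top, which is one way to read the abstract's claim that the two signals are complementary.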

53. 【2606.31672】WorldRoamBench: An Open-World Benchmark for Long-Horizon Stability of Interactive World Models

Link: https://arxiv.org/abs/2606.31672

Authors: Ting-Bing Xu, Jiacheng Sui, Zhe Gao, Kewei Shi, Wenjin Yang, Zhicheng Liu, Zhaoxu Sun, Mingchao Sun, Hongyu Pan, Fan Jiang, Mu Xu, Qi Fan, Yong Li, Baoquan Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: existing benchmarks evaluate, interactive world models, benchmarks evaluate action, rapid progress, progress in interactive

Comments:

Abstract:Despite rapid progress in interactive world models (IWMs), existing benchmarks evaluate action following only at trajectory level and ignore memory and interaction physics. We introduce WorldRoamBench, an open-world benchmark for long-horizon stability across four dimensions, each with tailored innovations: (i) Action: per-frame action metric bypassing cross-model semantic scale disparity and exposing failures hidden by trajectory; (ii) Vision: segment-based drift metric capturing non-monotonic mid-sequence collapse missed by start-vs-end comparisons; (iii) Physics: controllability-gated evaluation over mechanics, optics, and 3D consistency, scoring plausibility under faithful action execution; (iv) Memory: action-decoupled protocol evaluating scene memory via transition-localized 3D point-cloud reconstruction and subject memory via tracking-plus-VLM reasoning. The benchmark comprises 600+ test cases across Nature, Urban, and Indoor scenes in first/third-person views with WASD 10-60s continuous interaction. Evaluating 10+ open/closed-source models reveals none reliably satisfies all dimensions; even the best achieves only moderate scores. Advances on WorldRoamBench are steps toward IWMs that are stable, physically grounded, memory-faithful, and deployable in real-world applications.

54. 【2606.31668】SAMBA: A Scatter-Guided Masked Bidirectional Mamba Foundation Model for SAR Target Recognition

Link: https://arxiv.org/abs/2606.31668

Authors: Ke Wang, Xiaoyi Pan, Zhaoyu Gu, Xiaofeng Ai, Zhiming Xu, Feng Zhao, Shunping Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic aperture radar, annotated training data, aperture radar automatic, scarce annotated training, Synthetic aperture

Comments: 15 pages, 5 figures

Abstract:Synthetic aperture radar automatic target recognition (SAR ATR) is critical for Earth observation and defense, but its practical deployment is constrained by scarce annotated training data. Self-supervised pre-training alleviates this label bottleneck, yet prevailing Transformer architectures incur prohibitive quadratic computational complexity, and conventional universal masking neglects the unique electromagnetic scattering properties intrinsic to SAR imagery. To address these limitations, we propose SAMBA (Scattering-Guided Bidirectional Mamba), an efficient self-supervised pre-training foundation model for SAR target interpretation. Our framework features three core innovations: (i) a linear-complexity Mamba encoder with a mid-sequence class token to mitigate computational bottlenecks; (ii) a three-level hierarchical Scattering-Guided Masked Autoencoder (SG-MAE) masking strategy guided by SAR physical priors, aligning the pretext task with SAR's intrinsic imaging mechanism; (iii) a lightweight SpatialMix feature interaction module to enhance cross-region feature fusion. We also design a two-stage cross-domain pre-training pipeline to optimize the overall pre-training process. Extensive evaluations demonstrate that SAMBA consistently delivers superior performance across all pre-training configurations, with substantially fewer parameters than both CNN and Transformer baselines. Compared with the default masking strategy in standard MAE, the proposed SG-MAE strategy further boosts the model's few-shot transfer capability. Benchmarking on seven downstream datasets covering classification and detection tasks shows SAMBA achieves state-of-the-art (SOTA) performance on most metrics, fully validating its robust generalizability across diverse SAR interpretation tasks. Source code and pre-trained weights are publicly available at this https URL.

55. 【2606.31664】Sparsity-Inducing Divergence Losses for Biometric Verification

Link: https://arxiv.org/abs/2606.31664

Authors: Dimitrios Koutsianos, Ladislav Mošner, Yannis Panagakis, Themos Stafylakis

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: margin-penalty softmax losses, largely driven, driven by margin-penalty, divergence loss functions, divergence loss

Comments: Accepted at ECCV 2026

Abstract:Performance in face and speaker verification is largely driven by margin-penalty softmax losses such as CosFace and ArcFace. Recently introduced $\alpha$-divergence loss functions offer a compelling alternative, particularly due to their ability to induce sparse solutions (when $\alpha > 1$). However, standard geometric margins are designed for the softmax function and do not naturally extend to this generalized probabilistic framework. In this paper we propose Q-Margin, a novel $\alpha$-divergence loss that introduces a principled probabilistic margin. Unlike conventional methods that apply geometric penalties to the logits (unnormalized log-likelihoods), Q-Margin encodes the margin penalty directly into the reference measure (prior probabilities). This formulation naturally encourages discriminative embeddings while preserving the beneficial sparsity properties of the $\alpha$-divergence. We demonstrate that Q-Margin achieves competitive or superior performance on the challenging IJB-B and IJB-C face verification benchmarks and similarly strong results in speaker verification on VoxCeleb. Crucially, against ArcFace and CosFace baselines trained under an identical recipe, Q-Margin consistently improves at low False Acceptance Rates (FARs), a capability critical for practical high-security applications. Finally, the extreme sparsity of the Q-Margin posteriors enables exact and memory-efficient training, offering a scalable solution for datasets with millions of identities.

56. 【2606.31654】DynFly: Dynamic-Aware Continuous Trajectory Generation for UAV Vision-Language Navigation in Urban Environments

Link: https://arxiv.org/abs/2606.31654

Authors: Wen Jiang, Hanfang Liang, Li Wang, Kangyao Huang, Wang Xu, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hongwei Duan, Bin Xu, Xiangyang Ji, Huaping Liu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly improved UAV, improved UAV vision-language, multimodal large models, Recent advances, UAV vision-language navigation

Comments: 34 pages, 9 figures

Abstract:Recent advances in multimodal large models have significantly improved UAV vision-language navigation (UAV-VLN) by enhancing high-level perception and reasoning. However, existing methods mainly focus on predicting discrete actions, local targets, or sparse waypoints, while the continuous transition from navigation intent to executable UAV motion remains weakly modeled. This motion-interface gap limits the continuity, stability, and executability of generated UAV trajectories. To address this gap, we propose DynFly, a dynamic-aware continuous trajectory generation framework that bridges high-level navigation reasoning and executable UAV motion through a lightweight trajectory generation layer. Specifically, it represents expert trajectories in B-spline control-point space and employs a Spline-DiT generator to learn conditional trajectory generation via flow matching. Furthermore, we introduce UAV-oriented dynamic-aware supervision over position, finite-difference velocity, finite-difference acceleration, heading consistency, and local target alignment, enabling the generated trajectories to better satisfy UAV motion characteristics. Our trajectory generation framework can also be integrated with an existing UAV-VLN framework while preserving its original visual-language reasoning pipeline. Extensive experiments on the OpenUAV UAV-VLN benchmark show that DynFly improves both navigation performance and trajectory quality. On the Test Unseen Full split, DynFly improves the strongest baseline by 4.69 NDTW, 2.40 SDTW, 2.14 SR points and 4.87 OSR points, while reducing NE by 4.51 m.
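To make the "finite-difference velocity" and "finite-difference acceleration" terms in this abstract concrete, here is a minimal sketch, assuming uniformly sampled 2D waypoints and a plain L1 comparison between generated and expert trajectories. The abstract does not state the loss form, so `l1_gap`, the time step `dt`, and the toy trajectories are all illustrative assumptions rather than the paper's formulation.

```python
def finite_differences(points, dt):
    """Return (velocities, accelerations) from uniformly sampled waypoints."""
    vel = [tuple((b - a) / dt for a, b in zip(p, q))
           for p, q in zip(points, points[1:])]
    acc = [tuple((w - v) / dt for v, w in zip(u, s))
           for u, s in zip(vel, vel[1:])]
    return vel, acc


def l1_gap(pred, target):
    """Mean absolute difference between two equally shaped lists of tuples."""
    terms = [abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t)]
    return sum(terms) / len(terms)


# Toy trajectories (hypothetical): the expert accelerates, the generated one does not.
expert = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
generated = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

v_exp, a_exp = finite_differences(expert, dt=1.0)
v_gen, a_gen = finite_differences(generated, dt=1.0)

velocity_loss = l1_gap(v_gen, v_exp)
acceleration_loss = l1_gap(a_gen, a_exp)
```

Supervising the differences as well as the positions penalizes trajectories that pass near the right waypoints but with the wrong speed profile.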

57. 【2606.31645】Technical Report of RoboSpatial Challenge at CVPR 2026: Selective Reasoning Activation and Reference-Frame Disambiguation for Embodied Spatial Reasoning

Link: https://arxiv.org/abs/2606.31645

Authors: Yuxiang Xie, Qi Lv, Jianming Xing, Zijian Hong, Xiang Deng, Weili Guan, Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language models achieve, models achieve strong, achieve strong general, strong general perception, Vision-language models

Comments:

Abstract:Vision-language models achieve strong general perception but often struggle with the spatial reasoning required for embodied tasks. We present RoboSpatialBrain, our submission to the RoboSpatial Challenge at the Embodied Reasoning in Action Workshop, CVPR 2026, built on RoboBrain2.5-8B-NV. RoboSpatialBrain combines two training-free, inference-time mechanisms: a forced think prefix activation strategy paired with a task-specific post-prompt that elicits deliberate reasoning on context and compatibility tasks, and an explicit reference-frame redirection pipeline that resolves camera-centric and object-centric ambiguity for context tasks. We additionally explore fine-tuning RoboBrain2.5 on compatibility data and present a detailed analysis of its interaction with prompting. RoboSpatialBrain achieved first place in the RoboSpatial Challenge, with an overall success rate of 80.9% on RoboSpatial-Home. Code is available at this https URL.

58. 【2606.31636】LiteMatch: Lightweight Zero-Shot Stereo Matching via Cost Volume Stabilization

Link: https://arxiv.org/abs/2606.31636

Authors: Md Raqib Khan, Santosh Kumar Vipparthi, Subrahmanyam Murala

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cost volume processing, learning-based stereo matching, cost volume, computationally intensive, resulting in substantial

Comments:

Abstract:Despite rapid progress in learning-based stereo matching, high accuracy is often achieved at the cost of heavy backbones and computationally intensive 3D cost volume processing, resulting in substantial memory and runtime overhead. More critically, these methods frequently struggle to generalize across domains, limiting their practical deployment. We present LiteMatch, a lightweight stereo matching framework that achieves strong zero-shot generalization through cost volume stabilization, without expensive 3D convolutions. LiteMatch employs two complementary encoders: a Cross-View Correspondence Encoder (CVCE) to capture global cross-view interactions, and a High-Frequency Encoder (HFE) that enhances fine structural details via FFT-based frequency cues. To stabilize the cost volume, we introduce the Cost Volume Consistency Loss (CVC-Loss), a voxel-wise binary cross-entropy objective applied to softmax-normalized cost distributions. By encouraging sharp and unimodal disparity probabilities, CVC-Loss promotes stable cost distributions and enables rapid convergence. A lightweight refinement module further produces sharp full-resolution disparities with low-iteration updates, avoiding heavy recurrent refinement. With a flexible design ranging from 3.36M to 9.58M parameters, LiteMatch achieves exceptional zero-shot generalization, delivering competitive EPE and D1 performance across Scene Flow, KITTI, Middlebury, ETH3D, and DrivingStereo. Our results establish that lightweight architectures can indeed generalize across domains without sacrificing accuracy. Code: this https URL.

59. 【2606.31626】PrISM-IQA: Image Quality Assessment Made Practical for Smartphone Photography

Link: https://arxiv.org/abs/2606.31626

Authors: Shuyan Zhai, Jiaqi He, Weixia Zhang, Liang Wang, Zhenjie Lee, Zufeng Zhang, Kede Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: methods commonly reduce, Existing smartphone image, commonly reduce perceptual, Existing smartphone, methods commonly

Comments:

Abstract:Existing smartphone image quality assessment (IQA) methods commonly reduce perceptual quality to a single score. However, this scalar formulation is poorly aligned with practical image signal processor (ISP) tuning, where engineers must identify specific quality issues, estimate their severities, and determine whether they are acceptable or require intervention. In this work, we introduce a Practical ISP-aware Structured Model for IQA (PrISM-IQA), which reformulates smartphone IQA as a multi-issue ordinal diagnosis problem. Rather than regressing a single quality score, PrISM-IQA predicts an ordered severity level (absent, minor, severe, or critical) for each ISP-relevant issue, covering both global image-level artifacts and local content-dependent defects. To produce logically consistent predictions, PrISM-IQA combines cumulative ordinal encoding with structured inference that captures within-issue monotonicity as well as cross-issue subsumption and exclusion relations. We evaluate PrISM-IQA on a reconstructed SPAQ benchmark annotated with 53 ISP-relevant quality issues and on a small-scale expert-annotated real-world dataset. Experimental results demonstrate the effectiveness of PrISM-IQA for practical issue-level diagnosis, reveal transferable perceptual quality representations through linear probing, and further show how its predictions can support actionable and meaningful ISP tuning.
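For readers unfamiliar with cumulative ordinal encoding, the idea mentioned in this abstract can be illustrated with the four severity levels it names. The sketch below is a generic version of the technique, not the paper's implementation: `encode`, `decode`, and the 0.5 threshold are assumptions, and the cross-issue subsumption and exclusion relations are not modeled. Each level is encoded as a run of "exceeds threshold t" bits, and decoding stops at the first failed threshold so the prediction stays monotone within an issue.

```python
LEVELS = ["absent", "minor", "severe", "critical"]


def encode(level):
    """Cumulative encoding: bit t is 1 when the level is strictly above rank t."""
    k = LEVELS.index(level)
    return [1 if k > t else 0 for t in range(len(LEVELS) - 1)]


def decode(probs, threshold=0.5):
    """Count thresholds passed in order, stopping at the first failure.

    Stopping early enforces monotonicity: a later bit cannot fire once an
    earlier one has failed, even if its raw probability is high.
    """
    k = 0
    for p in probs:
        if p < threshold:
            break
        k += 1
    return LEVELS[k]


codes = {level: encode(level) for level in LEVELS}
# A non-monotone prediction: the third bit is high although the second failed.
predicted = decode([0.9, 0.2, 0.7])
```

With this encoding, "severe" shares its first bit with "minor", so a model is penalized less for confusing adjacent severities than for confusing distant ones.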

60. 【2606.31613】Robust Autonomous UAV Landing on Maritime Platforms via Multimodal Agentic AI and Active Wave Compensation

Link: https://arxiv.org/abs/2606.31613

Authors: Francisco S. Neves, Pedro N. Pereira, Raul D.S.G. Campilho, Andry M. Pinto

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Autonomous aerial inspection, Unmanned Aerial Vehicle, Autonomous aerial, stochastic sea states, Unmanned Surface Vehicle

Comments:

Abstract:Autonomous aerial inspection of marine infrastructure is frequently compromised by stochastic sea states, introducing risks of high-kinetic impacts, post-landing toppling, and sensory occlusion. This paper proposes a decoupled, multi-vehicle landing framework synchronizing an Unmanned Surface Vehicle (USV) equipped with a 3-RPU stabilized platform with a robust Unmanned Aerial Vehicle (UAV). The architecture utilizes two independent Deep Reinforcement Learning (DRL) agents: a Soft Actor-Critic (SAC) agent providing high-frequency wave-motion compensation for the landing deck, and a multimodal RL agent for the UAV's final approach. Evaluated in high-fidelity maritime simulations, the system achieved a 100% landing success rate across 15 trials in wave states varying from calm to rough. Results show a mean stabilization efficacy of 87.8%, maintaining the landing surface within 1 degree of the horizontal plane for 96% of the mission duration in rough conditions, effectively contributing to safer landings.

61. 【2606.31612】What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States

Link: https://arxiv.org/abs/2606.31612

Authors: Chen Liu, Ling Chen, Hanzhang Zhou, Xu Zhang, Quyu Kong, Panrong Tong, Wenhao Wang, Xin Yu, Steven Hoi, Yue Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: GUI agents increasingly, Mobile GUI agents, reusing task-relevant data, agents increasingly face, Mobile GUI

Comments:

Abstract:Mobile GUI agents increasingly face long-horizon tasks that require reading, updating, and reusing task-relevant data across pages and applications. Existing memory methods treat memory largely as passive storage, where past observations are accumulated and retrieved when needed. Yet retrieving a value does not reveal its current role in the workflow. The agent must still infer from accumulated records whether the value should be used now, has already been used, or must wait for a later dependency. This implicit reconstruction becomes unreliable in long trajectories with similar fields, repeated values, distractors, and outdated states, causing repeated or missed operations. We propose Active Task Driving Memory (ATMem), which shifts GUI-agent memory from passive storage to an actively maintained execution state. ATMem maintains task-relevant information as a continually updated execution state that links each value to its role and current status, enabling action selection based on the current workflow state. We therefore introduce STR-GRPO, an online reinforcement learning method that learns to use ATMem selectively according to its contribution to task completion. STR-GRPO contrasts memory-on and memory-off rollouts to estimate when memory use improves execution, while memory-cost-aware reward discourages costly memory usage that does not improve execution. To evaluate whether agents can complete all in-scope work while avoiding out-of-scope actions over long-horizon execution, we build a challenging mobile benchmark. From a list of near-identical entries, agents must act on every entry that satisfies the instruction and reject entries that violate its constraints.

62. 【2606.31609】Learning Structurally Consistent Representations for Multi-View Radar Semantic Segmentation

Link: https://arxiv.org/abs/2606.31609

Authors: Ali Zia, Muhammad Umer Ramzan, Abdelwahed Khamis, Usman Ali, Abdul Rehman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: make dense semantic, sensors provide reliable, semantic segmentation challenging, dense semantic segmentation, semantic measurements make

Comments:

Abstract:Radar sensors provide reliable perception under adverse weather and lighting conditions, but their sparse, noisy, and weakly semantic measurements make dense semantic segmentation challenging. Most existing radar segmentation methods rely on grid-based encodings and pairwise interactions, which struggle to capture the higher-order relational structure formed by multiple radar returns from the same physical object. We introduce a unified higher-order structural alignment framework for multi-view radar segmentation. The proposed method refines radar feature representations using learnable hypergraphs to capture higher-order dependencies among spatially related responses. To ensure consistency across heterogeneous radar projections, we further align view-specific features using Unbalanced Optimal Transport (UOT), enabling correspondence-free alignment under varying measurement densities and partial observations. An adaptive attention mechanism then fuses complementary radar views while emphasising structurally informative responses under sparsity and noise. The resulting architecture learns structurally consistent representations across Range Angle (RA), Range Doppler (RD), and Angle Doppler (AD) views and is trained using supervised segmentation together with cross-view consistency regularisation. Experiments on the CARRADA and RADIal benchmarks demonstrate consistent improvements over strong radar-specific baselines, achieving 63.8% mIoU on CARRADA and 83.4% mIoU on RADIal, improving the previous best methods by +1.7 and +2.3 mIoU, respectively. These results highlight the importance of higher-order relational modelling for robust radar perception.

63. 【2606.31603】Preserve the Hard, Regenerate the Rest: Uncertainty-Guided Synthetic Training Data Augmentation with Diffusion Models

Link: https://arxiv.org/abs/2606.31603

Authors: Nikolai Röhrich, Julian Gleißner, Ahmed H. A. Ibrahim, Silvan Mertes, Tobias Huber

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: autonomous mobility data, visually diverse regions, visually diverse, autonomous mobility, segmentation models struggle

Comments: 13 pages, 7 figures

Abstract:Semantic segmentation models struggle with data sparsity and rare or visually diverse regions, e.g., dense regions or small objects in aerial or autonomous mobility data. While synthetic augmentation is an appealing solution, directly generating new labeled data risks misalignment of labels and generated pixels. Existing solutions to this problem often rely on external models, or employ coarse heuristics such as indiscriminately augmenting all foreground objects or entire backgrounds, which wastes capacity on uninformative pixels. To address this, we propose an uncertainty-guided synthetic context augmentation strategy that strictly preserves label validity and efficiently maximizes pixel informativeness per synthetic sample - no external guardrails required. Using a baseline segmenter's predictive entropy, we identify uncertain semantic regions and inpaint only the complementary visual context. When fine-tuning the segmenter on this synthetic data, we compute the loss only over the original pixels, excluding inpainted regions. This focuses learning on the unmodified, uncertain regions while presenting them in novel contexts. We demonstrate substantial mIoU gains on Cityscapes, UAVID, and BDD100K with the largest gains on rare and difficult classes such as buses, trains, or (from the aerial perspective) cars. Our results demonstrate that uncertainty-guided context augmentation is a highly effective lever to improve segmentation performance on complex datasets, with code provided at this https URL.
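The two mechanisms this abstract describes, selecting uncertain regions by predictive entropy and then computing the fine-tuning loss only over the preserved (non-inpainted) pixels, can be sketched per pixel as follows. This is a simplified illustration under assumptions: the paper works with semantic regions rather than independent pixels, and the threshold value, function names, and toy probabilities here are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertain_mask(prob_map, threshold):
    """True for pixels to preserve (high entropy); the rest would be inpainted."""
    return [entropy(p) >= threshold for p in prob_map]


def masked_cross_entropy(prob_map, labels, keep):
    """Mean cross-entropy over preserved pixels only; inpainted pixels are skipped."""
    losses = [-math.log(p[y]) for p, y, k in zip(prob_map, labels, keep) if k]
    return sum(losses) / len(losses) if losses else 0.0


# Three pixels, two classes; the values are invented for illustration.
prob_map = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4]]
labels = [0, 0, 1]

keep = uncertain_mask(prob_map, threshold=0.5)
loss = masked_cross_entropy(prob_map, labels, keep)
```

Since the labels of preserved pixels are untouched by inpainting, restricting the loss to them keeps every supervised pixel paired with a valid label.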

64. 【2606.31599】Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning

Link: https://arxiv.org/abs/2606.31599

Authors: Kaitao Chen, Weiqian Zhao, Jiamin Wu, Qihao Zheng, Shangquan Sun, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ignite remarkable progress, inform clinical decision-making, typically exhibit extremely, exhibit extremely sparse, extremely sparse visual

Comments: ICML2026

Click to view abstract

Abstract:Vision-language models (VLMs) combining reinforcement learning (RL) ignite remarkable progress in multimodal reasoning, yet still struggle with medical images, which typically exhibit extremely sparse visual evidence to inform clinical decision-making. We recognize that pruning visual tokens outside the grounding region greatly enhances medical reasoning. However, a united RL framework for active visual token pruning (VTP) and medical multimodal reasoning remains unestablished. Here, we propose a dual-stream RL framework, ViToS, to fulfill token pruning and question answering. ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after VTP. Furthermore, we solve the coupled policy learning problem by introducing the cross-feedback sequential optimization, avoiding gradient conflict and facilitating convergence of the shared policy model. Evaluated on seven medical benchmarks, our method reduces visual tokens to 77% of the original sequence length while achieving a 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B. Overall, ViToS delivers superior performance and inference speedup, establishing an efficient paradigm for medical multimodal reasoning.

65. 【2606.31585】DPPE: Rethinking Camera-Based Positional Encoding for Scaling Multi-View Transformers

Link: https://arxiv.org/abs/2606.31585

Authors: Shun Kenney, Teppei Suzuki

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: providing spatial cues, positional encoding, camera-aware positional encoding, Transformers has expanded, camera-based positional encoding

Comments:

Click to view abstract

Abstract:The remarkable scalability of Transformers has expanded their application to 3D computer vision, where camera-aware positional encoding is crucial for providing spatial cues in multi-view geometry. Recent advancements have established the practice of using camera parameters -- such as extrinsics or projection matrices -- as relative positional encoding into the query, key, and value vectors of the attention mechanism. However, when scaling up the training recipe of novel view synthesis (NVS) models with the camera-based positional encoding, we observe a significant issue: model performance stagnates in the late stages of training. In this paper, we investigate the cause of the performance bottleneck when scaling up and demonstrate that storing rotation and translation given by the positional encoding in the same dimensions of the value vector causes indeterminacy in their independent identification, hindering training scalability. To address this, we propose Decoupled Pose Positional Encoding (DPPE), a novel camera-based positional encoding that explicitly decouples rotation and translation. Extensive evaluations on NVS tasks demonstrate that DPPE enables stable long-term training even in scaled-up training setup. Furthermore, it exhibits superior generalization performance in extrapolation settings, such as handling an increased number of viewpoints and zoom-in scenarios.

66. 【2606.31577】Localized Conformal Prediction for Image Classification with Vision-Language Models

Link: https://arxiv.org/abs/2606.31577

Authors: Clément Fuchs, Tim Bary, Benoît Macq

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: attracted significant attention, uncertainty quantification, field of uncertainty, conformal predictions literature, Conformal predictions

Comments: 7 pages, 2 figures, 3 tables, code available, accepted to EUVIP 2025

Click to view abstract

Abstract:Conformal predictions have attracted significant attention in the field of uncertainty quantification, mainly because of their strong marginal coverage guarantees. Full conditional guarantee is not an attainable goal, a well known fact in conformal predictions literature. As a result, several approaches have tried to approximate this behavior by adapting the conformal sets of test-time samples according to their similarity to calibration examples. Although the latter has gained traction and shown impressive performances for regression problems, its application to image classification remains under-explored. We conduct an extensive benchmarking on natural image classification tasks with vision-language models (VLMs), using our open source implementation of a recent localized conformal prediction algorithm. We show that straightforward usage of the cosine similarity between test-time and calibration visual features, an intuitive choice for VLMs, is not sufficient to improve over the non-local baselines. In response, we propose a simple non-linear transformation of the cosine similarities, which conserves marginal coverage guarantees and achieves statistically significant mean set sizes reduction. Code is available at this https URL.
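
For readers unfamiliar with the non-local baseline mentioned in this abstract, standard split conformal prediction for classification can be sketched as follows. This is a minimal illustration, not the localized algorithm or the paper's similarity transformation: the nonconformity score `1 - p`, the function names, and the toy calibration scores are assumptions chosen for clarity.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(cal_scores)[rank - 1]

def prediction_set(class_probs, threshold):
    """All classes whose nonconformity score 1 - p stays within the threshold."""
    return [k for k, p in enumerate(class_probs) if 1.0 - p <= threshold]

# Hypothetical calibration scores: 1 - p(true class) on held-out examples.
cal_scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80]
q_hat = conformal_threshold(cal_scores, alpha=0.25)
pred_set = prediction_set([0.5, 0.4, 0.1], q_hat)
```

Localized variants replace the single threshold with one that weights calibration examples by their similarity to the test sample, which is where the choice of similarity function discussed in the abstract comes in.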

67. 【2606.31574】Temperature Field Reconstruction of Tungsten Monoblock Divertor on EAST using Physics-aware Neural Operator Transformer

Link: https://arxiv.org/abs/2606.31574

Authors: Zikang Yan, Xiao Wang, Qingquan Yang, Zhendong Yang, Gaoting Chen, Zehua Chen, Bo Jiang, Jin Tang, Guosheng Xu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: preventing material melting, divertor temperature field, Accurate modeling, Finite Element Method, divertor temperature

Comments:

Click to view abstract

Abstract:Accurate modeling of the divertor temperature field is essential for preventing material melting and damage and for extending the service life of fusion devices. However, conventional numerical methods, such as the Finite Element Method (FEM), are computationally expensive and therefore unsuitable for real-time applications. Therefore, a fast and generalizable method is required for real-time reconstruction of the divertor temperature field and subsequent real-time control. To address the above issue, we propose a Physics-aware Neural Operator Transformer (PNOT) to characterize the spatiotemporal evolution of the divertor temperature field. It models boundary heat-flux relations as a structured graph and employs graph attention to explicitly capture spatial physical dependencies. Inspired by physics-aware attention, we further develop a physics-aware neural operator module to aggregate query points with similar physical conditions via slicing and model heat diffusion, while a gradient-constrained Sobolev regularization loss enforces consistency between function values and their derivatives. Experimental results show that these physical constraints improve prediction accuracy while preserving physical consistency. The source code of this paper will be released on this https URL

68. 【2606.31570】Mitigating Positional Leakage in 3D Masked Autoencoders for Robust Representation Learning

Link: https://arxiv.org/abs/2606.31570

Authors: Xu Yan, Huiqun Wang, Chen Wang, Lei Ren, Di Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Masked autoencoding, prominent paradigm, paradigm for self-supervised, masked autoencoding directly, achieving competitive performance

Comments:

Click to view abstract

Abstract:Masked autoencoding has emerged as a prominent paradigm for self-supervised learning on 3D point clouds, achieving competitive performance across downstream tasks. Unlike its 2D counterpart, 3D masked autoencoding directly reconstructs spatial coordinates, making it inherently susceptible to positional leakage. In this work, we identify that the decoder in existing 3D MAE frameworks tends to over-rely on positional information, which weakens semantic representation learning and leads to suboptimal feature quality. To address this issue, we propose MPL-MAE, a masked point learning framework that mitigates positional over-reliance while enhancing the utilization of encoder features. Specifically, we introduce a recalibrated positional embedding module that suppresses metric-dominant coordinate signals while preserving geometric topology, together with a gated positional interface module that dynamically regulates positional injection during reconstruction. These designs promote a more balanced interaction between spatial priors and semantic features, yielding robust and informative representations. Extensive experiments across downstream tasks demonstrate that MPL-MAE consistently achieves competitive performance, validating its effectiveness. Code is available at this https URL.

69. 【2606.31556】AugSplat: Radiance Field-Informed Gaussian Splatting for Sparse-View Settings

Link: https://arxiv.org/abs/2606.31556

Authors: Lorenzo Lazzaroni, Riccardo Bollati, Daniel Barath, Michael Niemeyer, Keisuke Tateno

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating high-quality, frame rates remains, Gaussian Splatting, real-time frame rates, frame rates

Comments: 9 pages, 5 figures

Click to view abstract

Abstract:Generating high-quality novel views at real-time frame rates remains a central challenge in 3D vision, particularly in sparse-view scenarios. Neural radiance fields have demonstrated robust reconstruction from limited observations, but their reliance on volumetric rendering leads to high computational cost and slow inference. In contrast, Gaussian Splatting methods achieve real-time rendering through rasterization, but their optimization is highly sensitive to the quality of the initial geometry. This sensitivity becomes especially problematic in sparse-view settings, where limited observations often lead to incomplete or noisy point-cloud reconstructions. In this work, we present AugSplat, a simple framework for improving Gaussian Splatting in sparse-view regimes using radiance-field-based view augmentation. We first train a radiance field on the sparse input views and use it to synthesize additional images from nearby novel viewpoints, increasing the effective view-space coverage available for supervision. These synthetic views are then used as auxiliary supervision during Gaussian Splatting optimization. We study two variants: Staged AugSplat, which uses synthetic views for an initial optimization phase before switching to real images, and Dual AugSplat, which jointly trains on real and synthetic views with a decaying synthetic loss weight. Experiments on sparse-view mip-NeRF 360 scenes show that AugSplat improves reconstruction quality over standard Gaussian Splatting. Staged AugSplat achieves the strongest average performance, while Dual AugSplat provides a closely performing formulation that keeps real-image supervision active throughout training, and both variants preserve real-time rendering at inference.
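
The difference between the two variants described in this abstract comes down to how the real and synthetic losses are weighted over training. The sketch below is illustrative only: the abstract does not state the decay schedule, so the exponential half-life form, the parameter names, and the default values are assumptions.

```python
def staged_weights(step, switch_step):
    """Staged variant: synthetic views only, then real views only.
    Returns (real_weight, synthetic_weight)."""
    return (0.0, 1.0) if step < switch_step else (1.0, 0.0)

def dual_weights(step, initial_weight=1.0, half_life=1000):
    """Dual variant: the real loss stays on throughout; the synthetic weight
    halves every `half_life` steps (assumed schedule)."""
    return (1.0, initial_weight * 0.5 ** (step / half_life))

def total_loss(real_loss, synthetic_loss, weights):
    """Weighted sum of the two supervision terms."""
    w_real, w_syn = weights
    return w_real * real_loss + w_syn * synthetic_loss

# With hypothetical loss values of 2.0 (real) and 4.0 (synthetic) at step 1000:
combined = total_loss(2.0, 4.0, dual_weights(1000))
```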

70. 【2606.31537】DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation

Link: https://arxiv.org/abs/2606.31537

Authors: Siyu Yan, Yizhen Gao, Yilin Wang, Dongxing Mao, Alex Jinpeng Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: visually realistic images, semantically aligned, simultaneously produce visually, produce visually realistic, render legible

Comments:

Click to view abstract

Abstract:Text-rich image generation is one of the most challenging settings in image generation, since models must simultaneously produce visually realistic images and render legible, semantically aligned, and layout-consistent text. Existing data pipelines usually follow a static crawl-filter-freeze paradigm. They collect candidate samples, filter them once, and freeze the accepted data for training. However, rejected samples are usually discarded, although they often contain useful failure signals such as OCR errors and semantic mismatches. As a result, later construction rounds may repeat the same failure modes. To address these limitations, we propose DataEvolver, a self-evolving multi-agent framework for text-rich image data construction. DataEvolver treats data construction as feedback-driven construction policy evolution. A Retriever collects candidate samples, a Verifier assigns quality scores and rejection causes, a Critic summarizes round-level feedback into semantic feedback, and a Generator completes under-covered regions through targeted synthesis. The updated feedback memory then guides the next construction round. Experiments on text-rich image generation benchmarks show that DataEvolver produces more useful training data than fixed-dataset baselines under matched data budgets. At the 0.75M scale on PixArt-alpha, DataEvolver improves OCR-F1 over the strongest baseline by 85.3 percent on TextScenesHQ and 35.3 percent on LongTextBench. The improvements are consistent across both evaluated benchmarks and also transfer to Show-o2, indicating that the benefit of DataEvolver is not tied to a single downstream generator. These results suggest that rejected samples can provide actionable feedback for improving text-rich image data construction.

71. 【2606.31533】MV-GEL: Language-Driven Multi-View Geometric Entity Localization on Meshes

Link: https://arxiv.org/abs/2606.31533

Authors: Kartik Bali, Roland Aydin

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Identifying and grounding, grounding precise geometric, Vision Language Models, planar regions, robotic manipulation

Comments:

Click to view abstract

Abstract:Identifying and grounding precise geometric entities, such as edges, planar regions, and curved surfaces within 3D objects, is foundational to computer-aided design (CAD), robotic manipulation, and scientific simulation. Although modern Vision Language Models (VLMs) have advanced referring segmentation (RIS) in the image domain, extending such language-driven localization to structured 3D geometry is substantially harder. The 3D object appearance is highly sensitive to viewpoints; a single perspective may render a target entity clearly observable, while another may suffer from severe occlusion or foreshortening. In this work, we attempt to solve these challenges with MV-GEL (Multi-View Geometric Entity Localization), a framework for localizing fine-grained geometric entities on polygon meshes from natural language queries. Our key insight is that reliable CAD entity (i.e., faces, edges or solids) localization depends on selecting views that make the queried entity maximally interpretable. We introduce GELviews, a prompt-conditioned ranking module that prioritizes viewpoints based on language prompted observability of geometric CAD entities. Selected views are processed by a VLM-based reasoning segmentation backbone, and predicted masks are lifted to the corresponding meshes via geometry-aware ray casting. Our framework is completely CAD agnostic and relies only on 3D meshes. Experiments show up to a 1.7X improvement in face-level IoU and over 4.5X gains in edge-level F1 compared to vanilla baselines, substantially outperforming CLIP-based and random view sampling, particularly for thin and view-sensitive entities. The dataset, code and trained checkpoints are available at this https URL.

72. 【2606.31517】Unsupervised Data-Efficient Cross-Modal Retrieval with Global-Neighborhood Alignment Hashing

Link: https://arxiv.org/abs/2606.31517

Authors: Runhao Li, Xiaoxu Ma, Zhenyu Weng, Yue Zhang, Guibo Luo, Huiping Zhuang, Zhiping Lin, Yap-Peng Tan

Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Compared to supervised, unsupervised CMH reduces, unlabeled image-text pairs, image-text pairs, unsupervised CMH

Comments:

Click to view abstract

73. 【2606.31513】PRISM: Latent Composition Consistency for Single-Image Reflection Removal

Link: https://arxiv.org/abs/2606.31513

Authors: Junseong Shin, Tae Hyun Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Single-image reflection removal, severely ill-posed problem, Single-image reflection, severely ill-posed, VAE latent spaces

Comments:

Click to view abstract

Abstract:Single-image reflection removal (SIRR) seeks to recover the transmission layer from a mixture corrupted by reflections -- a severely ill-posed problem. Existing methods operate in pixel space, where the nonlinear sRGB formation model entangles the two layers and limits generalization. We observe that pretrained VAE latent spaces exhibit substantially lower coherence between image layers compared to pixel space, providing a more favorable working space for decomposition. Building on this finding, we propose PRISM (Pretrained-latent Reflection Image Separation Model), which reinterprets SIRR as a latent linear separation problem. Under an approximate additive formulation in latent space, PRISM learns a flow matching velocity field on a pretrained FLUX backbone that recovers both transmission and reflection in a single forward pass. To enforce robust disentanglement, we introduce a Latent Composition Consistency (LCC) strategy that constructs synthetic mixtures by swapping reflection latents across samples and enforces consistent decomposition via a cycle loss. We further propose a Layer Contrastive Separation (LCS) loss that promotes semantic separation between layers through patch-level contrastive learning, without requiring explicit reflection targets. Experiments on six benchmarks demonstrate that PRISM consistently outperforms state-of-the-art methods by significant margins, with strong generalization to in-the-wild images.

74. 【2606.31504】SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search

Link: https://arxiv.org/abs/2606.31504

Authors: Ming Dai, Zhihong Lu, Jinjie Gu, Jiedong Zhuang, Yefeng Liu, Wankou Yang, Jian Wang, Chunhua Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal agentic search, Factorized Adaptive Rollout, practical framework, framework for multimodal, Factorized Adaptive

Comments: Technical Report

Click to view abstract

Abstract:We present SimpleSearch-VL, an efficient, reliable, and practical framework for multimodal agentic search. Its core idea is to improve the agent's own search-and-verification process rather than scaling data, tools, or auxiliary model components. For efficiency, Factorized Adaptive Rollout (FAR) improves sampling efficiency by forming more informative training groups while using redundant samples to mitigate long-tail latency and expose hard samples. For reliability, SimpleSearch-VL performs evidence-verified reasoning, explicitly using chain-of-thought verification to assess the relevance of retrieved visual and textual cues to the original context. For practicality, SimpleSearch-VL keeps a lightweight tool interface and performs webpage self-summary within the agent, requiring no additional external dependencies. With only 5K supervised tool-interleaved trajectories and 2K RL data, SimpleSearch-VL improves Qwen3-VL agentic baselines by 15.8 and 16.0 average points for the 8B and 30B-A3B variants, respectively. The SimpleSearch-VL-30B-A3B model further achieves performance competitive with agentic Gemini-3-Pro.

75. 【2606.31502】Fully Automated High-Precision Segmentation of Retinal Atrophy and Ellipsoid Zone Thickness in OCT: A Reliable Tool for Real-World GA Monitoring

Link: https://arxiv.org/abs/2606.31502

Authors: Wolf-Dieter Vogl, Hlynur Skulason, Oliver Leingang, Ursula Schmidt-Erfurth, Amir Sadeghipour, Ariadne Whitby

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires precise monitoring, relevant structural biomarkers, Geographic atrophy, assess disease stage, Dice RPE loss

Comments: 31 pages, 6 tables, 7 figures, including 3 supplemental figures and 2 supplemental tables

Click to view abstract

Abstract:Geographic atrophy (GA) secondary to age-related macular degeneration (AMD) requires precise monitoring of relevant structural biomarkers to assess disease stage, progression, and treatment response. This paper presents a fully automated, deep learning-based framework for the high-precision, pixel-wise segmentation of key biomarkers in optical coherence tomography (OCT) imaging: retinal pigment epithelium (RPE) loss, ellipsoid zone (EZ) loss, and EZ thinning. The proposed pipeline uses three specialized semantic segmentation models to delineate RPE loss, EZ boundaries (including interruptions), and Bruch's membrane. To ensure robustness and generalizability, the models were developed on a diverse dataset of 298 SD-OCT volumes representing the full phenotypic spectrum of AMD (GA: 222, intermediate AMD: 40, neovascular AMD: 17, healthy: 19) and validated on an independent external dataset (n=43). The comprehensive evaluation was further strengthened using additional datasets to assess repeatability, inter-reader reliability, the impact of B-scan density on measurement accuracy, and subgroup performance stratified by lesion size. Results demonstrated high segmentation accuracy (Dice RPE loss: 0.88, Dice EZ loss: 0.87, Pearson's r 0.99). Total EZ thickness measurements exhibited a sub-pixel average deviation of 2.15 μm, and segmentation reliability was confirmed by a strong reproducibility score (ICC 0.98). By accurately and consistently quantifying outer photoreceptor degeneration and RPE loss, this fully automated framework provides a highly reliable tool for GA assessment in both clinical trials and routine real-world ophthalmic care.

76. 【2606.31496】HVPNet: A Bio-Inspired Network for General Salient and Camouflaged Object Detection

Link: https://arxiv.org/abs/2606.31496

Authors: Jiawei Xu, Qiangqiang Zhou, Zhouping Li, Yanjiao Shi, Yugen Yi, Jiacong Yu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cross-modal feature fusion, recent years, typically aims, camouflaged object detection, object detection

Comments:

Click to view abstract

Abstract:In recent years, most research on multimodal salient object detection (SOD) and camouflaged object detection (COD) typically aims to improve performance through complex cross-modal feature fusion and decoding structures. However, this approach leads to an excessively large model parameter scale and often fails to deliver satisfactory detection performance due to structural redundancy. In contrast, the human visual process is able to efficiently perform salient and camouflaged object identification without such complex structures. This contrast raises an important question: Can we draw conceptual inspiration from the human visual process to achieve a simpler modeling strategy, and still realize accurate and efficient object detection? To answer this question, we propose HVPNet, a simple yet general bio-inspired computational architecture. Drawing on the multi-layered information integration of the retina as a conceptual metaphor, we designed a Retinal Integration Module (RIM), which effectively integrates multimodal features through a level-specific multi-stage integration strategy. To fully exploit these features, we further design a cortical decoder (CD) that breaks down the decoding process into low- and high-level visual stages, abstracting the hierarchical processing in the human visual cortex. Benefiting from these designs, HVPNet can readily extend to seven tasks across four modalities. Without bells and whistles, it establishes an excellent accuracy-efficiency trade-off across 22 datasets spanning these seven tasks. Our code is available at this https URL.

77. 【2606.31488】DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation

Link: https://arxiv.org/abs/2606.31488

Authors: Chi Huang, Wenhao Zhang, Hang Yin, YuAn Wang, Hao Li, Bosheng Wang, Xun Sun, Liang Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving faces, models deliver pixel-aligned, deliver pixel-aligned dense, geometry-scale conflict, pixel-aligned dense visual

Comments:

Click to view abstract

Abstract:Dense depth estimation for autonomous driving faces a geometry-scale conflict: depth foundation models deliver pixel-aligned dense visual geometry without reliable metric scale, while projected LiDAR provides metric anchors that are sparse, noisy, and misaligned with image structures. Existing sparse-prompted methods incorporate LiDAR by regenerating depth from scratch, overriding the foundation model's coherent geometry and producing structural artifacts on visually continuous surfaces. Our key insight is that foundation models already capture geometrically coherent relative depth; no additional surface structure learning is required-only a per-pixel scale factor mapping relative geometry to metric coordinates. Based on this, we propose DrivingDepth, which treats sparse LiDAR as geometric prompts that locally calibrate a frozen foundation prior through residual pixel-wise scale correction, preserving dense visual geometry by construction. On nuScenes with 4-frame surround-view input, DrivingDepth achieves an AbsRel of 11.19 and an EdgeCR of 5.741, outperforming MapAnything (11.99/1.914) by simultaneously delivering SOTA metric accuracy and geometric consistency.
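
The core idea of mapping relative depth to metric depth through a per-pixel scale factor can be sketched as follows. This is a hedged illustration rather than the paper's architecture: the median-ratio global alignment, the multiplicative `1 + residual` form, and all names are assumptions, and the residual is set by hand here where a learned model would predict it.

```python
from statistics import median

def global_scale(rel_depth, lidar):
    """Median ratio of LiDAR depth to relative depth over pixels with a return.
    `lidar` maps pixel index -> metric depth in metres (sparse)."""
    return median(lidar[i] / rel_depth[i] for i in lidar)

def metric_depth(rel_depth, scale, residual):
    """Per-pixel scale = global scale * (1 + residual); the surface structure
    comes only from the frozen relative depth."""
    return [d * scale * (1.0 + r) for d, r in zip(rel_depth, residual)]

# Toy 4-pixel example with hypothetical values.
rel_depth = [1.0, 2.0, 4.0, 8.0]        # output of a frozen depth model
lidar = {0: 2.0, 2: 8.0, 3: 20.0}       # sparse metric anchors
scale = global_scale(rel_depth, lidar)
residual = [0.0, 0.0, 0.0, 0.25]        # local correction, hand-set here
depth_m = metric_depth(rel_depth, scale, residual)
```

Because the output is always the relative depth multiplied by a positive factor, the foundation model's geometry is rescaled rather than regenerated, which is the property the abstract describes as preservation by construction.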

78. 【2606.31478】One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution

Link: https://arxiv.org/abs/2606.31478

Authors: Jie Ma, Binfei Chu, Jie Gao, Jinlu Zhang, Yiwei Ma, Yi Tan, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: run experiments, experiments fail, draft hypotheses, Autonomous research agents, brittle when experiments

Comments:

Click to view abstract

Abstract:Autonomous research agents can now draft hypotheses, write code, run experiments, and produce papers, but they remain brittle when experiments fail. Under the prevailing paradigm, failure recovery is usually delegated to a single free-form reflection: a rich trajectory of metrics, logs, and design choices is compressed into one verbal critique, which often leads either to localized trial-and-error or to hard pivots that discard useful context. We propose SAGE, a Self-correcting, Autonomous, Grounded Experimenter, to tackle this failure-recovery bottleneck. Its core mechanism, Multi-Hypothesis Failure Attribution (MHFA), treats recovery as a structured causal diagnosis. By analyzing dynamic trajectory features, MHFA systematically generates multiple evidence-grounded explanations for a failure, independently evaluates their severity, and deterministically routes the verified root cause to the correct intervention level (hypothesis, experimental design, or implementation). To guarantee scientific honesty, SAGE further employs a grounded reporting mechanism that explicitly constrains drafted results to actual measured values, redacting hallucinated numbers. On a 12-topic, 5-domain benchmark, SAGE increases metrics-bearing outputs from 42% to 92% over a reflection baseline, improves artifact quality from 5.00 to 6.75/10, and blindly outscores AI-Scientist-v2 (52.0 vs. 48.2), with gains concentrated in code development and execution. While fully autonomous scientific writing and generating conference-ready papers remain notoriously difficult open problems for the entire field, SAGE successfully produces significantly more reliable and higher-quality scientific artifacts. Ultimately, by coupling structured recovery with explicit grounding constraints, SAGE significantly outperforms monolithic reflection paradigms, establishing a highly trustworthy foundation for future autonomous research.

79. 【2606.31471】Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs

Link: https://arxiv.org/abs/2606.31471

Authors: Deniz Bickici, Michael Pabst, Shohei Mori, Dieter Schmalstieg

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: methods typically operate, vision-language models, typically operate, graph methods typically, graph

Comments: Accepted to ECCV 2026. Project page: https://denizbickici.github.io/thinkgraphs/

Click to view abstract

Abstract:Open-vocabulary 3D scene graph methods typically operate in two stages: first reconstruct, then enrich with vision-language models, leaving the graph unqueryable during exploration. We argue that this sequential coupling is unnecessary and propose an asynchronous architecture in which lightweight online mapping runs concurrently with heavyweight semantic refinement. A probabilistic voxel-based backbone maintains stable object identities incrementally, while background VLM agents progressively enrich the graph. This framework resolves duplicate object tracks through semantic loop closure, attaches fine-grained visual attributes and derives spatial relations between objects. A multi-target frame scheduler amortizes VLM cost by selecting a small set of informative frames that jointly cover multiple targets. The resulting scene graph is queryable during exploration and grows in semantic richness over time. Our method matches or outperforms existing open-vocabulary 3D scene graph methods on semantic segmentation (ScanNet, Replica) and surpasses the prior state-of-the-art across three visual grounding benchmarks (Sr3D+, Nr3D, ScanRefer) by 15.3 to 18.8 A@0.25. Project page: this https URL

80. 【2606.31467】AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience

Link: https://arxiv.org/abs/2606.31467

Authors: Wenyi Zhang, Fanglong Yao, Youzhi Liu, Peng Hu, Zhengqiu Zhu, Chen Gao, Xian Sun, Kun Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned Aerial Vehicles, enabling Unmanned Aerial, Aerial Vehicles, Unmanned Aerial, aerospace embodied intelligence

Comments: 21 pages, 10 figures and 8 tables

Click to view abstract

Abstract:With the rapid advancement of aerospace embodied intelligence, enabling Unmanned Aerial Vehicles (UAVs) to autonomously understand and reason about complex environments has become increasingly important. However, existing UAV-based spatial reasoning approaches face critical limitations: single-view perception renders them vulnerable to occlusions and perspective distortions, while most VLMs lack explicit geometric modeling, relying on semantic cues and yielding inconsistent reasoning under viewpoint and scale variations. To address these challenges, we propose SatAgent, a UAV-Satellite collaborative spatial reasoning model inspired by the dual-pathway mechanism of the human visual system. By jointly leveraging satellite and UAV perspectives, SatAgent enables robust, accurate reasoning in complex urban environments. We first introduce a Geometric-Aware 3D Reconstruction Encoder that elevates 2D UAV features into explicit 3D spatial representations. Next, we design a multi-view topology-semantic alignment module integrating cross-view features within a unified BEV coordinate system. We further introduce a multi-view consistency loss encouraging viewpoint-invariant representations. Finally, we construct SatAgent-SR130K, the first large-scale UAV-Satellite collaborative multi-view spatial reasoning dataset. Experiments show SatAgent outperforms state-of-the-art general-purpose foundation models and specialized spatial reasoning models by 25.91% and 11.69%, respectively, across diverse tasks, achieving particularly high accuracy in complex geometric relationship reasoning.

81. 【2606.31454】Towards a foundational model for recognising diastematic Gregorian notation

Link: https://arxiv.org/abs/2606.31454

Authors: Daniel Kurek, Jan Hajič jr

Subjects: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Optical recognition, Gregorian notation, recently been attempted, Optical, datasets introduced

Comments:

Click to view abstract

Abstract:Optical recognition of Gregorian notation has recently been attempted with end-to-end methods, with four datasets introduced. However, each of these datasets is in a different encoding. We design a common encoding based on the S-GABC proposal, convert all four datasets to this common encoding, and train a shared end-to-end foundational model for diastematic Gregorian notation that establishes a new state of the art across all four datasets.

82. 【2606.31446】Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

Link: https://arxiv.org/abs/2606.31446

Authors: Stefan Larson,Attila Nagy,Sam Desai,Cyrus Desai,Nicole C. Lima,Yixin Yuan,Siddharth Betala,Kaushal K. Prajapati,Jamiu T. Suleiman,Sharad Duwal,Kevin Leach

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: benchmarking document classifiers, RVL-CDIP, test-train overlap, popular dataset, document classifiers

Comments: DocEng 2026

83. 【2606.31444】Temporal Training Strategies for Left Atrium and Left Atrial Appendage Segmentation in Dynamic Contrast 4DCT

Link: https://arxiv.org/abs/2606.31444

Authors: David Montalvo-García,Lauren Severance,Elliot R. McVeigh,María J. Ledesma-Carbayo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: left atrial appendage, assessing blood stasis, enables time-resolved analysis, Dynamic contrast-enhanced cardiac, left atrial

Comments: Accepted at CinC 2026

Abstract:Dynamic contrast-enhanced cardiac CT enables time-resolved analysis of contrast filling and washout in the left atrium (LA) and left atrial appendage (LAA), with potential applications for assessing blood stasis in atrial fibrillation (AF). Accurate segmentation across all frames is required for such analysis but is challenging due to large temporal contrast variations and the use of a single annotation per registered sequence. This creates a trade-off between training for robustness and limiting label noise. In this study, we investigate how temporal training-set design affects nnUNet-based segmentation of the LA and LAA in dynamic 4DCT. We compare training using a minimal two-frame dataset reflecting standard clinical practice, a physiologically selected subset of frames, and the full 27-frame sequence. We further evaluate the impact of foreground-based normalization. Training with all frames yielded the best performance in early low-contrast phases. However, the physiologically selected subset achieved comparable performance from the filling phase onward. Applying normalization parameters derived from the full dataset improved performance of reduced datasets in low-contrast frames, but did not fully close the gap. These findings highlight the importance of temporal diversity in training data for robust segmentation in dynamic CT, while indicating that carefully selected frame subsets may provide an effective trade-off between performance and efficiency for downstream applications.

84. 【2606.31427】No Prompt, No Leaks: A Robust Generative Steganography Framework via Prompt-Free Diffusion

Link: https://arxiv.org/abs/2606.31427

Authors: Jingwen Cai,Fen Xiao,Shuhua Deng,Xieping Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Generative image steganography, Generative image, image steganography synthesizes, steganography synthesizes stego, image steganography framework

Comments:

Abstract:Generative image steganography synthesizes stego images directly from secret information to achieve inherent security advantages. Latent Diffusion Models (LDMs) have recently emerged as a fundamental image steganography framework that modulates secret latent representations with text prompts. Limited by the inflexibility of text prompts, these methods still struggle to generate high-quality stego images and accurately recover secret images. In this work, we propose a prompt-free diffusion image steganography framework that integrates style semantic priors to control more robust and reliable stego image generation. Specifically, a Cascaded Affine Coupling Module (CACM) establishes a bijective, deterministic mapping between a secret image and its latent representation. Then, style semantics are integrated into the diffusion process to control latent representation and ensure visual imperceptibility in the generated stego images. To mitigate trajectory deviations stemming from the unconditioned reverse process, a predictor-corrector mechanism is introduced to iteratively refine the generation trajectory via feedback from the current and predicted next states. Extensive experimental results show that the proposed method achieves competitive performance compared to state-of-the-art methods in terms of security, secret image reconstruction accuracy and controllability.

85. 【2606.31421】Temporal Preservation over Processing: Diagnosing and Designing Spatiotemporal Single-Stage Video Detectors

Link: https://arxiv.org/abs/2606.31421

Authors: Karam Tomotaki-Dawoud,Anna Hilsmann,Peter Eisert,Sebastian Bosse

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Single-stage video object, single informative frame-a, informative frame-a gap, frame-a gap hidden, Single-stage video

Comments:

Abstract:Single-stage video object detectors are increasingly deployed in time-critical applications, yet it remains unclear whether these models genuinely reason over temporal context or merely exploit a single informative frame, a gap hidden by standard metrics, which reward correct predictions regardless of how they are reached. We address this from two complementary directions: first, we propose TemporalLens, a model-agnostic diagnostic framework probing temporal dependence through controlled perturbations, structured occlusions, temporal shuffling, redundancy injection, and resolution degradation, revealing whether a detector actually uses information across time. Applied to stacked-frame 2D detectors and our YOLO-3D architecture, it exposes behavioural differences invisible to mAP: stacked 2D models collapse when the target frame is removed, while spatiotemporal models recover predictions from earlier frames, a signature of real temporal reliance. Second, we detail YOLO-3D, a modular real-time spatiotemporal detector built on YOLOv8, and show that simply preserving temporal depth through the backbone is the dominant performance driver (+3.7 pp mAP@50 at 32 frames averaged across scales). Together, the diagnostics and architecture turn "does this detector reason over time?" into a measurable, actionable question.
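The perturbation-based probing described in this abstract can be illustrated with a small sketch. This is not TemporalLens itself: the function names (`temporal_shuffle`, `blank_target_frame`, `relative_drop`) and the two toy scorers are hypothetical, and a real diagnostic would score actual detector outputs (for example with mAP) rather than the stand-ins used here.

```python
import random
from typing import Callable, List, Optional, Sequence

Frame = Optional[str]  # stand-in for an image; None marks a removed frame


def temporal_shuffle(clip: Sequence[Frame], seed: int = 0) -> List[Frame]:
    """Return a copy of the clip with its frame order randomly permuted."""
    rng = random.Random(seed)
    out = list(clip)
    rng.shuffle(out)
    return out


def blank_target_frame(clip: Sequence[Frame], target_index: int = -1) -> List[Frame]:
    """Return a copy of the clip with the target frame removed (set to None)."""
    out = list(clip)
    out[target_index] = None
    return out


def relative_drop(
    score_fn: Callable[[Sequence[Frame]], float],
    clip: Sequence[Frame],
    perturb: Callable[[Sequence[Frame]], List[Frame]],
) -> float:
    """Fraction of the unperturbed score that is lost after the perturbation."""
    base = score_fn(clip)
    if base == 0:
        return 0.0
    return (base - score_fn(perturb(clip))) / base


# Toy scorers (assumptions): one looks only at the last frame,
# the other spreads its reliance evenly over all frames.
def last_frame_only(clip: Sequence[Frame]) -> float:
    return 1.0 if clip[-1] is not None else 0.0


def uses_all_frames(clip: Sequence[Frame]) -> float:
    return sum(f is not None for f in clip) / len(clip)
```

With a four-frame clip, removing the target frame wipes out the score of `last_frame_only` entirely, while `uses_all_frames` loses only a quarter of its score. That mirrors the qualitative contrast the abstract draws between stacked 2D and spatiotemporal detectors.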

86. 【2606.31407】Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

Link: https://arxiv.org/abs/2606.31407

Authors: Ta Duc Huy,Trang Nguyen,Townim Chowdhury,Ankit Yadav,Minh-Son To,Zhibin Liao,Johan W. Verjans,Vu Minh Hieu Phan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: visually ambiguous inputs, produce confident answers, biased predictions, produce confident, visually ambiguous

Comments: Accepted at ECCV2026

87. 【2606.31394】Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images

Link: https://arxiv.org/abs/2606.31394

Authors: Jisung Park,Seohyeon Kang,Daeun Yoo,Eunsu Lee,Seoin Cho,Wooyeop Choi,Ian Choi,James R. Evan,Daesoo Kim,Sonia Gandhi,Minee L. Choi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: solve biological challenges, Artificial intelligence, intelligence is transforming, transforming our capability, capability to solve

Comments: 10 pages, 7 figures (plus 14 in appendix), 1 table, NeurIPS 2026 preprint

Abstract:Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, neural networks force distinct concepts into shared lower dimensions, a phenomenon known as superposition. Although this superposition is widely known to hinder interpretability, its impact on corrupting the geometry of latent spaces remains critically overlooked. Here, we utilized sparse autoencoders (SAEs) trained on over 100,000 multiplexed images of patient-derived Parkinson's disease and healthy neurons to resolve superposition. This approach bypasses the mathematical non-uniqueness of feature attribution by shifting to interpretable latent representation analysis. We theoretically and empirically demonstrate that superposition contaminates representational metric spaces, and that SAEs successfully recover geometric fidelity. By treating these geometrically purified representations as single-cell state vectors, we adapted single-cell RNA sequencing (scRNA-seq) data analysis methodologies directly to the image domain. Finally, we introduce GW-map, utilizing Gromov-Wasserstein optimal transport to align these image representations with authentic scRNA-seq data de novo. This coupling reconstructs hierarchical neuronal pathology pathways such as the Calcium-AIS scaffold, without reference spatial transcriptomics, establishing a scalable foundation for spatial biology. Code is available at this https URL

88. 【2606.31388】One Video, One World: Turning Monocular Video into Physical 4D Scenes

Link: https://arxiv.org/abs/2606.31388

Authors: Junhao Chen,Boran Zhang,Mingjin Chen,Henghaofan Zhang,Saining Zhang,Congcong Zhu,Hao Zhao,Ruqi Huang,Zhihao Li,Yufei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: system that reconstructs, single monocular video, training-free system, downstream physics simulation, watertight mesh scenes

Comments: Accepted by ECCV 2026. Project Page: [this https URL](https://OneVideoOneWorld.github.io/)

Abstract:We introduce OVOW, the first training-free system that reconstructs instance-level, simulation-ready 4D mesh scenes from a single monocular video. Recent 4D reconstruction achieves impressive rendering quality, but its outputs (e.g., implicit fields, Gaussian primitives, or point clouds) lack the watertight topology, instance separation, and standardized physical interfaces required by physics simulators and embodied AI. OVOW closes this gap with a four-stage pipeline: a vision-language model discovers, labels, and motion-classifies all instances; category-aware reconstruction yields per-instance meshes for rigid objects and topology-consistent mesh sequences for deformable ones; an iterative render-match-optimize procedure recovers metric scale and 6-DoF pose trajectories; and physics-grounded assembly enforces ground contact and inter-object support. Crucially, we model all motion, rigid and non-rigid, through direct vertex deformation without category-specific priors or skeleton rigging, producing watertight mesh scenes ready for downstream physics simulation and editing. We further establish the first benchmark for structured Video-to-4D evaluation, with metrics for geometric correctness, instance separation, and physical plausibility beyond visual fidelity; the same pipeline doubles as a scalable engine for synthesizing paired video-to-4D simulation data for future 4D world models and embodied AI. Across two synthetic benchmarks (static and 4D), OVOW attains the best overall layout and geometry accuracy and the lowest photometric and semantic error among all baselines, and on monocular video runs one to two orders of magnitude faster than the baselines, while downstream physics simulation confirms its physical stability.

89. 【2606.31383】MS-Resampler: Multi-Scope Visual Resampling for Efficient Multimodal LLMs

Link: https://arxiv.org/abs/2606.31383

Authors: Zhongyang Li,Yaqian Li,Faming Fang,Rinyoichi Takezoe,Zi-Hao Bo,Cheng Qian,Mo Guang,Guixu Zhang,Kaiwen Long

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: typically employ resampling-based, employ resampling-based projectors, compact token sequence, large language models, typically employ

Comments:

Abstract:Multimodal large language models (MLLMs) typically employ resampling-based projectors to transform dense visual features into a compact token sequence for language modeling. Most existing resamplers adopt a single, fixed aggregation scope via global cross-attention, which can blur fine-grained local evidence and limit the ability to capture both local details and global context within a fixed token budget. In this work, we propose MS-Resampler, a multi-scope visual resampling framework for MLLMs. MS-Resampler instantiates multiple scope-specific resamplers by injecting explicit spatial scope priors into the resampling attention, enabling each branch to aggregate visual information at a particular granularity from local to global. The outputs of these scope-specific resamplers are then adaptively fused to produce the final visual representations for language modeling. Extensive experiments on ten public multimodal benchmarks show that MS-Resampler consistently improves visual understanding and multimodal reasoning over conventional single-scope resamplers, while introducing only minimal computational overhead.

90. 【2606.31378】MAPE: Defending Against Transferable Adversarial Attacks Using Multi-Source Adversarial Perturbations Elimination

Link: https://arxiv.org/abs/2606.31378

Authors: Xinlei Liu,Jichao Xie,Tao Hu,Peng Yi,Yuxiang Hu,Shumin Huo,Zhen Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image classification tasks, Neural networks, meticulously crafted adversarial, adversarial, adversarial perturbations

Comments: 18 pages

Abstract:Neural networks are vulnerable to meticulously crafted adversarial examples, leading to high-confidence misclassifications in image classification tasks. Due to their consistency with regular input patterns and the absence of reliance on the target model and its output information, transferable adversarial attacks exhibit a notably high stealthiness and detection difficulty, making them a significant focus of defense. In this work, we propose a deep learning defense known as multi-source adversarial perturbations elimination (MAPE) to counter diverse transferable attacks. MAPE comprises the single-source adversarial perturbation elimination (SAPE) mechanism and the pre-trained models probabilistic scheduling algorithm (PPSA). SAPE utilizes a thoughtfully designed channel-attention U-Net as the defense model and employs adversarial examples generated by a pre-trained model (e.g., ResNet) for its training, thereby enabling the elimination of known adversarial perturbations. PPSA introduces model difference quantification and negative momentum to strategically schedule multiple pre-trained models, thereby maximizing the differences among adversarial examples during the defense model's training and enhancing its robustness in eliminating adversarial perturbations. MAPE effectively eliminates adversarial perturbations in various adversarial examples, providing a robust defense against attacks from different substitute models. In a black-box attack scenario utilizing ResNet-34 as the target model, our approach achieves average defense rates of over 95.1% on CIFAR-10 and over 71.5% on Mini-ImageNet, demonstrating state-of-the-art performance.

91. 【2606.31373】Domain Adaptive Object Detection via Dual-Stream Bilevel-Cycle Optimization

Link: https://arxiv.org/abs/2606.31373

Authors: Yannan Chen,Wenqiang Wang,Ruoyu Chen,Jiancheng Wang,Mingbo Yang,Yaowei Wang,Wei Wang,Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cycle self-training, shared classifier assumption, unsupervised domain adaptation, unlabeled target data, exploits unlabeled target

Comments:

Abstract:Cycle self-training (CST) breaks the shared classifier assumption of the standard self-training framework, which is effective for unsupervised domain adaptation and exploits unlabeled target data by training with target pseudo-labels. CST introduces a target classifier and employs an inner-outer loop updating strategy, addressing the issue of unreliable pseudo-labels and enabling pseudo-labels to generalize across domains. Despite its success in image classification, extending CST to object detection faces three main challenges. First, the upper bound of CST in object detection is constrained by three types of unreliable pseudo-labels, namely classification error alone, localization error alone, and their combination. Second, since object detection involves detecting multiple target objects, directly applying CST leads to training instability. Third, a wider numerical range of regression coordinates leads to exploding losses. To this end, we apply CST to both classification and regression and propose the Dual-Stream Bilevel-Cycle Optimization framework. Specifically, we construct CST upon Mean Teacher to prevent training instability and use extra normalization to map the regression bounding box into a standardized space, effectively addressing exploding losses. Also, we provide a theoretical derivation of the regression bound. Extensive experiments across four standard cross-domain scenarios demonstrate that our framework achieves considerable results.

92. 【2606.31367】Evidence Triangulation for Multimodal Fact-Checking in the Wild

Link: https://arxiv.org/abs/2606.31367

Authors: Stefanos-Iordanis Papadopoulos,Zacharias Chrysidis,Christos Koutlis,Symeon Papadopoulos,Panagiotis C. Petrantonakis

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: reinforce false claims, fueled multimodal misinformation, false claims, proliferation of multimedia, multimedia content

Comments:

Abstract:The proliferation of multimedia content on social platforms has fueled multimodal misinformation, where images are used to reinforce false claims. Consequently, Multimodal Fact-Checking (MFC) has emerged as an increasingly important research area. However, current progress is hindered by a reliance on synthetic training data and curated benchmarks that fail to capture the complexity of in-the-wild data. Furthermore, existing detection models rely on restricted intra-modality consistency or unconstrained all-to-all fusion, failing to capture nuanced relations between posts and external evidence. To address these limitations, we introduce X-POSE, a benchmark of real-world, community-annotated multimodal posts from X (formerly Twitter), augmented with full-length news articles retrieved via VLM-optimized search. Additionally, we propose TRENT, a novel MFC model that performs evidence triangulation using three parallel cross-attention streams alongside a relational fusion mechanism that explicitly models entailment and contradiction. Extensive evaluations demonstrate that TRENT consistently outperforms state-of-the-art specialized models and commercial VLMs. The code, prompt templates, and dataset are available at this https URL

93. 【2606.31363】Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

Link: https://arxiv.org/abs/2606.31363

Authors: Joonkyu Park,Kyoung Mu Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image super-resolution aims, reconstruct high-resolution, super-resolution aims, aims to reconstruct, Single image super-resolution

Comments: 19 pages

Abstract:Single image super-resolution aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Training SR models typically requires paired HR-LR data, which is difficult to obtain in reality. As a result, most methods synthesize LR images by artificially degrading HR images with handcrafted kernels or camera ISP adjustments. However, these synthetic degradations fail to capture the complexity of real LR images, leading to poor generalization in practice. To address this, we observe that even within a single high-quality image, regions at different depths exhibit varying resolutions, where distant regions act as LR patches and closer ones as HR patches. This allows the extraction of real, degradation-induced LR patches from real images. Since these LR patches lack paired HR counterparts, we propose LA-SR (Language Assistant for SR), a novel framework for unpaired SR. The key idea of LA-SR is to redefine unpaired SR in the language space, using vision-language models to bridge the LR-HR gap. LA-SR projects images into a semantically rich space representing both content and quality, and applies two language-guided losses: linguistic content loss to preserve semantic fidelity, and linguistic quality loss to enhance perceptual realism. With this alignment, LA-SR effectively super-resolves real LR inputs, producing realistic outputs that overcome the limitations of synthetic-data-trained methods.

94. 【2606.31353】RCL-Mamba: A Dual-domain State Space Model for Measurement-oriented Image Restoration in Rotational Sparse-View Scanning Computed Laminography

Link: https://arxiv.org/abs/2606.31353

Authors: Xuyang Duan,Genyuan Zhang,Zhenjiang Dong,Chuandong Tan,Zihao Wang,Junyao Wang,Fenglin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scanning Computed Laminography, Rotational Scanning Computed, Computed Laminography, large planar components, Non-Destructive Testing

Comments:

Abstract:Rotational Scanning Computed Laminography (RCL) is widely utilized for the Non-Destructive Testing (NDT) of large planar components. However, to facilitate rapid inspection, continuous sparse-view scanning is often employed, where the angular integration effect during exposure induces rotational blur in the projection domain. Furthermore, the data incompleteness inherent in sparse sampling manifests as sparse artifacts in the reconstructed image domain. To address these cross-domain degradations, this paper proposes RCL-Mamba, a measurement-oriented dual-domain State Space Model (SSM)-based image restoration network. The framework adopts a cascaded joint processing strategy: it first corrects the rotational blur in the projection domain and subsequently suppresses the sparse artifacts in the image domain. Additionally, we design a Mamba-CNN dual-branch module to adaptively balance large-scale blur correction with local detail recovery. Evaluations on both simulated datasets and real-world Printed Circuit Board (PCB) scans demonstrate that RCL-Mamba outperforms existing baselines in blur removal, artifact suppression, and structural preservation. Line-profile-based structural measurement further verifies that the proposed method better preserves via/pad boundaries and slender trace profiles. Crucially, by reducing the required scanning views from 512 to 64, our method enhances inspection efficiency by approximately 8-fold without compromising reconstruction quality, offering a robust measurement-oriented restoration solution for high-throughput RCL inspection with improved structural measurement fidelity.

95. 【2606.31348】Patient-Level Elbow Abnormality Detection: Leakage-Aware Evaluation of Learned Preprocessing, Calibration, and Triage-Oriented Operating Points

Link: https://arxiv.org/abs/2606.31348

Authors: Ahmed Sallam,Ahmet Kaplan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: triage-oriented orthopedic abnormality, MURA dataset, orthopedic abnormality detection, abnormality detection task, radiographs from MURA

Comments: Conference paper

Abstract:In this study, we examine learned preprocessing pipelines in the context of a triage-oriented orthopedic abnormality detection task using elbow radiographs from the MURA dataset. The evaluation focuses on patient-level detection of musculoskeletal abnormalities under a leakage-aware protocol. We compare multiple preprocessing pipelines, with and without a lightweight DnCNN module as a learned preprocessing component, to assess their impact on discrimination and calibration. Performance is assessed using discrimination metrics (AUROC, PR-AUC), calibration measures (ECE, Brier score), and validation-selected operating point analysis targeting high specificity. Results show that differences across preprocessing strategies are modest and configuration-dependent, with no consistent discrimination advantage over the raw-input DenseNet121 baseline. The raw and diverse inputs combined with the DnCNN front-end showed reduced ECE and Brier score, while CLAHE combined with DnCNN did not improve calibration. Overall, the results suggest that under patient-level evaluation, preprocessing gains are modest and configuration-dependent; the raw-input DenseNet121 baseline remains competitive throughout, and no tested preprocessing strategy produced a consistent discrimination advantage across all metrics.
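For readers unfamiliar with the two calibration measures named in this abstract, the following is a minimal sketch of how they are commonly computed for a binary classifier. The ECE variant shown (equal-width bins over the predicted positive-class probability, 10 bins by default) is one common choice and an assumption here; the abstract does not state which formulation or bin count the authors used.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE with equal-width bins over the positive-class probability.

    For each non-empty bin, compare the mean predicted probability with the
    observed positive rate, and weight the gap by the bin's share of samples.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece
```

Lower is better for both measures. The Brier score mixes calibration with discrimination, while ECE isolates the gap between stated confidence and observed frequency, which is why the two are often reported together.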

96. 【2606.31326】Bridging Video Understanding and Generation in a Unified Framework

Link: https://arxiv.org/abs/2606.31326

Authors: Yuqi Wang,Runyi Li,Ruoyu Feng,Renjie Chen,Wenfeng Lin,Mingyu Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extensively explored, Recently, video, video understanding, understanding

Comments: technical blog

Abstract:Recently, unified image generation and understanding have been extensively explored. However, extending such unified modeling paradigms to the video domain remains largely underexplored. A central challenge is that video understanding favors compact, discriminative semantic representations, whereas video generation requires dense signals that preserve visual details and temporal coherence. Videos naturally capture both spatial semantics and temporal dynamics, making them a more suitable modality for unified multimodal modeling compared to static images. In this paper, we propose Vega, a unified framework that bridges video understanding and generation. Vega leverages a shared vocabulary to jointly model text and visual representations and employs a hybrid architecture combining autoregressive (AR) prediction with diffusion-based rendering. Specifically, the AR model focuses on predicting semantically meaningful visual tokens for keyframes, providing a structured representation that guides the diffusion module in rendering dense, high-resolution video frames. Extensive experiments demonstrate that Vega achieves strong performance on video generation benchmarks such as VBench and video understanding benchmarks like VideoMME.

97. 【2606.31323】Accelerated Likelihood Maximization for Diffusion-based Versatile Content Generation

Link: https://arxiv.org/abs/2606.31323

Authors: Hyunsoo Lee,Inwoo Hwang,Young Min Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating diverse, partially given inputs, inputs remains, remains a fundamental, fundamental challenge

Comments: ECCV 2026. Project website: [this http URL](http://hleephilip.github.io/ALM)

Abstract:Generating diverse, coherent, and plausible content from partially given inputs remains a fundamental challenge for diffusion models. Existing approaches face clear limitations: training-based approaches offer strong task-specific results but require costly computation, and they generalize poorly across tasks. Training-free approaches offer better efficiency, but they do not explicitly optimize over unobserved variables, leading to globally inconsistent results. To address these limitations, we introduce Accelerated Likelihood Maximization (ALM), a novel training-free sampling strategy integrated into the reverse diffusion process that significantly extends the applicability of diffusion models beyond simple generation tasks. Unlike previous methods that implicitly influence missing regions through pre-generated region constraints, we directly optimize the unobserved region during the sampling process, enabling globally coherent and plausible generation. Furthermore, we incorporate an acceleration strategy that significantly improves computational efficiency without sacrificing performance. Experimental results demonstrate that ALM consistently outperforms state-of-the-art methods in various data domains and tasks, establishing a powerful paradigm for versatile content generation.

98. 【2606.31318】Wavelet-Optimized Pseudo-3D Accelerated Diffusion Model for Truncated Computed Laminography

Link: https://arxiv.org/abs/2606.31318

Authors: Genyuan Zhang,Junyao Wang,Chuandong Tan,Fenglin Liu,Yongning Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Computed Laminography, large plate-shaped objects, plate-shaped objects, key technology, nondestructive testing

Comments: 17 pages, 11 figures, 4 tables. Under review at NDTE International

Abstract:Computed Laminography (CL) is a key technology for the nondestructive testing of large plate-shaped objects. However, field-of-view (FOV) limitations inevitably lead to truncation of projected data, an ill-posed inverse problem that causes severe reconstruction artifacts. Existing deep learning methods typically rely on 2D architectures that lack rigorous data consistency constraints. Furthermore, they conventionally confine artifact removal strictly to the FOV, discarding potentially recoverable information outside it. To overcome these limitations, we first introduce a comprehensive CL FOV analysis, categorizing the space into data-complete, data-incomplete, and data-free regions. By extending our reconstruction target to encompass the data-incomplete region, we significantly expand the effective imaging range and enhance scanning efficiency. To achieve this, we propose a novel wavelet-optimized pseudo-3D accelerated diffusion model for CL truncation reconstruction (CL-DM). Our method utilizes a standard 2D diffusion model for slice aggregation, combined with a 3D model-based iterative reconstruction (MBIR) method to ensure strict data consistency. To mitigate inter-slice discontinuities, we introduce wavelet regularization along the z-direction, paired with a translation-invariant (TI) mechanism and a low-frequency preservation strategy. Finally, we introduce a 3D fast sampling architecture, significantly accelerating inference speed. Extensive simulations and real-world experiments demonstrate that CL-DM is superior in effectively eliminating truncation artifacts and restoring high-fidelity, continuous 3D structures.
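The idea of regularizing along the z-direction with wavelets while preserving low frequencies can be sketched on a single 1D profile (one pixel traced across slices). This is a generic illustration, not CL-DM's implementation: the one-level Haar transform, the soft threshold `t`, and the two-shift cycle spinning used for translation invariance are all assumptions, since the abstract does not specify the wavelet, the thresholds, or the TI mechanism.

```python
import math
from typing import List, Sequence, Tuple

_S = 1.0 / math.sqrt(2.0)


def haar_forward(x: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One-level orthonormal Haar transform of an even-length signal."""
    approx = [(x[i] + x[i + 1]) * _S for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * _S for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx: Sequence[float], detail: Sequence[float]) -> List[float]:
    """Inverse of haar_forward."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


def soft_threshold(v: float, t: float) -> float:
    """Shrink v toward zero by t (the proximal operator of the L1 norm)."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def smooth_along_z(x: Sequence[float], t: float) -> List[float]:
    """Shrink only the detail band; the low-frequency band is left untouched."""
    approx, detail = haar_forward(x)
    return haar_inverse(approx, [soft_threshold(d, t) for d in detail])


def ti_smooth_along_z(x: Sequence[float], t: float) -> List[float]:
    """Average the result over both circular shifts (cycle spinning)."""
    n = len(x)
    acc = [0.0] * n
    for shift in (0, 1):
        shifted = list(x[shift:]) + list(x[:shift])
        smoothed = smooth_along_z(shifted, t)
        restored = smoothed[n - shift:] + smoothed[:n - shift]  # undo the shift
        for i in range(n):
            acc[i] += restored[i] / 2.0
    return acc
```

Because only detail coefficients are shrunk, the sum of the profile is unchanged, which is one simple way to read "low-frequency preservation". Averaging over shifts removes the dependence on where the Haar pairs happen to start.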

99. 【2606.31293】Deep Spectral Models for Robust Dental Shape Generation

Link: https://arxiv.org/abs/2606.31293

Authors: Tibor Kubík,François Guibault,Michal Španěl,Hervé Lombaert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer-aided restoration design, orthodontic planning, Accurate modeling, dental crown morphology, fundamental for diagnosis

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) [this https URL](https://melba-journal.org/2026:016)

Abstract:Accurate modeling of dental crown morphology is fundamental for diagnosis, orthodontic planning, and computer-aided restoration design. However, datasets suitable for training such models are typically limited in size. We present ToothForge, a deep spectral generative framework that models dental crown geometries from compact, intrinsic representations. By operating in the spectral domain, ToothForge learns a latent manifold of 3D tooth shapes through synchronized spectral embeddings, ensuring consistent modeling across samples with varying connectivity. Spectral synchronization mitigates the instability of Laplace-Beltrami eigenbases and enables efficient learning in a low-dimensional space. The framework is thoroughly evaluated through robustness analysis, ablation studies, and benchmarking against PCA-based statistical shape models and point-based generative frameworks. Results show that synchronized spectral modeling achieves reconstruction and generative performance comparable to or exceeding spatial approaches, while maintaining compactness and geometric interpretability. Together, the compact synchronized coefficients and low-dimensional learning space make the framework particularly suitable for limited datasets, as often encountered in dental and medical domains, and applicable in real-world scenarios where guaranteeing consistent connectivity across shapes from various clinics is unrealistic.

100. 【2606.31278】Editing Everything Everywhere All at Once

Link: https://arxiv.org/abs/2606.31278

Authors: Fabio Quattrini,Carmine Zaccagnino,Enis Simsar,Marta Tintoré Gazulla,Rita Cucchiara,Alessio Tonioni,Silvia Cascianelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering improved efficiency, single forward pass, Editing multiple elements, multi-turn image manipulation, Multimodal Diffusion Transformers

Comments: Accepted at ECCV 2026

Abstract:Editing multiple elements of an image in a single forward pass is a practical alternative to multi-turn image manipulation, offering improved efficiency and potentially better harmonization. However, when several instructions target different regions, semantic interference often leads to attribute leakage and poor edit disentanglement, especially as the number of edits increases. In this work, we propose MICE (Multi-Instance Concurrent Editing), a training-free strategy for scalable multi-instance image editing with Multimodal Diffusion Transformers. MICE modifies the additive bias of joint attention to regulate interactions between instance-specific edit instructions, latent, and context tokens identified via user-provided segmentation masks. Specifically, MICE allows intra-instance attention, penalizes interactions between neighboring region tokens, and suppresses unrelated cross-instance attention. As a result, our method enforces attribute binding while preserving global visual consistency. We evaluate MICE on LoMOE-Bench and introduce MICE-Bench, a more challenging benchmark with an average of 8.5 concurrent edits per image. The experiments demonstrate that our approach outperforms strong baselines and recent competitors in terms of visual quality preservation and faithfulness to the editing instructions.
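
As a rough illustration of the attention-bias idea described in the abstract above, the sketch below builds an additive bias matrix from per-token instance labels. It is not the authors' implementation: the function names, the `penalty` value, the explicit `neighbors` set, and the choice to leave context tokens (label `None`) unconstrained are all assumptions made for illustration.

```python
import math

def build_attention_bias(instance_ids, neighbors, penalty=2.0):
    """Return an n x n additive bias for attention logits.

    instance_ids: one label per token; None marks a context token (assumption).
    neighbors: set of frozenset({a, b}) pairs of instances treated as adjacent (assumption).
    penalty: hypothetical soft penalty applied between neighboring instances.
    """
    n = len(instance_ids)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = instance_ids[i], instance_ids[j]
            if a is None or b is None or a == b:
                continue  # intra-instance and context attention stay unbiased
            if frozenset((a, b)) in neighbors:
                bias[i][j] = -penalty   # penalize neighboring regions
            else:
                bias[i][j] = -math.inf  # suppress unrelated cross-instance attention
    return bias

def softmax_with_bias(scores, bias_row):
    """Softmax over one query's logits after adding its bias row."""
    logits = [s + b for s, b in zip(scores, bias_row)]
    m = max(logits)  # finite, because a token always shares an instance with itself
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Tokens: two for instance 0, one each for instances 1 and 2, one context token.
ids = [0, 0, 1, 2, None]
bias = build_attention_bias(ids, neighbors={frozenset((0, 1))})
weights = softmax_with_bias([0.0] * 5, bias[0])
```

In this toy setup the first token keeps full access to its own instance and to the context token, attends less to the neighboring instance, and receives zero weight on the unrelated one.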

101. 【2606.31275】CLIMB: Centroid-Based Hierarchical Memory for Online Continual Self-Supervised Learning

Link: https://arxiv.org/abs/2606.31275

Authors: Julien Lefebvre, Stefan Duffner, Mathieu Lefort

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Online Continual Self-Supervised, Online Continual, Continual Self-Supervised Learning, aims to learn, unlabeled data

Comments: Accepted at CoLLAs 2026 conference

Abstract:Online Continual Self-Supervised Learning (OCSSL) aims to learn representations from a continuous stream of unlabeled data, without knowledge of task boundaries and under memory constraints. Existing methods rely either on replay buffers that exploit latent space structure, or on regularization alone. We present CLIMB (Continual Learning with Intelligent Memory Bank), which combines both simultaneously. Our method introduces a hierarchical centroid-based memory, bounded in total number of stored images, combined with knowledge distillation on replayed examples to limit representation drift. The memory groups similar images into centroids, providing hard-to-discriminate examples for contrastive learning while covering the diversity of observed distributions. Experiments on Split CIFAR-100 and Split ImageNet-100, using both standard benchmarks from the literature and a new protocol with irregular task distributions, show that CLIMB outperforms state-of-the-art OCSSL methods.
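
To make the idea of a centroid-based memory with a global size bound more concrete, here is a minimal single-level sketch. It is a simplification and not the CLIMB method itself: the abstract describes a hierarchical memory, whereas this version is flat, and the class name, the `radius` threshold, and the eviction rule (drop the oldest sample from the largest cluster) are assumptions chosen for illustration.

```python
import math

class CentroidMemory:
    """Flat centroid memory bounded in total number of stored samples (illustrative only)."""

    def __init__(self, capacity, radius):
        self.capacity = capacity  # maximum number of stored samples overall
        self.radius = radius      # hypothetical distance threshold for opening a new centroid
        self.centroids = []       # one mean embedding per cluster
        self.members = []         # stored sample ids, one list per cluster

    def total(self):
        return sum(len(m) for m in self.members)

    def add(self, sample_id, emb):
        # Find the nearest existing centroid.
        best, best_d = None, math.inf
        for k, c in enumerate(self.centroids):
            d = math.dist(c, emb)
            if d < best_d:
                best, best_d = k, d
        if best is None or best_d > self.radius:
            # Too far from every centroid: open a new cluster.
            self.centroids.append(list(emb))
            self.members.append([sample_id])
        else:
            # Update the centroid as a mean weighted by the currently stored member count.
            n = len(self.members[best])
            self.centroids[best] = [
                (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)
            ]
            self.members[best].append(sample_id)
        # Enforce the global bound by evicting from the largest cluster (assumed policy).
        while self.total() > self.capacity:
            largest = max(range(len(self.members)), key=lambda k: len(self.members[k]))
            self.members[largest].pop(0)

mem = CentroidMemory(capacity=3, radius=1.0)
for sid, emb in [("a", (0.0, 0.0)), ("b", (0.1, 0.0)), ("c", (5.0, 5.0)),
                 ("d", (0.2, 0.0)), ("e", (0.0, 0.1))]:
    mem.add(sid, emb)
```

Evicting from the largest cluster keeps rare regions of the embedding space represented, which is one plausible way to cover the diversity of observed distributions under a fixed budget.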

102. 【2606.31270】Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

Link: https://arxiv.org/abs/2606.31270

Authors: Xueqiao Sun, Xiaohan Wang, Ludwig Schmidt, Serena Yeung-Levy, Yuhui Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: leverage multimodal large, multimodal large language, attracted significant attention, large language models, Computer-use agents

Comments: Published in ECCV 2026

103. 【2606.31258】WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

Link: https://arxiv.org/abs/2606.31258

Authors: Michael Green, Gavriel Habib, Dvir Samuel, Tal Berkovitz Shalev, Issar Tzachor, Rami Ben-Ari, Or Litany

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object, warped rendering, view synthesis, input view, implicit camera cue

Comments:

Abstract:Projection-conditioned novel view synthesis (NVS) warps an explicit 3D reconstruction of the input view into the target camera and conditions a generator on the warped rendering. This works well for small viewpoint changes but degrades sharply under large orbital motion: the warp becomes sparse around the orbited object, where hidden surfaces dominate the new view and mirror-like artifacts emerge, causing the generator to lose both pixel content and the implicit camera cue carried by the warp. We introduce WarpHammer, a training-free framework that resolves this failure mode by augmenting the warped scene with an explicit 3D reconstruction of the object obtained from a native 3D generative prior (e.g., SAM3D). The reconstructed object adds missing foreground surfaces and occludes background points that should no longer be visible, restoring both appearance and camera cues without fine-tuning the base model. The same explicit object representation further unlocks a capability current NVS pipelines do not support: incorporating auxiliary views of the object from sources outside the target scene, for example, a casual snapshot of a car paired with a manufacturer studio shot of the same model. We process the reference and auxiliary images jointly with a pretrained multi-view geometry foundation model, which predicts a unified point cloud that we fuse into the 3D object reconstruction. This yields substantially more faithful geometry than single-image reconstruction, without requiring user-provided camera poses for the auxiliary views. On five benchmarks, WarpHammer produces stable novel views at viewpoint deviations where strong baselines collapse, and is the first scene-level NVS method that can naturally fuse auxiliary, pose-unknown object views from an external source.

104. 【2606.31257】Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

Link: https://arxiv.org/abs/2606.31257

Authors: Chih-Ting Liao, Fei Shen, Xin Cao, Tat-Seng Chua

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: linear probe confirmed, read latent knowledge, systematically overstate, latent knowledge, read latent

Comments:

Abstract:The standard way to read latent knowledge out of a model, a linear probe confirmed by a steering recovery, can systematically overstate what a vision-language model (VLM) actually grounds in the image. We show this on spatial reasoning, where the error is invisible to both probing and steering yet exposed by a one-line causal control: replacing the image with a gray blank. Probes decode the within-axis answer at 73–97% across axes, and a training-free projection lifts a near-chance axis from 59% to 79%, exactly the signature of unlocking latent knowledge. The blank-image arbiter refutes it, revealing three grounding regimes that probing conflates: an axis can be grounded (vision-dependent, correct), a prior (vision-independent, with its decode and its apparent recovery a directional default rather than perception), or, surprisingly, inverted: decodable, causally controllable, but deployed with the wrong sign, so the model scores below chance and the error requires looking. The taxonomy holds across the studied VLMs: in fourteen models spanning six language-model families and 2B–27B, horizontal is grounded, vertical is a prior, and depth is inverted, with the inversion emerging at scale within families. The decode-versus-deploy inversion replicates on seven of eight models across five families, and the minimal edit that re-deploys it varies with geometry: a training-free rotation matches a trained edit on the cleanest model, while distributed inversions need a trained low-rank edit, tracing a per-model correction-complexity spectrum. The cheap, self-calibrating arbiter cleanly separates grounded perception, inverted perception, and prior substitution; we argue it should be a default control for latent-knowledge and steering claims in VLMs.
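
The three regimes named in the abstract above can be read as a simple decision rule over two accuracies: one measured with the real image and one with the image replaced by a gray blank. The sketch below is one possible reading, not the paper's procedure: the function name, the `chance` level, and the `margin` used to decide whether vision matters are hypothetical.

```python
def classify_axis(acc_image, acc_blank, chance=0.5, margin=0.05):
    """Label a spatial axis from accuracy with the real image vs. a blank image.

    margin is an assumed tolerance: if blanking the image changes accuracy
    by less than this, the answer is treated as vision-independent.
    """
    if abs(acc_image - acc_blank) < margin:
        return "prior"      # vision-independent: the model falls back on a default
    if acc_image > chance:
        return "grounded"   # vision-dependent and correct
    return "inverted"       # vision-dependent but below chance (wrong sign)

labels = {
    "horizontal": classify_axis(0.90, 0.50),
    "vertical": classify_axis(0.60, 0.59),
    "depth": classify_axis(0.30, 0.50),
}
```

The accuracy values here are made up to exercise each branch; they are not results from the paper.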

105. 【2606.31249】Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition

Link: https://arxiv.org/abs/2606.31249

Authors: Xiaochuan Guo, Jihao Gu, Haixu Liu, Yuxin Liu, Qi Wang, Yufei Wang, Fei Wang, Kun Li, Dan Guo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: XInsight Lab, place in Track, achieved first place, test accuracy, X-CLIP video features

Comments:

Abstract:In this paper, we present the solution developed by our team, XInsight Lab, which achieved first place in Track 3 of the 4th EI-MIGA-IJCAI Challenge with a test accuracy of 0.76923. To address the challenge of weak and sparse implicit emotion evidence in long videos, this paper extends the winning solution from the previous competition and proposes a compact multi-modal temporal modeling framework. The framework integrates and evaluates the effects of multi-source features, including 2D/3D skeletons, facial expression Blendshapes, DINOv2/v3 vision foundation models, X-CLIP video features, and Gemini semantic priors. Architecturally, we propose a cross-attention mechanism that utilizes static pose features, denoted as Base, as the Query and dynamic micro-motion differential features, denoted as Offset, as the Key and Value. By capturing local relative velocities, this mechanism eliminates static biases related to individual body shape and identity. Concurrently, an adaptive pooling method based on Multiple Instance Learning is employed to extract instantaneous emotions while suppressing background noise in long sequences. Finally, the paper reveals the representation collapse phenomenon of general vision foundation models in micro-dynamic tasks, and analyzes the underlying mechanisms where networks fall into public-leaderboard-driven pseudo-generalization due to shortcut learning and rote memorization.

106. 【2606.31245】HyperVLP: Enhancing Hierarchical Surgical Video-Language Pre-training in Hyperbolic Space

Link: https://arxiv.org/abs/2606.31245

Authors: Yaojun Hu, Kun Yuan, Nassir Navab, Haochao Ying, Jian Wu, Nicolas Padoy

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: adopt educational materials, vision-language foundation models, foundation models typically, models typically adopt, typically adopt educational

Comments:

Abstract:Surgical vision-language foundation models typically adopt educational materials, such as surgical lecture videos, to transfer surgical knowledge encoded in language into visual representations. This knowledge is multi-dimensional and hierarchical: fine-grained action cues appear in narration, mid-level key steps are summarized in subsection headings, and global procedural context, such as patient history and surgical strategy, is described in abstract texts. Prior work largely collapses these heterogeneous signals into a single flat embedding space, implicitly assuming independence across hierarchy levels. However, this is suboptimal because it ignores cross-level semantic containment (e.g., actions belong to steps, and steps compose phases) and weakens long-range dependency modeling. To this end, we propose a hyperbolic surgical video-language pre-training framework that explicitly preserves the hierarchical structure by mitigating structural false negatives induced by procedural context and enforcing semantic consistency between parent phases and their constituent child steps. Extensive experiments on multiple surgical benchmarks show consistent gains in zero- and few-shot phase recognition across procedures and institutions.

107. 【2606.31242】UHD-MFF: Shattering Barriers in Multi-Focus Ultra-High-Definition Image Fusion via Learnable Lookup Tables

Link: https://arxiv.org/abs/2606.31242

Authors: Yibing Zhang, Xunpeng Yi, Qinglong Yan, Yeda Wang, Han Xu, Jiayi Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Lookup Table, modern visual applications, increasingly essential, essential in modern, Coarse-Region Lookup Table

Comments: Accepted by ECCV 2026

Abstract:With the advancement of imaging technology, ultra-high-definition images have become increasingly essential in modern visual applications. However, existing multi-focus image fusion remains largely confined to low-resolution images and faces three major barriers in UHD scenarios, namely data availability, model adaptability, and deployment feasibility, which severely hinder its practical application. To shatter these barriers, first, we propose the UHD-MFF dataset, the first large-scale ultra-high-resolution multi-focus fusion dataset. Second, we propose a scale-specialized lookup-table framework tailored for ultra-high-resolution images, termed UMF-LUT. It consists of a Coarse-Region Lookup Table (C-LUT) and a Detail-Edge Lookup Table (D-LUT). Specifically, C-LUT performs joint queries of multiple gradient cues and semantic cues at low-resolution scales to enable region-level decision-making, while D-LUT operates at high-resolution scales, leveraging efficient Laplacian cues to provide complementary edge-level decision information. Such a design makes the model particularly well-suited for ultra-high-resolution multi-focus image fusion. Finally, it offers strong deployability with minimal computational overhead, enabling real-time 4K multi-focus fusion and showing promising potential for smartphones. Extensive experiments demonstrate that it outperforms SOTA methods in both visual fidelity and quantitative metrics. It effectively advances the development of multi-focus image fusion toward ultra-high-resolution imaging scenarios. The code is available at this https URL.

108. 【2606.31226】ForgeDrive: Bidirectional Cross-Conditioning for Unified Visual-Action Generation in Autonomous Driving

Link: https://arxiv.org/abs/2606.31226

Authors: Xuchang Zhong, He Zheng, Chenxu Zhao, Tianxiong Lv, Hangqi Fan, Bohua Wang, Yushan Liu, Zhihao Liao, Leigang Luo, Congyang Zhao, Yang Cai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understand scene evolution, autonomous driving endows, scene evolution, understand scene, autonomous driving

Comments:

Abstract:World-model-based autonomous driving endows the model with the ability to understand scene evolution. Yet this promise is undermined by the prevailing imagine-then-act paradigm, which allows errors from the more challenging visual generation stage to cascade into action planning. We introduce ForgeDrive, a unified autoregressive diffusion framework with visual-action cross-conditioning that closes this gap through an act-then-imagine paradigm. ForgeDrive factorizes the future as a sequence of per-timestep frame-action pairs, intertwining each action with its corresponding visual observation. During training, we decouple the diffusion timesteps of the two modalities and introduce a UniDiffuser-style noise scheduler so that the model can infer either modality from its counterpart and gains a deeper understanding of the relationships between images and actions. At inference, we propose a novel act-then-imagine inference paradigm, and find that at each step, action generation is a capability internalized during training, requiring no clean future frame as a prerequisite at inference time; instead, the generated action can improve the accuracy of future frame generation, which in turn enhances the quality of the next action. Additionally, we augment each step with future ego-status prediction, further sharpening planning ability. Extensive experiments on NAVSIM demonstrate that ForgeDrive not only unifies driving simulation, planning, and visual odometry into a single model, but also outperforms existing strong planners without any post-training strategy.

109. 【2606.31219】CooperScene: Multi-Modal Cooperative Autonomy Benchmark with C-V2X Communication Characterization

Link: https://arxiv.org/abs/2606.31219

Authors: Bo Wu, Ruoshen Mo, Justin Yue, Yanyu Zhang, Janice Nguyen, Guoyuan Wu, Amit Roy-Chowdhury, Matthew J. Barth, Hang Qiu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables cooperative perception, individual agents, field of view, view of individual, enables cooperative

Comments: Accepted to ECCV 2026. 15 pages, 15 figures

Abstract:Cellular vehicle-to-everything (C-V2X) enables cooperative perception, prediction, and planning beyond the field of view of individual agents. However, existing datasets often overlook the complexities of real-world deployment, such as limited communication bandwidth and its dynamics, heterogeneous sensing modalities, and scalability beyond a single cooperative partner. In this paper, we introduce CooperScene, a high-fidelity cooperative autonomy dataset with real-world C-V2X communication characterization. The dataset is organized into diverse scenes, including intersections, highway ramps, and parking lots. These scenes involve three connected and autonomous vehicles (CAVs) and one infrastructure roadside unit (RSU), all equipped with multi-modal sensors and commercial off-the-shelf C-V2X communication radios. All scenes are annotated with globally consistent 3D labels at 10 Hz, totaling 344K objects across 59K frames, underpinned by tight sensor- and agent-synchronization, centimeter-level localization and spatial alignment, precise cross-modality calibration, and 3GPP-standard-compliant C-V2X communication. CooperScene establishes a rigorous benchmark for evaluating multi-agent scaling and actual performance in real-world deployable settings. Project website for data and benchmark: this https URL

110. 【2606.31211】AA: A Multi-view Multimodal Dataset for Screen-based Gaze Estimation

Link: https://arxiv.org/abs/2606.31211

Authors: Chang Liu, Jiaqi Liu, Zhoutong Ye, Xinjie Shen, Chun Yu, Yuanchun Shi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: screen-based gaze estimation, gaze estimation, screen-based gaze, Abstract, gaze

Comments:

Abstract:We present AA, a multi-view multimodal dataset for screen-based gaze estimation. The dataset captures synchronized facial observations from eight fixed screen-mounted cameras and two additional side-view cameras, paired with precise screen-space gaze targets collected under controlled fixation conditions. Each sample contains multi-view face observations together with structured facial region crops, enabling multimodal learning from both global and local visual cues. Unlike existing single-view gaze datasets, AA provides multi-view coverage from both screen-mounted and side-mounted perspectives, enabling more robust modeling under viewpoint variation and occlusion. The dataset includes subject-independent evaluation splits and a standardized data processing pipeline to support reproducible research in gaze estimation.

111. 【2606.31204】AC3S: Adaptive Conditioning for 3D-Aware Synthetic Data Generation

Link: https://arxiv.org/abs/2606.31204

Authors: Eric Ji, Qiran Hu, Wufei Ma, Sarthak Jain, Yingying Li, Minh N. Do, Yaoyao Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: improving data scalability, powerful tool, tool for improving, scalability in computer, Synthetic data generation

Comments: Accepted by ECCV 2026. Project page: [this https URL](https://ac3s.cvmlgroup.web.illinois.edu/)

Abstract:Synthetic data generation has emerged as a powerful tool for improving data scalability in computer vision. Recent diffusion-based pipelines have demonstrated strong photorealism. However, enforcing precise 3D structure and pose consistency in generated images remains challenging. Existing methods leverage visual prompts such as edge maps to guide diffusion models, but often suffer from over-conditioning artifacts that degrade image realism and limit dataset quality. In this paper, we present a diffusion-based image generation framework that enforces 3D structural alignment while preserving photorealism through adaptive conditioning. Our framework, Adaptive Conditioning for 3D-Aware Synthetic Data Generation (AC3S), introduces a self-supervised visual prompt modulator that dynamically adjusts the strength of ControlNet conditioning, preventing over-conditioning and enabling the diffusion model to retain its generative expressiveness. To further enhance diversity and semantic consistency, we develop a multi-agent vision language model framework that composes detailed and 3D-aware prompts aligned with the underlying geometric structure. Together, these components enable the scalable generation of high-quality synthetic datasets with accurate 2D and 3D annotations. Extensive experiments demonstrate that our method significantly improves image quality and downstream utility.

112. 【2606.31201】ExPLoRe: Expert Patch-Level Loss Routing for Multi-Objective Masked Image Modeling

Link: https://arxiv.org/abs/2606.31201

Authors: Konstantinos Georgiou, Maofeng Tang, Hairong Qi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-objective masked image, combines complementary learning, ignoring spatial heterogeneity, masked image modeling, complementary learning signals

Comments: Accepted to ECCV 2026. Main paper 15 pages, 3 figures; supplementary material included as appendix

Abstract:Multi-objective masked image modeling (MIM) combines complementary learning signals (token distillation, CLS alignment, and pixel reconstruction) but existing methods weight these objectives with global scalars, ignoring spatial heterogeneity across patches. We present ExPLoRe (Expert Patch-Level Loss Routing), which repurposes Soft Mixture of Experts (MoE) dispatch weights as learned, per-patch loss coefficients. The key mechanism is loss-coupling: allowing loss gradients to flow through dispatch weights to the router enables content-dependent specialization, where different patches receive different emphases across objectives. A detach ablation confirms loss-coupling as the core mechanism, degrading performance by 1.6% when gradients are blocked. On ImageNet-1K with ViT-Base, ExPLoRe improves over non-MoE baselines on two objective combinations (Token+CLS: +0.5% k-NN, +4.4% linear probe; Token+Pixel: +2.2% k-NN), achieving 80.6% linear probe and 85.3% finetuning accuracy, competitive with published methods. For downstream transfer, we develop adaptation recipes (Freeze Routing, Expert Dropout, and Freeze Attention) that improve MoE finetuning by +1.5% over the vanilla MoE, and close a 2.5–2.9 mIoU segmentation gap so that MoE models match or exceed non-MoE baselines on ADE20K.

113. 【2606.31198】Distilling Temporal Coherence into 2D Networks for Transrectal Ultrasound Prostate Video Segmentation

Link: https://arxiv.org/abs/2606.31198

Authors: Dong Yeong Kim, JunGyu Lee, Jaewon Choi, June Young Seo, Myeongseop Kim, Jinwook Choi, Taek Min Kim, Young-Gon Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Transrectal Ultrasound, image-guided interventions, Temporally Consistent Learning, Consistent Learning Framework, essential for image-guided

Comments: Accepted for publication at the 29th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2026)

Abstract:Real-time video segmentation of the prostate in Transrectal Ultrasound (TRUS) is essential for image-guided interventions. While conventional 2D methods suffer from inter-frame inconsistencies by disregarding temporal context, 3D architectures incur prohibitive latency. To resolve this dilemma, we present a Temporally Consistent Learning Framework that distills temporal coherence into a 2D network during training, preserving single-frame inference efficiency. Our design is driven by a key clinical observation: the prostate exhibits geometric stability, whereas the surrounding acoustic environment fluctuates due to physiological motion and transducer pressure. Because conventional temporal constraints propagate erroneous gradients from these unstable regions, we introduce a Confidence-Weighted Temporal Consistency objective derived from optical flow warping residuals, selectively attenuating contributions from unreliable regions. Complementing this pixel-wise constraint, a Dual-scale Prototype Alignment Module enforces semantic coherence through contrastive optimization of local boundary and global semantic features. Furthermore, to eliminate the need for dense per-frame video annotations, we employ geometric equivariance-based pseudo-labeling with knowledge distillation from a pretrained teacher. Extensive experiments on SUN-SEG and our newly introduced TRUS-V benchmark (2,679 frames) demonstrate state-of-the-art accuracy and temporal consistency at real-time speed. Code and dataset are available at this https URL.
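
The confidence-weighting idea in the abstract above can be sketched in a few lines: pixels where optical-flow warping leaves a large residual contribute less to the temporal consistency penalty. This is only a toy version over flat lists. The exponential weighting, the temperature `tau`, and the function name are assumptions, since the abstract does not specify how residuals are mapped to confidences.

```python
import math

def confidence_weighted_consistency(pred_t, warped_prev, residual, tau=1.0):
    """Weighted mean squared difference between the current prediction and the
    flow-warped previous prediction.

    residual: per-pixel warping residual; larger means less reliable.
    tau: hypothetical temperature controlling how fast confidence decays.
    """
    weights = [math.exp(-r / tau) for r in residual]  # assumed confidence mapping
    num = sum(w * (p - q) ** 2 for w, p, q in zip(weights, pred_t, warped_prev))
    return num / max(sum(weights), 1e-12)  # guard against all-zero weights

pred_t = [1.0, 0.0, 1.0]
warped_prev = [1.0, 0.0, 0.0]

# The only disagreeing pixel has a large warping residual, so it is down-weighted.
weighted = confidence_weighted_consistency(pred_t, warped_prev, [0.0, 0.0, 10.0])
# With zero residual everywhere the loss reduces to a plain mean squared difference.
unweighted = confidence_weighted_consistency(pred_t, warped_prev, [0.0, 0.0, 0.0])
```

In the example, the disagreement sits in an unreliable region, so the weighted loss is far smaller than the unweighted one. This mirrors the stated goal of attenuating contributions from unstable regions.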

114. 【2606.31187】Learning to Deny: Action Denial in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.31187

Authors: Raiyaan Abdullah, Shehreen Azad, Yogesh Singh Rawat

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieving strong zero-shot, Multimodal large language, rapidly advanced video, large language models, large language

Comments: Accepted to ECCV 2026 main conference

Abstract:Multimodal large language models (MLLMs) have rapidly advanced video understanding, achieving strong zero-shot and few-shot recognition across standard benchmarks. Yet their ability to deny an action by recognizing when an activity is not happening despite strong contextual cues remains largely unexplored. We introduce UCF101-AD, a large-scale benchmark consisting of paired Action-Presence and Action-Denial clips, designed to evaluate this capacity for denial. Each negative video in UCF101-AD preserves the same contextual and motion cues, including persons, objects, and locations, as its positive counterpart, but the defining action itself is explicitly absent. Evaluating 20 state-of-the-art MLLMs reveals a consistent failure: models that exceed 85% accuracy on the positive action classes collapse below 50% on their action-denial counterparts, indicating a strong inclination to affirm plausible actions rather than verify that they truly occur. This exposes a critical blind spot in modern video understanding: the inability to reason causally about whether a motion actually happens. To probe this issue, we explore a causal graph formulation, CausalAct, which expresses scene structure through natural-language prompts linking context, interaction, and motion. Incorporating such causal cues substantially reduces false positives, demonstrating that denial is a learnable reasoning skill. UCF101-AD provides a new lens for diagnosing and improving causal reasoning in multimodal models. Dataset and relevant code: this https URL.

115. 【2606.31179】HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

Link: https://arxiv.org/abs/2606.31179

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world healthcare applications, rigorous and holistic, increasingly capable, holistic evaluation, evaluation is essential

Comments:

116. 【2606.31177】GaussianMap: Learning Gaussian Representation for Multi-Sensor Online HD Map Construction

Link: https://arxiv.org/abs/2606.31177

Authors: Hongyu Lyu, Julie Stephany Berrio Perez, Mao Shan, Stewart Worrall

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous driving systems, driving systems benefit, provide critical information, Autonomous driving, benefit from high-definition

Comments:

Abstract:Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local vectorized maps from onboard sensor observations. Existing methods commonly adopt bird's-eye-view (BEV) features as the intermediate scene representation, encoding the surrounding space with fixed-resolution dense grids. However, map elements are spatially sparse yet require fine-grained geometric localization, making uniformly allocated BEV representations redundant and less effective for vectorized map prediction. In this work, we propose GaussianMap, an online HD map construction framework that learns an adaptive Gaussian representation of the surrounding scene. This representation consists of a set of Gaussian primitives on the BEV plane, each encoding a flexible local region with geometric properties and a feature vector, allowing the model to allocate representational capacity to map-relevant regions. To generate such a representation from sensor observations, we introduce a feed-forward Gaussian encoder that progressively refines these primitives through Gaussian interaction modeling and multi-sensor feature aggregation. The refined Gaussian representation is then splatted into a BEV feature map and decoded into vectorized map predictions. Extensive experiments on nuScenes and Argoverse 2 datasets demonstrate that GaussianMap achieves state-of-the-art performance in both camera-only and camera-LiDAR fusion settings. Our code will be made publicly available.

117. 【2606.31172】HSDF-Lane: Height-Aligned Signed Distance Field with Semantic Lane Prior for 3D Lane Detection

Link: https://arxiv.org/abs/2606.31172

Authors: Jiyong Boo, Byeongin Joung, Hyemin Yang, Kuk-Jin Yoon

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherent depth ambiguity, remains challenging due, single image remains, image remains challenging, autonomous driving

Comments: ECCV 2026, Project page: [this https URL](https://jiyongboo.github.io/HSDF-Lane-project-page)

Abstract:Monocular 3D lane detection plays a critical role in autonomous driving, yet recovering reliable 3D geometry from a single image remains challenging due to inherent depth ambiguity. Prior methods project image features into Bird's-Eye-View (BEV) space under a flat-ground assumption, causing geometric distortion on real-world roads. Recent methods instead predict explicit height maps to capture non-planar surfaces, but still rely on sparse anchor-based regression and exploit the recovered geometry merely for spatial transformation rather than semantic understanding. To overcome these limitations, we propose HSDF-Lane, which implicitly models the road surface as a Height-aligned Signed Distance Field (HSDF) over a densely sampled 3D feature volume. Through differentiable rendering, the HSDF jointly produces an accurate height map and surface-aligned features. We further introduce Lane-aware Semantic Positional Encoding (LSPE), which injects a lane-existence prior derived from the surface-aligned features into the transformer queries, coupling geometric structure with semantic guidance. Extensive experiments on the OpenLane benchmark show that HSDF-Lane achieves state-of-the-art performance in both 3D lane detection and height map estimation.

118. 【2606.31169】Beyond Single Character: Evaluating MLLMs for Sentence-Level Oracle Bone Inscription Understanding

Link: https://arxiv.org/abs/2606.31169

Authors: Ziqi Li, Zijian Chen, Tingzhu Chen, Guangtao Zhai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: oracle bone inscription, complete divination charges, AI-assisted oracle bone, long-form textual coherence, Sentence-level OBI understanding

Comments: 13 pages, 4 figures

Abstract:Existing AI-assisted oracle bone inscription (OBI) visual recognition and understanding studies mainly focus on the character level, ignoring the long-form textual coherence and contextual dependencies embedded in complete divination charges. Recently, the powerful visual perception capabilities of multimodal large language models (MLLMs) have opened new possibilities for OBI information processing. In this work, we introduce S-OBI, a novel benchmark for evaluating MLLMs in Sentence-level OBI understanding. Instead of using noisy and incomplete rubbings as the visual input, S-OBI synthesizes clear and standardized sentence-level OBI instances through glyph substitution and composition. Starting from 95 original rubbings with translations that have been identified, corrected, and verified by experts, we replace characters in the original rubbings with corresponding clean glyph samples sourced from existing OBI datasets while preserving the overall inscriptional structure and semantic organization. This mitigates the influence of low-level distortions and enables a more focused evaluation of sentence-level OBI understanding. Based on this, we design semantic matching, semantic slot extraction, and contextual reasoning tasks and obtain 695 question-answer pairs. Experiments reveal that contemporary MLLMs perform poorly on sentence-level OBI understanding. In particular, visual perception errors in unmasked regions propagate through the reasoning chain, leading to erroneous predictions for masked characters, which indicates that sentence-level OBI understanding in current models remains strongly dependent on character-level recognition. Overall, S-OBI provides a diagnostic benchmark for evaluating whether MLLMs can move beyond isolated character recognition toward structured inscription-level understanding.

119. 【2606.31164】Seeing Through the Weights: Privacy Leakage in Scene Coordinate Regression

链接：https://arxiv.org/abs/2606.31164

作者：Oleksii Nasypanyi,Jaemin Cho,Utku Ozbulak,Byungkon Kang,Francois Rameau

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Scene Coordinate Regression, Coordinate Regression, visual localization, increasingly adopted, adopted for visual

备注：

点击查看摘要

Abstract:Scene Coordinate Regression (SCR) methods are increasingly adopted for visual localization. In these approaches, the scene is implicitly encoded within a neural network that regresses a 3D world coordinate for each image pixel. Because the scene is represented only through the network parameters and not stored explicitly as images or maps, such methods are often assumed to be privacy-preserving. In this work, we show that this assumption is incorrect in practice. Specifically, we introduce a query-based attack that reconstructs the 3D geometry of the training environment from an SCR model under different levels of model access. To do so, we repeatedly query the model with batches of proxy images unrelated to the target scene to obtain dense pixel-wise 3D coordinates. Reliable points are identified through their stability under small input perturbations and can be further refined in a white-box setting. These stable points are accumulated across independent query batches to recover the scene geometry. From the recovered 3D representation, we also invert the network features to synthesize images from arbitrary viewpoints, revealing additional appearance information. Experiments on indoor and outdoor datasets demonstrate that substantial portions of training environments can be reconstructed with high geometric fidelity. Beyond geometry, we also recover an approximate color appearance, which exposes recognizable layout and potentially sensitive scene elements. This directly contradicts claims in the literature that SCR representations are privacy-preserving by design, and reveals a real risk when such systems are deployed in private or security-critical spaces. The project page is available at this https URL.

120. 【2606.31160】Reasoning-aware Speculative Decoding for Efficient Vision-Language-Action Models in Autonomous Driving

链接：https://arxiv.org/abs/2606.31160

作者：Anh Dung Dinh,Simon Khan,Flora Salim

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：autonomous driving emit, reasoning step, autonomous driving, driving emit, reasoning

备注： 10 pages

点击查看摘要

Abstract:Modern Vision-Language-Action (VLA) planners for autonomous driving emit a chain-of-causation (CoC) reasoning step before producing a trajectory. The reasoning is autoregressive and dominates inference latency, while the trajectory head is parallel and cheap. Latency is an operational constraint in autonomous driving, so accelerating the reasoning step is the central problem we address. We observe that CoC reasoning has two qualitatively different needs: most tokens continue routine setup that follows naturally from the ego-trajectory history, and a small fraction encode commitments that require fresh visual evidence about an unexpected situation. We split this reasoning into two specialized paths: a routine reasoner that handles the predictable continuation by attending to trajectory history, and a deliberative reasoner (the unmodified VLA target) that handles novel cases by attending to current visual evidence, using the speculative decoding framework as the architectural template for how the two paths cooperate. Unlike standard speculative decoding, our routine reasoner is not a smaller replica of the target; the two reasoners are deliberately specialized to read different parts of the prompt. We propose two techniques to realize this. First, we introduce FlatRoPE, a 1D rotary positional embedding in the draft that breaks the rotational symmetry of the target's 3D M-RoPE, redirecting attention away from visual tokens and onto trajectory-history tokens. Second, we introduce Action-aware RL (AARL), a post-training stage that uses an action-quality reward together with a static-reference KL anchor. Together, our two-reasoner system reduces the reasoning-step running time by approximately 4× relative to the original Alpamayo planner.
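The draft-then-verify cooperation that the abstract borrows from speculative decoding can be pictured with a minimal greedy loop. This is a generic sketch, not the paper's implementation: `draft_next` and `target_next` are hypothetical callables standing in for the routine and deliberative reasoners, tokens are plain integers, and verification is written sequentially even though a real system would run it as one parallel pass.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: maps a token prefix to the model's greedy next token.
NextFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int) -> List[Token]:
    """Propose k draft tokens, keep the longest run the target agrees with,
    then append one token from the target (a correction or a bonus token)."""
    # 1) The cheap draft proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the rejected draft
            return accepted
        accepted.append(tok)

    # 3) Every draft token was accepted, so the target adds one more token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft_next: NextFn, target_next: NextFn,
             k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]
```

With greedy verification the output matches what the target alone would have produced; the saving comes from how many draft tokens are accepted per target pass.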

121. 【2606.31157】Rethinking Foundation Model Collaboration: Enhancing Specialized Models through Proxy Task Reasoning

链接：https://arxiv.org/abs/2606.31157

作者：Hongyi Lin,Yang Liu,Jinhua Zhao,Xiaobo Qu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：embodied intelligence systems, requires precise geometric, tasks requires precise, intelligence systems, numerical estimation

备注：

点击查看摘要

Abstract:Foundation models are increasingly integrated into embodied intelligence systems, but directly assigning them structured prediction tasks requires precise geometric and numerical estimation, where specialized models often remain stronger. This capability mismatch raises a key question: should foundation models replace task-specific predictors, or should they collaborate through tasks better aligned with their strengths? We propose FAT, a foundation-model-augmented task-specific reasoning framework that treats collaboration as task decomposition rather than model replacement. FAT decomposes structured prediction into specialist prediction, information-space reconstruction, and foundation-model proxy reasoning. The specialist generates geometrically and physically valid hypotheses in the native output space, while the foundation model performs a bounded proxy task, such as selection or verification, over reconstructed multimodal candidates. We instantiate this principle as ProxySelect with a vision-language model. Across 2D object detection, 3D object detection, trajectory prediction, and semantic segmentation, ProxySelect consistently improves specialized baselines and substantially outperforms direct foundation-model regression at lower computational cost. These results suggest a general collaboration principle: specialized models preserve task-specific structure, while foundation models refine their hypotheses through contextual proxy reasoning.
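The three-part decomposition (specialist prediction, reconstruction, proxy selection) can be sketched as a small selection routine. Everything here is illustrative: `proxy_select`, `reconstruct`, and `judge` are hypothetical names, and `judge` stands in for whatever score a foundation model would assign to a reconstructed candidate.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")  # a specialist hypothesis, e.g. a box or a trajectory


def proxy_select(candidates: Sequence[C],
                 reconstruct: Callable[[C], str],
                 judge: Callable[[str], float]) -> C:
    """Return the specialist candidate that the judge scores highest.

    The specialist keeps ownership of the numeric output; the judge only
    ranks reconstructed views of the candidates and never regresses values.
    """
    if not candidates:
        raise ValueError("the specialist must supply at least one candidate")
    return max(candidates, key=lambda c: judge(reconstruct(c)))
```

Because the result is always one of the specialist's own hypotheses, it stays valid in the native output space, which is the property the abstract attributes to the specialist.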

122. 【2606.31148】PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding

链接：https://arxiv.org/abs/2606.31148

作者：Duc Cao Dinh,Khai Le-Duc,Florent Draye,Chris Ngo,Terry Jingchen Zhang,Bernhard Schölkopf,Zhijing Jin

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：localize target objects, Visual Grounding, aims to localize, natural language descriptions, localize target

备注： Preprint

点击查看摘要

123. 【2606.31147】WaterGen: Decoupling Scene and Medium in Underwater Image Generation

链接：https://arxiv.org/abs/2606.31147

作者：Jiayi Wu,Tianfu Wang,Tianyi Xiong,Dehao Yuan,Xiaomin Lin,Md Jahidul Islam,Cornelia Fermuller,Christopher Metzler,Yiannis Aloimonos

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：computer vision tasks, Underwater computer vision, Underwater, vision tasks, diverse

备注：

点击查看摘要

Abstract:Underwater computer vision tasks, such as detection, restoration, and segmentation, are limited by the scarcity of large-scale and diverse training data. We introduce WaterGen, a method for generating large-scale, realistic, and diverse underwater images that provides independent control of the scene and water medium conditions. Our approach treats underwater image generation as the decoupled control of two factors: realistic and diverse scene content (what is in the image), and accurate and controllable water medium effects (what the water does to the image). Existing methods generally achieve only part of this objective: they either provide controllability with limited realism or diversity, or generate realistic scenes without accurately and independently modeling water-medium effects. Our key insight, that allows us to avoid this compromise, is that scene generation and medium modeling can be decoupled within a latent diffusion framework, enabling diverse scene generation together with accurate and controllable underwater appearance. To do this, we decompose underwater image synthesis into two stages. First, we fine-tune the latent diffusion U-Net using degradation-free underwater images so that it learns to generate diverse and realistic latent embeddings of underwater scene content without medium-induced degradation. Second, we formulate the physically accurate medium degradation synthesis as a conditional decoding process applied to these latent embeddings. This decoupled design allows our model to generate diverse scenes with full control of underwater appearance. We leverage WaterGen to build large-scale synthetic underwater datasets that are diverse in scene structures and accurate in water effects and pseudo-labels. We demonstrate that our synthetic data consistently improve downstream performance in underwater restoration and semantic segmentation.

124. 【2606.31136】FROST: Training-Free Few-Shot Segmentation with Frozen Features and Nonparametric Statistics

链接：https://arxiv.org/abs/2606.31136

作者：Junghwan Park

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：imagery departs sharply, natural images, remote sensing, backbones are pretrained, delineate a target

备注： 20 pages

点击查看摘要

Abstract:Few-shot segmentation asks a model to delineate a target class in a query image from only a handful of annotated examples, a setting most acute in remote sensing, where labels are scarce and the imagery departs sharply from the natural images on which vision backbones are pretrained. Prevailing approaches either train a segmenter on labelled episodes, which raises accuracy within the training distribution but binds the model to it, or reduce each class to a lossy summary of frozen features, a single prototype, a few cluster prototypes, or a discrete clustering, none of which preserves the internal structure of a multimodal class. We argue that a class is better described by a distribution than by a point, and that frozen self-supervised features already carry enough structure to estimate that distribution directly. We introduce FROST, a training-free few-shot segmenter that treats the reference foreground and background as two point clouds on the unit sphere of frozen DINOv3 features and labels each query token by a nonparametric density ratio, with a threshold the Bayes rule fixes at zero under equal priors. Because the variance of a density estimate shrinks as its sample grows, the decision sharpens as references accumulate, and every remaining quantity from the kernel bandwidth to the spatial gate is read from the support set rather than tuned. We develop FROST for overhead imagery, where a class is typically a scatter of many small and dissimilar instances that a density tracks but a lossy summary blurs. Across seventeen remote-sensing benchmarks FROST surpasses both training-free and learning-based methods, leading by 5.6 mIoU from a single annotated example and widening its lead as the support set grows, all while remaining among the smallest models compared. Code is available at this https URL.
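The density-ratio rule with a zero threshold can be illustrated in a few lines. This is a sketch under stated assumptions, not the FROST code: it uses an exponential kernel on the unit sphere with a fixed concentration `kappa`, whereas the abstract says the bandwidth is read from the support set, and the spatial gate is omitted.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def normalize(v: Vec) -> List[float]:
    """Project a feature vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def log_kde(query: Vec, cloud: Sequence[Vec], kappa: float) -> float:
    """Log of the mean kernel response exp(kappa * <query, p>) over the cloud."""
    scores = [kappa * sum(q * c for q, c in zip(query, p)) for p in cloud]
    m = max(scores)  # log-sum-exp keeps the computation numerically stable
    return m + math.log(sum(math.exp(s - m) for s in scores)) - math.log(len(cloud))


def is_foreground(query: Vec, fg: Sequence[Vec], bg: Sequence[Vec],
                  kappa: float = 10.0) -> bool:
    """Label a query token foreground when the log density ratio exceeds zero.

    With the same kappa for both clouds the kernel's normalizing constant
    cancels, and zero is the Bayes threshold under equal class priors.
    """
    q = normalize(query)
    fg_n = [normalize(p) for p in fg]
    bg_n = [normalize(p) for p in bg]
    return log_kde(q, fg_n, kappa) - log_kde(q, bg_n, kappa) > 0.0
```

Adding more reference points to `fg` or `bg` only changes the averages inside `log_kde`, which is why the decision can sharpen as the support set grows without any retraining.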

125. 【2606.31135】MSNN-LINet: Cross-Modal Learning via Continuous Linear Integration

链接：https://arxiv.org/abs/2606.31135

作者：Gabriel Clinger

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Multi-Stream Neural Network, Linear Integration Network, Neural Network, Multi-Stream Neural, Integration Network

备注： 14 pages, 6 figures, 3 tables

点击查看摘要

Abstract:We present LINet (Linear Integration Network), a Multi-Stream Neural Network (MSNN) for RGB-D scene classification. Current multi-modal architectures treat feature fusion as a discrete, ad-hoc event: early fusion entangles representations prematurely, late fusion isolates them until the final layer, and hybrid or attention-based methods require architectural guesswork to place intermediate fusion blocks. LINet addresses this structural compromise by maintaining three dedicated parallel streams (RGB, depth, and integration) where a novel Linear Integration Convolution (LIConv2d) operator enables continuous cross-modal learning at every layer. The integration stream receives raw filtered signals from both modality streams and combines them before the nonlinear activation threshold, conceptually inspired by somatic integration preceding the neuronal firing decision. Implementing continuous integration exposes a critical initialization pathology: Kaiming initialization of the bridging weights scrambles gradients before they reach the stream backbones, producing a failure mode that resembles overfitting but is corrupted gradient flow. A 1/N constant initialization mitigates this. We employ progressive modality dropout, a curriculum adapted to continuous fusion in which blanking probability increases from zero, preventing pathway collapse, a form of negative co-learning, by forcing robust independent stream representations. Trained from scratch on SUN RGB-D 19-class scene classification, LINet reaches 45.2% mean class accuracy at ResNet18 scale, outperforming prior from-scratch results, and rises to 49.6% with in-domain RGB-D (ScanNet) pretraining.
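The progressive modality dropout curriculum can be sketched as a schedule plus a sampling step. The abstract only states that the blanking probability increases from zero; the linear ramp, the `ramp_epochs` and `p_max` parameters, and the rule of blanking at most one stream per sample are assumptions made for illustration.

```python
import random
from typing import Optional


def blank_probability(epoch: int, ramp_epochs: int, p_max: float) -> float:
    """Blanking probability that grows linearly from 0 to p_max (assumed ramp shape)."""
    if ramp_epochs <= 0:
        return p_max
    return p_max * min(1.0, max(0, epoch) / ramp_epochs)


def choose_blanked_stream(epoch: int, ramp_epochs: int, p_max: float,
                          rng: random.Random) -> Optional[str]:
    """Return "rgb" or "depth" to blank for this sample, or None to keep both.

    Only one modality is ever blanked, so the network always sees some input.
    """
    if rng.random() < blank_probability(epoch, ramp_epochs, p_max):
        return rng.choice(["rgb", "depth"])
    return None
```

Early epochs therefore train on both modalities together, and later epochs increasingly force each stream to stand on its own.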

126. 【2606.31127】SkillSpotter: Pose-Aware Multi-View Skilled Action Detection and Grading in Ego-Exo Videos

链接：https://arxiv.org/abs/2606.31127

作者：Björn Braun,Christian Holz

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Augmented Reality glasses, fixed camera setups, Augmented Reality, coaching using Augmented, Reality glasses

备注： Accepted for publication at European Conference on Computer Vision (ECCV)

点击查看摘要

Abstract:To enable personalized, real-time coaching using Augmented Reality glasses or fixed camera setups in domains such as sports, cooking, or music, a system must understand not just what a person does, but how well they execute an activity. In an ego-exo video setting, this requires simultaneously detecting individual skilled actions and classifying each as correct or needing improvement, which Ego-Exo4D's proficiency demonstration benchmark formalized. We first adapt seven state-of-the-art temporal action detection architectures to this task, extend the evaluation protocol to disentangle detection from grading, and show that existing methods grade near-randomly. We then introduce SkillSpotter, a pose-aware multi-view architecture that jointly detects and grades skilled actions through three task-specific modules: (1) adaptive temporal suppression to handle the varying density of skilled actions across diverse activities, (2) gated 3D body pose fusion to leverage body kinematics as a complementary signal to visual features, and (3) bidirectional cross-view attention to combine ego and exo views effectively. SkillSpotter improves class-specific mAP from 12.40 to 21.82 (+76%) and balanced accuracy from 55.99% to 60.40% over the best baseline. SkillSpotter's modules transfer to other temporal action detection models with consistent gains, and our method generalizes beyond Ego-Exo4D to HoloAssist. Code: this https URL
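The reported numbers can be checked with two small helpers. The relative gain reproduces the quoted +76%, and the standard definition of balanced accuracy (the mean of per-class recall) shows why a binary score near 50% counts as near-random grading. The function names are illustrative, not taken from the paper's code.

```python
def relative_gain(old: float, new: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the recall on the positive class and the recall on the negative class."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# mAP 12.40 -> 21.82 is a relative gain of about 76%.
gain_percent = round(100 * relative_gain(12.40, 21.82))

# A grader that always answers "correct" gets 50% balanced accuracy,
# no matter how imbalanced the two classes are.
always_positive = balanced_accuracy(tp=100, fn=0, tn=0, fp=50)
```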

127. 【2606.31125】WildProp: Visual Estimation of Wildlife Body Proportions at Scale

链接：https://arxiv.org/abs/2606.31125

作者：Mustafa Chasmai,Aaron Sun,Subhransu Maji

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：physical specimen handling, traditionally require controlled, require controlled imaging, measurements underpin ecological, limiting scalability

备注： Accepted to ECCV 26

点击查看摘要

Abstract:Population-level morphometric measurements underpin ecological and evolutionary studies but traditionally require controlled imaging or physical specimen handling, limiting scalability. We present WildProp, a training-free framework that estimates wildlife body proportion distributions directly from large-scale, unconstrained image repositories. We cast morphometric estimation as a retrieval-driven correspondence problem: given a single user-annotated canonical image, WildProp performs pose-aware retrieval using foundation model features, transfers part endpoints via dense patch-level matching, filters predictions using geometric consistency, and aggregates measurements across retrieved images to estimate population-level ratio distributions. Unlike supervised keypoint pipelines, our approach adapts to arbitrary species and user-defined parts without per-species training. Evaluations on three large morphometric datasets spanning birds and amphibians show median relative errors of 10-20%. We further highlight the broad applicability of our approach through a number of case studies measuring various proportions across diverse taxa, including birds, frogs, insects, and flowers. Ablations demonstrate that pose-aware retrieval is critical for stable estimation, while robust aggregation mitigates keypoint and pose noise. Our results indicate that carefully curated 2D correspondences over web-scale imagery can provide scalable morphometric proxies for comparative and subgroup analyses across taxa, geography, and seasonality.
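The final aggregation step can be sketched as follows, assuming each retrieved image has already produced transferred endpoints for two parts. Using the median as the robust aggregator is an assumption (the abstract says only "robust aggregation"), and all function names are hypothetical.

```python
import math
from statistics import median
from typing import Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]  # the two endpoints of one body part, in pixels


def length(seg: Segment) -> float:
    """Euclidean length of a part in image coordinates."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def proportion(part_a: Segment, part_b: Segment) -> float:
    """Scale-free ratio of two part lengths measured in the same image."""
    return length(part_a) / length(part_b)


def aggregate(ratios: Sequence[float]) -> float:
    """Median over retrieved images, so a few bad correspondences have little effect."""
    return median(ratios)


def relative_error(estimate: float, truth: float) -> float:
    """Absolute error as a fraction of the reference value."""
    return abs(estimate - truth) / abs(truth)
```

Because each ratio is computed within a single image, the unknown camera distance cancels, which is why proportions rather than absolute sizes can be measured from unconstrained photos.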

128. 【2606.31115】JacobianAvatar: Temporally Consistent Semi-rigid Avatar Reconstruction from a Monocular Video

链接：https://arxiv.org/abs/2606.31115

作者：Changyeon Won,Min-Gyu Park,Seonghwan Park,Ju Hong Yoon,Hae-Gon Jeon

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Generating realistic human, Generating realistic, realistic human avatars, clothing dynamics, requires modeling

备注：

点击查看摘要

Abstract:Generating realistic human avatars in complex motions, such as clothing dynamics, requires modeling both global and local deformations, which remains challenging in monocular settings. We address this problem by leveraging neural Jacobian fields (NJFs) for representing semi-rigid deformations. We train self-supervised neural networks for predicting Jacobian matrices that give the pose-dependent deformations, by solving a Poisson equation. However, monocular input presents several difficulties such as self-occluded regions and invisible surfaces. To address these issues, we introduce three key components: a constrained Poisson solver, signed distance-based Jacobian regularization, and a deformation-guided residual flow loss, which together suppress boundary artifacts, recover frequently occluded regions such as armpits and thighs, and enforce temporal consistency during motion. Experiments on benchmark and in-the-wild videos demonstrate that our method generates temporally stable and geometrically coherent avatars, outperforming state-of-the-art approaches.

129. 【2606.31109】InfiniVerse: Occupancy Guided Unbounded Scene Generation for Autonomous Driving

链接：https://arxiv.org/abs/2606.31109

作者：Xiaoyu Ye,Leheng Li,Xinyu Ji,Yingjie Cai,Hongda He,Xu Yan,Guanyi Zhao,Ying-Cong Chen,Bingbing Liu,Shuguang Cui,Zhen Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：autonomous driving community, temporally coherent urban, coherent urban environments, Generating realistic, driving community

备注：

点击查看摘要

Abstract:Generating realistic, controllable, and temporally coherent urban environments is a critical yet unresolved challenge in the autonomous driving community. In this paper, we introduce InfiniVerse, a unified pipeline for long-range, 2D-3D-aligned, and controllable synthesis of dynamic urban scenes from a single frame. In practice, our approach first reconstructs a 3D occupancy representation from the input multi-view frame. This representation serves as a foundation for autoregressive scene extension along arbitrary trajectories. Subsequently, a video diffusion model translates the coarse occupancy grid into realistic, spatiotemporally consistent video sequences. Moreover, we propose a hierarchical sketch-and-refine paradigm, in which the generated videos are re-projected as image-conditioned feedback to enhance the 3D occupancy representation, establishing cross-modal alignment and mutual enhancement between the visual and spatial domains. Extensive evaluations on the Waymo Open Dataset and nuScenes demonstrate that InfiniVerse achieves state-of-the-art performance, with a FID of 6.4 and FVD of 67.97, significantly outperforming existing benchmarks in both duration and stability.

130. 【2606.31100】TaxoMIL: Taxonomy-Constrained Learning for Hierarchical Whole Slide Image Analysis

链接：https://arxiv.org/abs/2606.31100

作者：Chaeyeon Lee,Khang Nguyen Quoc,Jinsol Song,Yosep Chong,Kwangil Yim,Jin Tae Kwak

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：multiple instance learning, slide image, analysis is central, computational pathology, instance learning

备注： Accepted at ECCV 2026

点击查看摘要

Abstract:Whole slide image (WSI) analysis is central to computational pathology, with multiple instance learning (MIL) emerging as the standard pipeline for slide-level diagnosis. However, conventional approaches formulate WSI diagnosis as a flat classification task over discrete labels, contradicting the inherently hierarchical, coarse-to-fine nature of clinical reasoning. Although recent hierarchical classifiers and vision-language models (VLMs) have sought to address this structural gap, they either fail to capture semantic continuity between related diagnoses or suffer from unconstrained text generation that produces taxonomic hallucinations and parent-child label violations. To address these limitations, we propose TaxoMIL, a taxonomy-constrained framework that reformulates WSI diagnosis as a multi-granularity text generation task. TaxoMIL utilizes a dual-head Transformer decoder to generate coarse- and fine-level diagnostic text, and introduces taxonomy-guided objectives that explicitly structure the label embedding space and strictly ground slide-level visual representations within the clinical taxonomy. Extensive experiments across three diverse WSI datasets demonstrate that TaxoMIL consistently outperforms state-of-the-art MIL classifiers and VLM-based generative methods, yielding accurate and hierarchy-aware diagnostic predictions. The code is released at this https URL
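The parent-child label violations mentioned in the abstract can be made concrete with a small consistency check. This is a post-hoc illustration only: the paper describes taxonomy-guided training objectives, not this filter, and the taxonomy entries and function names below are hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical two-level taxonomy: fine-grained diagnosis -> coarse parent.
PARENT_OF: Dict[str, str] = {
    "adenocarcinoma": "carcinoma",
    "squamous cell carcinoma": "carcinoma",
    "adenoma": "benign",
}


def is_consistent(coarse: str, fine: str, parent_of: Mapping[str, str]) -> bool:
    """True when the fine label is a child of the coarse label."""
    return parent_of.get(fine) == coarse


def constrained_fine(coarse: str, fine_scores: Mapping[str, float],
                     parent_of: Mapping[str, str]) -> str:
    """Pick the best-scoring fine label among the children of the coarse label."""
    allowed = {f: s for f, s in fine_scores.items() if parent_of.get(f) == coarse}
    if not allowed:
        raise ValueError(f"no fine label under coarse label {coarse!r}")
    return max(allowed, key=lambda f: allowed[f])
```

A flat classifier has no such check, so it can pair a coarse label with a fine label from a different branch; restricting the fine choice to the children of the coarse label rules that out by construction.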

131. 【2606.31099】Seeing Through Multiple Views: Parameter-Efficient Fine-Tuning via Selective Neurons for Consistent Radiology Report Generation

链接：https://arxiv.org/abs/2606.31099

作者：Yucheng Chen,Jinjing Zhu,Yang Yu,Yufei Shi,Hane Naghshbandi,Jinhua Liu,Angela S. Koh,Fang Fen,Kian Eng Ong,Si Yong Yeo

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：predominantly adopt direct, adopt direct feature, direct feature fusion, Recent years, multi-view X-ray images

备注： Accepted by MICCAI2026

点击查看摘要

Abstract:Recent years have seen substantial advances in radiology report generation (RRG), yet existing approaches predominantly adopt direct feature fusion when handling multi-view X-ray images. Such approaches overlook the potential clinical inconsistencies and inaccuracies arising when a single model processes different views, adversely impacting performance and clinical reliability. To this end, we introduce View-PNDF (View-specific Pattern Neuron Detection and Fine-tuning), a parameter-efficient framework that fosters view-consistent report generation from a neuronal perspective. Specifically, View-PNDF comprises: (i) a view-specific neuron detection module identifying neurons responsive to particular views, (ii) a verification module quantifying the existence of these neurons, and (iii) a selective fine-tuning strategy strengthening detected neurons while preserving view-agnostic representations. By updating only view-specific neurons, View-PNDF achieves consistent diagnoses across different views with reduced computational costs. Subsequently, we employ Large Language Models (LLMs) to consolidate the view-specific reports into a complete radiology report. Furthermore, we use traditional Natural Language Generation (NLG) metrics-based assessment on integrated reports for baseline comparison and employ LLM-based assessment (e.g., GPT-4o) on view-specific reports to capture clinical significance. Extensive experiments on two medical RRG benchmarks demonstrate that View-PNDF substantially improves view-specific chest X-ray report generation quality while maintaining robust general-view performance.

132. 【2606.31098】PiLoT v2: Pixel-to-Orthogonal Map Alignment for Free-view UAV Geo-localization

链接：https://arxiv.org/abs/2606.31098

作者：Xinyi Liu,Xiaoya Cheng,Rouwan Wu,Zhaochen Wang,Shen Yan,Maojun Zhang,Yu Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：drift-free UAV geo-localization, GNSS-denied environments, essential for autonomous, autonomous missions, missions in GNSS-denied

备注：

点击查看摘要

Abstract:Real-time, drift-free UAV geo-localization is essential for autonomous missions in GNSS-denied environments. The pioneering system, PiLoT, achieves high precision via Neural Pixel-to-3D Registration, aligning UAV video streams with a single rendered reference view from 3D meshes. However, its reliance on heavy 3D meshes incurs massive storage overheads, complex map acquisition, and significant computational rendering costs, severely hindering deployment on embedded platforms. To address these bottlenecks, we propose PiLoT v2, a lightweight yet robust evolution that shifts the paradigm to direct pixel-to-orthogonal map registration for free-view UAV geo-localization. By leveraging True Digital Orthophoto Maps (TDOMs) and Digital Surface Models (DSMs) as the reference substrate, PiLoT v2 replaces GPU-intensive 3D rendering with a highly efficient, CPU-friendly map cropping operation. To bridge the severe geometric discrepancy between these 2.5D orthogonal crops and free-view oblique UAV imagery, we train a cross-view feature registration network using a novel, large-scale geometrically annotated dataset. Furthermore, we integrate onboard sensor priors (specifically gravity direction and single-point laser range) directly into the pose optimization manifold to enhance robustness against cross-view visual degradation. Experimental results demonstrate that PiLoT v2 achieves performance comparable to, or even exceeding, its Pixel-to-3D predecessor, while offering drastically lower storage and computational costs.

133. 【2606.31096】Horizon3D: Sparse Radar-Camera Fusion for Long-Range 3D Perception in Autonomous Driving

链接：https://arxiv.org/abs/2606.31096

作者：Geonho Bang,Geunju Baek,Dongyoung Lee,Wonjun Jeong,Jun Won Choi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：safe autonomous driving, extended ranges, methods remain limited, critical for safe, safe autonomous

备注： Accepted to ECCV 2026. Project page: [this https URL](https://geonhobang.github.io/horizon3d-project-page) . Code: [this https URL](https://github.com/geonhobang/ECCV2026_Horizon3D)

点击查看摘要

Abstract:Long-range 3D object detection is critical for safe autonomous driving at highway speeds, yet existing radar-camera fusion methods remain limited at extended ranges. BEV-based methods capture scene-level context but incur rapidly growing computation and often lose fine-grained object detail, while query-based methods are efficient but provide limited scene-level context. Temporal fusion further requires both multi-frame accumulation for sparse distant observations and object-level motion modeling for fast-moving objects. We propose Horizon3D, a sparse radar-camera fusion framework for long-range 3D object detection that combines Gaussian primitives with sparse BEV features. Horizon3D initializes Gaussian primitives at radar- and camera-estimated object keypoints using Keypoint-Guided Gaussian Initialization, refines them through Object-Centric Sparse Fusion, and splats them onto the BEV plane to fuse object-level detail with sparse radar BEV context. It further introduces Dual-Path Temporal Fusion, which aggregates temporal cues through a BEV path for scene-level accumulation and a Gaussian path for object-level motion propagation. Experiments on TruckScenes show that Horizon3D achieves state-of-the-art radar-camera 3D detection performance. On the validation set, it outperforms the previous best method by +3.0 NDS and +1.6 mAP while maintaining competitive inference speed.

134. 【2606.31095】Do Not Break the Vessels: Structure-Preserving Mean Flow for Vascular Image Translation

链接：https://arxiv.org/abs/2606.31095

作者：Changjin Sun,Zhuo Hu,Kaini Wang,Baixuan Wu,Shuo Gao,Runan Zheng,Cheng Xue,Yudong Zhang,Guangquan Zhou

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Reconstructing anatomically faithful, clinically accessible imaging, accessible imaging modalities, substantial clinical significance, Reconstructing anatomically

备注：

点击查看摘要

Abstract:Reconstructing anatomically faithful vascular structures from clinically accessible imaging modalities is of substantial clinical significance. However, existing cross-modal translation methods mainly emphasize pixel-level fidelity or visual realism and treat structure preservation as a property of the final output rather than an invariant of the generative process. This limitation often leads to structural discontinuities and artifacts, compromising anatomical coherence and clinical reliability. In this work, we propose a Structure-Preserving Mean Flow (SPMF) framework that formulates vascular image translation as a topology-invariant transport process. Based on a structural invariance principle, we derive an orthogonality constraint on the flow velocity field that formally separates appearance transport from topological distortion. We implement this constraint as a time-weighted surrogate objective within a Brownian bridge diffusion model to preserve topology at every diffusion step. Moreover, we propose a Prototype-Guided Structural Refinement (PGSR) module to align degraded inference-time structures with reliable training-time structures. Experiments on paired NIRII-to-2PF and fundus datasets demonstrate consistent improvements over state-of-the-art methods, achieving peak PSNR values of 24.96 dB and 24.83 dB, respectively.

135. 【2606.31089】Anchoring on Reality: Breaking the Pseudo-Target Ceiling in Makeup Transfer

链接：https://arxiv.org/abs/2606.31089

作者：Bo Wei,Xianhui Lin,Yi Dong,Zhongzhong Li,Zonghui Li,Zirui Wang,Jiachen Yang,Xing Liu,Hong Gu,Xiaoming Li,Wangmeng Zuo

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：source face, face while preserving, Makeup transfer applies, transfer applies, Makeup

备注： Accepted by ECCV 2026

点击查看摘要

Abstract:Makeup transfer applies a reference cosmetic style to a source face while preserving its identity and geometry. However, this task is severely hindered by the lack of real paired training data. Current methods rely on either weak priors or synthetic pseudo-targets from large-scale editing models. These paradigms provide suboptimal guidance, often leading to degraded fine-grained details, synthetic artifacts, and identity drift. To this end, we propose Anchoring on Reality Makeup Transfer (ART), a two-stage framework with a reality-anchored refinement cycle. In Stage I, the model is initialized with pseudo-targets to establish basic semantic alignment and global makeup placement. Crucially, Stage II shifts supervision from pseudo-targets to the real reference, reconstructing it from its bare-skin counterpart through a differentiable cycle that penalizes any omitted detail and overrides synthetic artifacts. Furthermore, we introduce MakeupFaces2K (MF2K), the first 2K-resolution in-the-wild makeup portrait dataset comprising 8,573 images. Extensive experiments demonstrate that our method achieves superior makeup fidelity, strong background stability, and robust identity preservation, especially for complex makeup styles.

136. 【2606.31088】Towards Flexible, Natural, Efficient Interaction for Conversational Talking Face Generation

Link: https://arxiv.org/abs/2606.31088

Authors: Baiqin Wang, Sen Chen, Jiankuo Zhao, Xiangyu Liu, Zhen Lei, Xiangyu Zhu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted increasing attention, recently attracted increasing, increasing attention, aiming to synthesize, characters speak

Comments: 17 pages, 8 figures

Abstract:Conversational talking face generation has recently attracted increasing attention, aiming to synthesize interactive talking videos where characters speak, listen, and respond dynamically to each other. This task presents three core challenges: 1) Flexibility: enabling multi-round dialogues with an arbitrary number of participants; 2) Naturalness: maintaining coherent motion and appropriate non-verbal feedback throughout the interaction; and 3) Efficiency: achieving real-time generation and low computation overhead for long-term continuous online conversation. Despite recent advances, existing methods still fall short in balancing all three requirements. To bridge this gap, we introduce InterTalk, a novel and efficient framework designed for highly interactive conversational talking face generation. Built upon a motion-based architecture, InterTalk supports real-time conversation synthesis. Our method achieves strong flexibility by explicitly modeling multi-round conversational dynamics among each participant, eliminating constraints on their numbers. To enhance interactivity, we incorporate motion feedback from multiple participants and introduce an iterative generation strategy for more natural behaviors. Besides, we disentangle motion into several facial components, enabling targeted refinements for natural response such as precise lip sync and realistic eye blinking. Finally, we construct a new multi-person conversational dataset and enrich it with 3D face-based data augmentation. Extensive experiments demonstrate that InterTalk achieves superior interaction quality while maintaining real-time performance at 30 FPS.

137. 【2606.31086】CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction

Link: https://arxiv.org/abs/2606.31086

Authors: Yuzhou Ji, Xiaotian Yang, Zhipeng Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-horizon task execution, task execution, rise of home-deployed, home-deployed embodied, embodied AI systems

Comments: Accepted to ECCV 2026

Abstract:The rise of home-deployed embodied AI systems is driving a growing need for fast, metric 3D reconstruction of residential spaces to support navigation, interaction, and long-horizon task execution. However, the commonly used pinhole-camera 3D reconstruction pipelines struggle to model large indoor residences efficiently due to their limited field of view: achieving full coverage across multiple rooms often requires thousands of images and incurs drift from long chains of incremental alignment. In this work, we present CasaMaestro (Spanish words meaning "house" and "master"), a feedforward model that takes only twenty to fifty sparse multi-view indoor panoramas as input and directly predicts metric depth along with camera poses, allowing fast point-cloud reconstruction of the entire house with full coverage. CasaMaestro is the first model that supports house-scale reconstruction with multi-view panoramas. Experiments show that CasaMaestro can robustly provide high-quality results in both real-world and synthetic scenes, which can serve as a strong foundation for acquiring house-scale 3D indoor assets to be applied in closed-loop simulation.

138. 【2606.31082】Fleet: Few Shots Lead Effective AI-generated Image Detection

Link: https://arxiv.org/abs/2606.31082

Authors: Jiaan Wang, Sirui Liu, Yu Li, Kaiyuan Yang, Juan Cao, Sheng Tang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: open-world adversarial defense, Nano Banana Pro, detection is undergoing, adversarial defense, undergoing a critical

Comments: 25 pages, accepted by ICML 2026

Abstract:AI-generated image (AIGI) detection is undergoing a critical transition from laboratory benchmarks to open-world adversarial defense. The prevalent paradigm focuses on finding static feature spaces, assuming that some invariant artifacts learned from historical data can achieve universal zero-shot generalization. While achieving saturation on several AIGI benchmarks, this static hypothesis suffers a severe performance drop against rapidly evolving generators (e.g., SD3, Nano Banana Pro). To address these limitations, we propose that the field should expand beyond "static generalization" to a new paradigm of "dynamic adaptation". We introduce Fleet, a framework that pioneers a dynamic paradigm of continuous few-shot evolution, enabling rapid alignment with emerging generative threats. Fleet improves few-shot adaptation by replacing unconstrained feature updates with constrained routing correction, where avoidance routing redirects novel AI samples away from Non-AI-dominated routes within decoupled subspaces. To validate this, we present Treasure, a benchmark spanning 64 models and 360k images, featuring diverse architectures and 20 closed-source commercial engines. Experiments reveal that while static SOTA methods fail catastrophically on modern generators, Fleet restores performance from 20.4% to 73.1% with only 10-shot adaptation on "Doubao Seedream 4.0". Code and data are available at this https URL .

139. 【2606.31077】AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

Link: https://arxiv.org/abs/2606.31077

Authors: Meng Yang, Zizhuo Li, Linfeng Tang, Fan Fan, Jiayi Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-sensor fusion, essential for visual, visual localization, localization and multi-sensor, Multi-modal

Abstract:Multi-modal image matching is essential for visual localization and multi-sensor fusion, but it is hindered by the scarcity of large-scale training data with precise geometric annotations. Existing real-world datasets suffer from prohibitive costs, limited scene diversity, and errors in SfM-MVS pipelines, while synthetic methods struggle to maintain 3D geometric consistency or achieve photorealistic appearance. To address this, we propose AnyMatch, a novel framework that leverages abundant, easily accessible single-view images at minimal cost to generate rich multi-modal training data. AnyMatch integrates monocular depth estimation, 3D reprojection, diffusion-based inpainting, and crossmodal image translation to synthesize multi-view, multi-modal image pairs with 3D geometric fidelity. Crucially, our method provides annotations that strictly adhere to 3D geometric consistency through explicit 3D reprojection, avoiding SfM-MVS error accumulation. Furthermore, AnyMatch offers strong scalability, enabling controllable scene diversity and annotation difficulty via adjustable input and camera parameters. We construct Any-syn, a large-scale synthetic multi-modal dataset using AnyMatch. Experimental results show that matching networks (e.g., LoFTR, EDM, RoMa) fine-tuned on Any-syn achieve substantial performance gains on multi-modal benchmarks, exhibiting superior generalization and robustness compared to models trained on existing data.

140. 【2606.31071】Hierarchical 3D Scene Graph Construction and Belief-based Planning for Semantic Navigation

Link: https://arxiv.org/abs/2606.31071

Authors: Bing Wu, Zuyao Chen, Changwen Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: embodied agents operating, operating in unseen, embodied agents, agents operating, fundamental task

Comments: Camera-ready version accepted at ECCV 2026

Abstract:Semantic navigation is a fundamental task for embodied agents operating in unseen environments, requiring both semantic understanding and long-term decision-making. Recent foundation models have empowered agents with rich semantic priors for this task. However, without structured global representations, decision-making often falls back on local observations and greedy strategies, resulting in inefficient exploration and myopic behaviors, especially in long-distance navigation. To address these challenges, we propose a zero-shot semantic navigation framework. Our method incrementally maintains an online Hierarchical 3D Scene Graph (HSG) to form a multi-granular semantic topology over objects, zones, and regions, serving as a compact state abstraction for global planning. Building on this memory, we introduce a hierarchical belief-based planning framework that fuses semantic priors with exploration evidence on the HSG, and performs finite-horizon rollouts on an HSG-based simulator to explicitly estimate the long-term expected returns of candidate macro-actions. This enables globally consistent decisions and reduces redundant backtracking. Extensive experiments in high-fidelity simulation environments across multiple tasks and datasets demonstrate that our method outperforms existing state-of-the-art methods, particularly in long-distance scenarios, where our approach improves SR and SPL by an average of 9.4% and 5.0%, respectively.

141. 【2606.31068】Hybrid Unet-Transformer Model for Generating Stress and Strain Fields from Composite Geometrics

Link: https://arxiv.org/abs/2606.31068

Authors: Shrey Patel

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: physics-informed material design, finite element method, conventional finite element, hierarchical composite microstructures, Accurate prediction

Comments: International Conference on Emerging Digital Intelligence and Generative Engineering

Abstract:Accurate prediction of stress and strain fields in hierarchical composite microstructures is critical for physics-informed material design, yet conventional finite element method (FEM) simulations are computationally prohibitive at scale, requiring minutes to days per evaluation. In this work, we propose a hybrid UNet-Transformer architecture that predicts complex mechanical field distributions directly from composite microstructure geometry images, serving as an efficient surrogate for FEM across ten distinct stress and strain field types spanning diverse two-phase composite configurations including square, hexagonal, and triangular tessellations, multiple boundary conditions, and high-resolution geometries. Results demonstrate that the proposed architecture achieves strong predictive performance across the majority of subdatasets, with peak accuracy on periodic tessellation geometries reaching $R^2$=0.9991, SSIM=0.9936, and MAE=0.0050 on the boundary condition subdataset and the triangular tessellation subdataset respectively. Across six of the eight evaluated subdatasets, MAE remains below 0.05 on the normalized [0,1] pixel scale. Encoder attention analysis via Grad-CAM and Grad-CAM++ confirms that the model develops physically meaningful internal representations, localizing attention at mechanically critical regions including phase boundaries, ligament junctions, and indenter contact zones without explicit structural supervision. Performance degrades on irregular square-grid geometries with sparse soft-phase inclusions, with the S11 normal stress subdataset yielding $R^2$=0.7735 and SSIM=0.7126, consistent with the known limitation of smooth-loss image translation models in reproducing sharp stress discontinuities.

142. 【2606.31065】Diffusion-Based Material Regularization for Physics-Based Inverse Rendering

Link: https://arxiv.org/abs/2606.31065

Authors: Jingwang Ling, Lifan Wu, Feng Xu, Shuang Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing physics-based, graphics and vision, core problem, problem in computer, computer graphics

Comments: Accepted to ECCV 2026. Includes supplementary material. Project page: [this https URL](https://gerwang.github.io/diffusion-regularized-inverse-rendering/)

Abstract:Reconstructing physics-based 3D assets -- geometry, materials, and illumination -- from multi-view images is a core problem in computer graphics and vision, and a prerequisite for realistic relighting and editing. Physics-based inverse rendering offers an accurate image-formation model, but is severely underconstrained: without strong priors, illumination is baked into materials, and reconstructions generalize poorly to novel views and lighting. Data-driven diffusion models, in contrast, predict visually plausible materials, yet their predictions rarely satisfy the rendering equation and are not directly usable for physics-based rendering. We bridge these two paradigms rather than replacing either. Our key idea is to treat the predictions of a state-of-the-art diffusion model not as target material values but as a similarity kernel for optimization: we introduce a regularization loss that penalizes deviations in the optimized material over surface regions where the diffusion predictions are near-constant, while leaving the optimization free to match the input images. Built on this regularizer, our end-to-end pipeline jointly reconstructs geometry, materials, and illumination, yielding high-quality assets that drop into standard rendering pipelines and relight faithfully. On the Synthetic4Relight, Stanford-ORB, and DTC-Synthetic datasets, our method significantly outperforms state-of-the-art baselines in both reconstruction accuracy and relighting quality.

143. 【2606.31061】Online TT-ALS for Streaming Tensor Decomposition with Incremental Orthogonalization

Link: https://arxiv.org/abs/2606.31061

Authors: Hiroki Takeda, Yuto Miyatake, Daisuke Furihata

Subjects: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: analyzing high-dimensional data, Tensor Train, analyzing high-dimensional, high-dimensional data, Train

Comments: 19 pages, 7 figures. The Julia source code is available at [this https URL](https://github.com/hirokin0919/Online-TT-ALS)

Abstract:Tensor Train (TT) decomposition is a powerful technique for analyzing high-dimensional data. Existing algorithms for computing TT decompositions can be categorized into two main types: conventional batch-based approaches and recursive online methods. In the context of streaming data, batch methods typically achieve higher reconstruction accuracy but often suffer from memory exhaustion, while online methods provide greater computational efficiency. In this work, we introduce Online TT-ALS (Alternating Least Squares), an algorithm that sequentially enforces orthogonality constraints. This approach allows for efficient and exact updates of the core tensor while maintaining high reconstruction accuracy. Theoretically, we prove that enforcing these orthogonal gauge constraints guarantees monotonic decrease of the local objective function and temporal smoothness. Computationally, our deterministic single-sweep update reduces the rank dependence from quadratic to linear, achieving an overall complexity of $\mathcal{O}(I^{n-1} r)$. Experimental results demonstrate that the proposed method outperforms existing online techniques not only in terms of mathematical approximation accuracy but also in human perception-based video quality metrics. Furthermore, compared to recent deep learning-based paradigms, our algebraic approach achieves speedups of several orders of magnitude. Consequently, our method exhibits high computational efficiency and is suitable for low-latency real-time processing applications.

144. 【2606.31054】ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

Link: https://arxiv.org/abs/2606.31054

Authors: Zhiyuan Yao, Zheren Fu, Zhixiao Zheng, Jiajun Li, Yi Tu, Zhendong Mao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Multimodal Large Language, Large Language Models, Large Language, generating content inconsistent, Multimodal Large

Comments: Accepted by ECCV 2026

145. 【2606.31050】Learning Video Dynamics with Predictive Differentiable Rendering

Link: https://arxiv.org/abs/2606.31050

Authors: Yujin Tang, Tian Zhou, Xin Lin, Cheng Tan, Yifan Hu, Rong Jin, SouYoung Jin, Liang Sun

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: high-fidelity future world, accurately predict, predict a high-fidelity, high-fidelity future, future world

Comments: Accepted by ECCV 2026. 18 pages, 5 figures, 11 tables

Abstract:How to accurately predict a high-fidelity future world? While the visual world is inherently continuous, existing deterministic video prediction models operate in discrete pixel space and are mainly optimized with pixel-wise mean squared error (MSE), which often leads to over-smoothed predictions and a lack of fine-grained visual details. To address these limitations, we propose Predictive Differentiable Rendering (PDR), a novel end-to-end video prediction paradigm that bridges the gap between discrete and continuous representations. Inspired by recent progress in 3D reconstruction with 3D Gaussian Splatting, we introduce PredGS, a lightweight and plug-and-play adapter based on 2D Gaussian representation, which could be seamlessly integrated with existing pixel space predictors, significantly improving spatial detail preservation with negligible computational overhead. Furthermore, we develop predgsplat, a CUDA-accelerated differentiable 2D Gaussian renderer supporting arbitrary channels. Each Gaussian is defined by 5 + C learnable parameters (position, scale, rotation, and C channel amplitudes) and achieves up to 10x faster rendering than the baseline. Optimized by a combined L1 and SSIM loss, PDR overcomes the inherent blurring tendencies of MSE Loss, significantly enhancing the prediction performance. Extensive experiments on diverse real-world benchmarks, including TaxiBJ, WeatherBench, KTH, and Human3.6M, demonstrate that PDR consistently surpasses existing methods, delivering superior detail preservation, visual fidelity, and predictive accuracy.
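
The "5 + C parameters per Gaussian" description above can be made concrete with a small sketch. This is not the authors' predgsplat renderer. It is a plain-Python illustration that assumes the five geometric parameters split into a 2D position, two axis scales, and one rotation angle, and that Gaussians are combined by simple addition (the abstract does not state the blending rule). The function name `render_gaussians` and the tuple layout are hypothetical.

```python
import math


def render_gaussians(gaussians, height, width):
    """Render 2D Gaussians into a height x width x C image (nested lists).

    Each Gaussian is a tuple (x, y, sx, sy, theta, amplitudes):
    5 geometric values plus C channel amplitudes, i.e. 5 + C numbers.
    Assumption: contributions are summed; no alpha compositing is modelled.
    """
    channels = len(gaussians[0][5])
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for x0, y0, sx, sy, theta, amplitudes in gaussians:
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for row in range(height):
            for col in range(width):
                dx, dy = col - x0, row - y0
                # Rotate the pixel offset into the Gaussian's own axes.
                u = cos_t * dx + sin_t * dy
                v = -sin_t * dx + cos_t * dy
                weight = math.exp(-0.5 * ((u / sx) ** 2 + (v / sy) ** 2))
                for c in range(channels):
                    image[row][col][c] += amplitudes[c] * weight
    return image


# One isotropic Gaussian centred on a 5x5 canvas with C = 2 channels.
image = render_gaussians([(2.0, 2.0, 1.0, 1.0, 0.0, (1.0, 0.5))], 5, 5)
print(image[2][2])
```

Because every operation here is a smooth function of the parameters, the same computation written in an autodiff framework would be differentiable with respect to position, scale, rotation, and amplitudes, which is the property a differentiable renderer relies on.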

146. 【2606.31029】TerraDiT-$\Omega$: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive

Link: https://arxiv.org/abs/2606.31029

Authors: Brian Wei, Srikumar Sastry, Daniel Cher, Eric Xing, Nathan Jacobs

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, imagery remains challenging, satellite imagery remains, remarkable progress, remains challenging

Comments: European Conference on Computer Vision 2026

Abstract:Generative models have achieved remarkable progress, yet applying them to satellite imagery remains challenging. Unlike natural imagery, satellite scenes are structured by spatially complex and semantically distinct geometries. Prior work addresses this complexity by adapting natural image frameworks using dense rasters or sparse prompts, trading off annotation cost and fidelity while breaking compatibility with vector primitives commonly used to represent geographic information. We introduce TerraDiT-$\Omega$, a unified spatial control framework that generates satellite imagery directly from any native geospatial primitive. By jointly leveraging precise annotations (polygons, polylines) and coarser ones (bounding boxes, points), the model supports controllable layouts across varying annotation budgets, broadening applicability to design tasks such as urban planning while remaining naturally compatible with end-to-end GeoAI workflows. To effectively leverage these primitives during generation, we propose Geometry-Aware Local Attention, a conditioning mechanism that injects explicit geometric cues into the attention space. Across all conditioning formats, our approach consistently outperforms both dense-control and sparse-control baselines. Furthermore, this flexibility enables controllable synthetic data augmentation using a single generative model, improving downstream performance on land-cover segmentation, object detection, road graph extraction, and scene classification. Code, data, and weights are available at this https URL.

147. 【2606.31018】WarpI2I: Image Warping for Image-to-Image Translation

Link: https://arxiv.org/abs/2606.31018

Authors: Shen Zheng, Anurag Ghosh, Gaurav Parmar, Srinivasa Narasimhan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved strong results, achieved strong, strong results, results in tasks, latent diffusion models

Comments: ECCV 2026

Abstract:Image-to-image (I2I) translation has achieved strong results in tasks like human relighting and driving scene translation using latent diffusion models (LDMs). However, compact LDMs often struggle to preserve fine-grained structures because the encoder compresses high-resolution inputs into a spatially downsampled latent space. To address this issue, we propose a simple saliency-guided warp-unwarp framework that reallocates spatial representation toward salient regions before encoding, enabling better preservation of structural details without increasing latent resolution. The warped image is processed by the original diffusion model and then mapped back via an inverse warp. In addition, we propose a simple and efficient outpainting-based synthetic data generation pipeline to produce high-quality paired data for image relighting. Our method is model-agnostic, requires no architectural modification, and introduces negligible computational overhead. Experiments on human relighting, driving scene relighting, and translation demonstrate improved structural preservation, lighting faithfulness, and image quality, with our framework extending naturally to video via frame-by-frame application with good temporal stability. Project Webpage: this https URL

148. 【2606.31015】Dual Sparse Aggregation Transformer for Multispectral Object Detection

Link: https://arxiv.org/abs/2606.31015

Authors: Wencong Wu, Xiuwei Zhang, Hanlin Yin, Hongxi Zhang, Yanning Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: model long-range dependencies, detection tasks due, obtained excellent performance, Dual Sparse Transformer, Sparse Aggregation Transformer

Abstract:Transformer-based approaches have obtained excellent performance in multispectral object detection tasks due to their ability to model long-range dependencies and capture complementary information. However, previous transformer-based multispectral detection methods tend to use all available tokens for similarity calculation, which results in redundant information interaction from irrelevant areas, leading to degraded detection performance. To overcome this challenge, we propose a novel Dual Sparse Aggregation Transformer (DSAFormer) for multispectral object detection, which consists of a Dual Sparse Transformer (DSFormer) and a Learnable Addition Fusion Block (LAFB). Specifically, the DSFormer is designed to exploit and boost cross-modal complementary information, thereby improving detection performance. It incorporates three key components: A Spatial Sparse Multi-Head Cross-Attention (SSMHCA) mechanism selectively captures cross-modal relationships at the spatial level by reserving only the high query-key similarity scores, eliminating irrelevant interactions. A Channel Sparse Multi-Head Cross-Attention (CSMHCA) mechanism performs similar sparse calculations at the channel level to enhance feature representation and filter out low matching query-key. A Multi-Scale Feature Refinement Layer (MSFRL) is developed to aggregate hierarchical features and suppress redundant information. To effectively fuse multimodal features, the LAFB is introduced to aggregate intramodal and intermodal feature information by feature reweighting. Extensive experimental results have demonstrated that our proposed DSAFormer achieves better detection performance against state-of-the-art methods on four public datasets, including the MFAD, FLIR, M$^3$FD, and LLVIP. The source code of our DSAFormer will be released at this https URL.
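
The idea of "reserving only the high query-key similarity scores" can be illustrated with a generic sparse cross-attention sketch. It is not the DSAFormer implementation: the abstract does not say how the high scores are selected, so this example assumes a per-query top-k rule, uses a single head, and omits learned projections. The function name `topk_sparse_cross_attention` is hypothetical.

```python
import math


def topk_sparse_cross_attention(queries, keys, values, k):
    """Cross-attention where each query attends only to its k best-matching keys.

    queries: list of d-dimensional vectors from one modality.
    keys, values: lists of vectors from the other modality.
    Returns (outputs, weights); weights[i][j] is 0.0 for every dropped key j.
    Assumption: top-k selection per query stands in for the paper's sparsity rule.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    dim = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        # Indices of the k highest-scoring keys; the rest are discarded.
        kept = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax over the kept scores only (shifted by the max for stability).
        top = max(scores[j] for j in kept)
        exps = {j: math.exp(scores[j] - top) for j in kept}
        total = sum(exps.values())
        weights = [exps.get(j, 0.0) / total for j in range(len(keys))]
        # Weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
outputs, weights = topk_sparse_cross_attention(queries, keys, values, k=2)
print(weights[0])  # the third key, which points away from the query, gets weight 0.0
```

With `k` equal to the number of keys this reduces to ordinary dense attention, so `k` controls how much low-similarity interaction is filtered out.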

149. 【2606.31007】Dense Structural Priors for Sparse Functional Landmark Localization in Surgical Videos

Link: https://arxiv.org/abs/2606.31007

Authors: Chenyan Jing, Hao Ding, Lalithkumar Seenivasan, Jacob M. Delgado López, Mathias Unberath

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: provide transferable object-level, transferable object-level structure, transferable object-level, segmentation outputs, explicitly encode

Abstract:Vision foundation models such as SAM 3 can provide transferable object-level structure across diverse surgical video conditions, but segmentation outputs do not explicitly encode the action-conditioned semantics that define functional surgical landmarks. Estimating instrument extent and geometry differs from localizing the tip or anchor relevant to clipping, grasping, or dissecting. We investigate vision foundation model-enabled sparse action-aware landmark localization, using zero-shot, point-prompted structural masks to provide dense instrument-level context without manual pixel-level mask annotations. We propose a lightweight refinement framework that uses SAM 3 as a structural prior. A coarse multi-frame network predicts tip and anchor prompts, generating non-oracle masks that are fused with visual and heatmap features to refine functional landmark predictions. We compare direct mask-augmented supervision, prediction-derived mask-prior refinement, and auxiliary mask supervision to examine how vision foundation model-derived structure should enter a precision-oriented localization system. Experiments on 7,867 clips from 60 surgical videos spanning YouTube, Cholec80, HeiChole, SurgVU, and CRCD evaluate the approach under heterogeneous conditions. Without manual pixel-level mask annotations for training, the proposed model achieves overall F1 scores of 72.4% for tip and 58.0% for anchor localization. Directly imposing masks on heatmap targets biases learning toward broad tool regions, whereas prediction-derived priors and auxiliary supervision provide effective intermediate structural guidance for action-dependent landmark prediction.

150. 【2606.31004】Auditing Generalization in AI-Generated Video Detection: A Six-Control Protocol and the VidAudit Toolkit

Link: https://arxiv.org/abs/2606.31004

Authors: Mert Onur Cakiroglu, Zhihe Lu, Mehmet Dalkilic, Hasan Kurban

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: AI-generated video detection, inflate reported generalization, video detection benchmarks, leave uncontrolled confounds, protocols leave uncontrolled

Abstract:AI-generated video detection benchmarks such as GenVidBench and AIGVDBench are the de facto leaderboards, yet most evaluation protocols leave uncontrolled confounds that can inflate reported generalization. As an existence proof, a three-feature clip-length classifier reaches a leave-one-generator-out (LOGO) AUC of 0.998 on GenVidBench under unaudited evaluation, while measuring nothing about motion. A 20-paper survey finds none applying all six standard controls that would catch this, so we combine them into an audited protocol and apply it to six representative feature sources (three published detectors and three repurposed signal sources), re-running it cross-dataset on AIGVDBench. The audit both debunks and certifies: the trivial classifier collapses to near chance (0.529), a CLIP baseline is caught carrying dataset identity, and the 2025 forensic detector WaveRep clears the floor at out-of-distribution LOGO AUC 0.996 with chance-level real-vs-real coherence. At a deployable FPR of 0.1%, multiple high-AUC methods fall to single-digit recall and the leaderboard order changes, so we recommend an audited tuple (AUC, above-floor margin, operating-point recall, and calibration) over a single number. As a white-box positive control, we add TemporalSpec (codec motion vectors); via cross-substrate feature fusion (XSFF), a second substrate adds genuine complementarity that survives the audit. We release VidAudit, to our knowledge the largest unified and audited detector collection for this task, providing 14 detectors behind one plugin API, a leaderboard, and Croissant metadata, available at this https URL. Together, the protocol and toolkit move evaluation from leaderboard rank toward whether a result measures what it claims.
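
The operating-point recall mentioned above (recall at a fixed false-positive rate) is easy to compute from raw detector scores. The sketch below is a generic illustration and not part of the VidAudit toolkit. It assumes that higher scores mean "more likely AI-generated" and that a clip is flagged when its score is strictly above the threshold. The function name `recall_at_fpr` is hypothetical.

```python
import math


def recall_at_fpr(real_scores, fake_scores, target_fpr):
    """Recall on generated clips at a threshold whose FPR on real clips is <= target_fpr.

    Returns (recall, threshold). A clip is flagged when score > threshold.
    """
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one generated score")
    # Number of real clips that may be flagged by mistake; the small epsilon
    # guards against float round-off in the product.
    allowed = math.floor(target_fpr * len(real_scores) + 1e-9)
    ranked = sorted(real_scores, reverse=True)
    # At most `allowed` real scores are strictly greater than ranked[allowed].
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(1 for score in fake_scores if score > threshold)
    return hits / len(fake_scores), threshold


real = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
fake = [0.92, 0.97, 0.5, 0.99]
print(recall_at_fpr(real, fake, 0.1))    # one false positive allowed out of ten
print(recall_at_fpr(real, fake, 0.001))  # with only ten real clips, none allowed
```

The second call shows why strict operating points are demanding: with few real clips, an FPR budget of 0.1% permits no false positives at all, so the threshold rises to the highest real score and recall drops accordingly.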

151. 【2606.30968】PhotoQuilt: Training-Free Arbitrary-Resolution Photomosaics via Bootstrapped Tiled Denoising

Link: https://arxiv.org/abs/2606.30968

Authors: Koorosh Roohi, Javad Rajabi, Andrew Fleet, Babak Taati

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: coherent scene, Abstract, global, Photomosaics, PhotoQuilt

Comments: 17 pages, 9 figures. Project page: [this https URL](https://kooroshrh.github.io/photo-quilt/)

Abstract:Photomosaics are large images whose local regions are seen as independent tiles while their overall arrangement forms a coherent scene. Generating them at high resolution, with every tile convincing in its own right, is computationally expensive, since the canvas must hold many detailed tiles at once. We present PhotoQuilt, a training-free framework that generates photomosaics at arbitrary resolution. Diffusion models struggle to satisfy both scales at once, as direct high-resolution generation is costly and tends toward one smooth image rather than a mosaic, while patch-based tiling keeps local detail but loses global structure. PhotoQuilt resolves this with a bootstrapped tiled denoising procedure. We first produce a global composition at low resolution to fix the layout, then upscale it in latent space and re-inject noise to restore generative capacity. Denoising proceeds within fixed tiles, so each forms its own image while the shared global structure holds them in one layout. Because tile generation is handled separately, PhotoQuilt scales to large canvases without quadratic attention cost. Experiments show that PhotoQuilt outperforms current baselines on both global structure and local realism.

152. 【2606.30951】Learning Where to Look: A Reinforcement Learning Framework for Robust Micro-Ultrasound Prostate Cancer Detection

Link: https://arxiv.org/abs/2606.30951

Authors: Mohammad Mahdi Abootorabi, Sina Namazi, Armin Saadat, Lyuyang Wang, Obed Dzikunu, Paul F. R. Wilson, Zhuoxin Guo, Brian Wodlinger, Parvin Mousavi, Purang Abolmaesumi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: promising imaging modality, suspicious tissue remains, tissue remains highly, remains highly dependent, substantial inter-observer variability

Comments: Early Accept at MICCAI 2026 (top 9%)

Abstract:Micro-ultrasound ($\mu$US) is a new, emerging, and promising imaging modality for prostate cancer (PCa) detection, but accurate identification of suspicious tissue remains highly dependent on clinical experience, leading to substantial inter-observer variability. Machine-learning assistance can reduce this variability; however, training reliable deep models is challenging because supervision is sparse and noisy -- typically limited to core-level histopathology outcomes (e.g., cancer grade and its percentage in a biopsy core) without pixel-level lesion annotations and under severe class imbalance. We introduce Prost-RL, which reframes $\mu$US PCa detection as a spatially aware, policy-driven inference problem by learning where to look before decoding. Prost-RL integrates a lightweight reinforcement-learning policy into a foundation-model encoder-decoder to generate interpretable spatial attention maps that act as soft prompts for both cancer-likelihood heatmap prediction and image-level classification. We further propose Adaptive Policy Optimization (APO) to stabilize hybrid supervised-RL training and a noise-robust objective combining symmetric cross-entropy with negative-entropy regularization to mitigate weak-label noise and encourage sharp localization. On a cohort of 6,607 biopsy cores from 693 patients across five clinical sites, Prost-RL achieves $79.0\pm3.5$ AUROC with $64.6\pm6.3$% sensitivity at 80% specificity for core-level detection (+2.1 AUROC and +4.5 sensitivity points over the strongest baseline), and $79.3\pm5.8$ AUROC for clinically significant cancer classification. The learned policy highlights biopsy-aligned regions, providing transparent, spatially grounded evidence alongside quantitative risk predictions. Code is available at: this https URL.

153. 【2606.30937】No Adaptation Without Observation: Observability-Constrained Test-Time Prompt Tuning for LiDAR Semantic Segmentation

Link: https://arxiv.org/abs/2606.30937

Authors: Linlian Jiang, Wentao Ju, Sadman Rakib Pinon, Jianwei Xian, Zhixiang Chi, Xinxin Zuo, Yang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: evolving sensing conditions, retraining is impractical, real-world deployment due, degrades under real-world, due to evolving

Comments: IROS 2026

Abstract:LiDAR semantic segmentation often degrades under real-world deployment due to evolving sensing conditions, while collecting new annotations for retraining is impractical. Test-time adaptation (TTA) updates model parameters online using pseudo-label supervision, but directly applying standard TTA strategies to LiDAR data is challenging. Because pseudo-label reliability is spatially heteroscedastic under range-dependent sparsity and occlusion, uniform updates on globally shared parameters can inject unstable gradients and destabilize adaptation. We propose a geometry-constrained test-time prompt tuning framework for LiDAR semantic segmentation. Our method estimates per-location sensing reliability from depth-consistent beam terminations and neighborhood support, and uses it to reweight spatial supervision. Adaptation is confined to lightweight prompt adapters inserted into a frozen backbone, with spatial gating to prevent unreliable regions from perturbing globally shared representations. A temporally smoothed prototype alignment strategy further stabilizes online updates by accumulating reliable semantic evidence over time. Experiments on standard LiDAR benchmarks demonstrate improved adaptation stability and segmentation performance under deployment variations without additional annotations.

154. 【2606.30901】GRAPE: Graph-Augmented Prototype Explanations for Interactive Medical Image Diagnosis

Link: https://arxiv.org/abs/2606.30901

Authors: Rasul Khanbayov, Hasan Kurban

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: silently amplify unsafe, unsafe physician feedback, amplify unsafe physician, require full retraining, silently amplify

Abstract:Prototype-based medical image classifiers present three clinical limitations: they treat findings as independent, silently amplify unsafe physician feedback, and require full retraining whenever a new finding is needed. We present GRAPE (Graph-Augmented Prototype Explanations), a unified architecture that addresses all three challenges. First, a Graph Attention Task Head models anatomical concept co-occurrence, boosting macro-F1 by +13.8 pp over the prototype baseline on TBX11K. Second, a Concept-Mismatch Safety Check - the first such mechanism in prototype-based medical classifiers - warns when the model's dominant finding inside a doctor-drawn region conflicts with the claimed label, catching 85% of erroneous annotations versus 51% for MC-Dropout with no extra inference cost. Third, Open-Vocabulary Prototype Anchoring aligns visual prototypes to clinical text, allowing a new finding to be added from a single labeled image without modifying any other component. On NIH ChestX-ray14, one Effusion example recovers full-supervision localization accuracy; on TBX11K, prototype maps achieve 2.6x better lesion localization than end-to-end baselines. All three capabilities add only +1 ms latency at interactive batch size. The project page is this https URL.

155. 【2606.30896】Knowledge-Driven Dimension Estimation from a Single Image - 3D Asset Generation Technology for Digital Twin Construction

Link: https://arxiv.org/abs/2606.30896

Authors: Hidenori Sakaniwa, Akihito Akai, Akihiko Hyodo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: false detections, missed detections, simulation technology, enabling pre-evaluation, pre-evaluation of false

Comments: 6 pages, 4 figures

Abstract:In the verification of in-vehicle cameras, simulation technology using virtual spaces has advanced, enabling pre-evaluation of false detections and missed detections in various scenarios. However, discrepancies in the scale of the object being verified between the virtual and real environments can lead to a decrease in camera recognition performance. For traffic signs installed at high altitudes, distance measurement using LiDAR or stereo cameras is difficult, requiring size estimation from monocular images. This paper proposes a method for estimating the scale of an object by decomposing it into multiple structural elements and integrating external knowledge regarding design rules, geometric relationships, and conventional dimensions. Specifically, this method detects each component from a monocular image and estimates the size of each component by considering its structural relationships and dimensional consistency with surrounding elements. Furthermore, it generates a 3D asset of the object by reconstructing the estimated components. This method makes it possible to place 3D assets with a scale approximating the real environment within a digital twin space and is expected to contribute to improving the verification accuracy of in-vehicle cameras for autonomous driving in virtual environments.

156. 【2606.30875】The Label Imitation Game: Turing Test Network for Zero-Shot Pseudo-Label Pruning

Link: https://arxiv.org/abs/2606.30875

Authors: Brent A. Griffin, Jason J. Corso

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Foundation model pseudo-labeling, evade standard thresholds, labeling data strictly, Label Imitation Game, enables massive scale

Comments: ECCV 2026

Abstract:Foundation model pseudo-labeling - labeling data strictly via zero-shot inference - enables massive scale, but performance is undermined by hallucinations that evade standard thresholds. To eliminate these errors, we introduce the Turing-inspired Label Imitation Game (LIG), a framework that formalizes pseudo-label pruning as an adversarial interrogation. Rather than filtering labels via isolated thresholds, we use the LIG to train a Turing Test Network (TTN), a task-agnostic "judge" that evaluates candidate pseudo-labels within a dataset-wide context. Experiments across four diverse datasets demonstrate the TTN's robustness, consistently enhancing label accuracy for three state-of-the-art vision-language models without costly supervision or retraining. Crucially, we demonstrate that learned semantic-contextual logic is a robust alternative to spatial-geometric verification, enabling a unique zero-shot task transfer capability - a TTN trained strictly on image classification datasets can effectively prune complex object detection pseudo-labels. This pruning yields F1-score gains of 28% for the worst-performing baseline categories and 44% with task-specific fine-tuning. Significantly, we also observe Category Revival, where the TTN pruning "detoxifies" the training signal for downstream models and enables them to recover from zero recall on transfer-vulnerable classes. The pre-trained TTN models and code are available at this https URL.

157. 【2606.30849】SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

Link: https://arxiv.org/abs/2606.30849

Authors: Juncheng Ma, Yuxuan Du, Yanan Sun, Zhening Xing, Changlin Li, Zhenyu Tang, Bo Li, Peng-Tao Jiang, Li Yuan, Daquan Zhou, Yonghong Tian

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: high computational cost, computational cost leads, substantial inference latency, significantly advanced audio-driven, Diffusion Transformers

Comments: ECCV 2026

Abstract:Diffusion Transformers (DiTs) have significantly advanced audio-driven portrait animation, but their high computational cost leads to substantial inference latency. Although training-free diffusion caching accelerates inference significantly, existing methods are primarily developed for text-conditioned generation and overlook the spatial and modality imbalances inherent in audio-driven portrait animation. In this paper, we propose SyncCache, a training-free caching acceleration method tailored for DiT-based portrait animation that explicitly exploits asymmetric dynamics. Specifically, high-frequency dynamics driven by audio conditions and concentrated in human regions are more challenging and critical to cache and reuse than the low-frequency visual background in portrait animation. First, we introduce Spatially-Asymmetric Probing to prioritize error sensitivity in dynamic human regions. Second, through Modality-Decoupled Caching, we bypass heavy DiT blocks by reusing stable inter-block residuals, while continuously recomputing lightweight audio blocks to preserve precise lip synchronization. Furthermore, we introduce a cache ratio to control cache capacity and formulate memory-adaptive cache selection as an offline dynamic programming problem without online overhead. Extensive experiments demonstrate that SyncCache achieves superior speed-quality trade-offs, delivering up to 4.12x acceleration on HunyuanVideo-Avatar and 3.75x on Wan-S2V with near-lossless visual fidelity and precise audio alignment.

158. 【2606.30811】AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Link: https://arxiv.org/abs/2606.30811

Authors: Kien T. Pham, I Chieh Chen, Qifeng Chen, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: unprecedented research attention, recently gained unprecedented, gained unprecedented research, synthesize high-quality sounding, high-quality sounding video

Comments: ECCV 2026

Abstract:Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. The preceding methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap while necessitating intensive computational resources for proper training. Inspired by recent advancements in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designated for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with shared encoder-decoder and modal-specific learnable queries to efficiently and effectively encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy to progressively realize reconstruction capabilities for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for the challenge of joint audio-video tokenization and provides a potential direction to build unified large multimodal models for audio-video generation.

159. 【2606.30809】GaussLite: Online Task-Conditioned 3D Gaussian Splatting for Real-Time Robotic Mapping

Link: https://arxiv.org/abs/2606.30809

Authors: Annika Thomas, Mason Peterson, Jonathan P. How

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: downstream robotic tasks, robotic tasks engage, distribute representation capacity, representation capacity uniformly, Gaussian Splatting

Abstract:Existing 3D Gaussian Splatting (3DGS) systems distribute representation capacity uniformly across a scene, ignoring the fact that many downstream robotic tasks engage only a fraction of the reconstructed geometry. This causes valuable onboard compute to be allocated towards optimizing irrelevant parts of the scene, either limiting online capacity or under-optimizing the most relevant parts of the scene. We introduce GaussLite, a task-driven 3DGS mapping system that conditions its representation density on a natural-language task specification. Given a posed RGB-D stream and a task such as "prepare to pick up the object on the desk," GaussLite uses a one-shot LLM parser to extract target and anchor objects, which are grounded per-frame by an open-vocabulary detector and segmented to produce per-pixel relevance masks in real time. The mapper allocates seeding density, gradient flow and scaling by task relevance. At matched Gaussian budget and real-time mapping at 4 Hz on resource-constrained hardware, GaussLite outperforms baselines on ROI PSNR on the Replica Dataset by an average +2.72 dB and on a real-hardware demonstration in indoor and outdoor settings by +2.23 dB. We further show that two task-specialized agents' maps can be fused into a single shared map via per-voxel voting on active-optimization counts in real time, outperforming concatenation by +3.42 dB while only sharing an average 7.08% of the map.
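The abstract does not spell out the voting rule used for map fusion, so the following is only a minimal sketch under stated assumptions: each agent keeps a hypothetical per-voxel count of how often Gaussians in that voxel were actively optimized, and the fused map assigns each voxel to the agent with the higher count. The names `fuse_by_voxel_votes`, `counts_a`, and `counts_b` are illustrative and do not come from the paper.

```python
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]  # integer voxel index (x, y, z)


def fuse_by_voxel_votes(
    counts_a: Dict[Voxel, int],
    counts_b: Dict[Voxel, int],
) -> Dict[Voxel, str]:
    """Assign each voxel to the agent that optimized it more often.

    Assumption: a higher active-optimization count means that agent's
    Gaussians in the voxel are better refined. Ties go to agent "a".
    """
    owner: Dict[Voxel, str] = {}
    # Visit every voxel that appears in either agent's map.
    for voxel in counts_a.keys() | counts_b.keys():
        votes_a = counts_a.get(voxel, 0)
        votes_b = counts_b.get(voxel, 0)
        owner[voxel] = "a" if votes_a >= votes_b else "b"
    return owner


if __name__ == "__main__":
    counts_a = {(0, 0, 0): 5, (1, 0, 0): 1}
    counts_b = {(1, 0, 0): 4, (2, 0, 0): 2}
    print(fuse_by_voxel_votes(counts_a, counts_b))
```

Under this assumed rule, only the winning agent's Gaussians for a voxel would need to be transmitted, which is one way a fused map could share only a fraction of each agent's map.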

160. 【2606.30807】Off the Rails: Hijacking the Scoring Head in Generative End-to-End Driving Planners with Safety-Violating Adversarial Perturbations

Link: https://arxiv.org/abs/2606.30807

Authors: Halima Bouzidi, Mboutidem Ekemini Mkpong, Haoyu Liu, Mohammad Abdullah Al Faruque

Categories: Robotics (cs.RO); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: dominant trajectory-decoding paradigms, autonomous driving, trajectory-decoding paradigms, models have recently, recently seen rapid

Comments: 23 pages, 4 figures, 9 tables

Abstract:Generative models have recently seen rapid adoption in End-to-End (E2E) autonomous driving (AD), with diffusion-based denoising and vocabulary-based retrieval becoming the dominant trajectory-decoding paradigms. Despite their architectural diversity, current generative AD planners share a common inference pattern: a fixed set of candidate trajectories (anchors, vocabulary entries, or proposal queries) is scored by one or more learned heads conditioned on the Bird's-Eye-View (BEV) features, and the highest-scored candidate is returned as the final trajectory. Under this design, the scoring head is the only barrier between perception and the motion command, and its decision margins between competing candidates are often small. We introduce Derail, an adversarial framework that exploits this scoring-head attack surface. Evaluated on various generative planners, Derail flips the trajectory selection from a safe to an unsafe candidate, with score drops of $39$--$80\%$ and collision rates of up to $50\%$, consistently outperforming generic loss-maximization and feature-divergence attacks. Our analysis suggests that safety-violating objectives govern attack effectiveness against generative AD planners, and that the scoring-head inference pattern itself is a recurring attack surface worth explicit defensive consideration.
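The inference pattern described in the abstract (score a fixed candidate set, return the top-scored one) can be illustrated with a toy sketch. It shows only why a small decision margin matters and is not the attack itself; the function name `select_trajectory` and the example scores are hypothetical.

```python
from typing import Sequence, Tuple


def select_trajectory(scores: Sequence[float]) -> Tuple[int, float]:
    """Return (index of the top-scored candidate, margin to the runner-up)."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate trajectories")
    # Rank candidate indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, scores[best] - scores[runner_up]


if __name__ == "__main__":
    # Hypothetical scores: candidate 0 is "safe", candidate 1 is "unsafe".
    clean = [0.62, 0.60, 0.10]
    print(select_trajectory(clean))

    # A score shift larger than the margin is enough to flip the selection.
    perturbed = [0.62, 0.60 + 0.03, 0.10]
    print(select_trajectory(perturbed))
```

In this toy case the margin between the two leading candidates is about 0.02, so raising the runner-up's score by 0.03 changes which trajectory is returned.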

161. 【2606.30795】Simple Supervision Is Hard to Beat: A Bitter Lesson from Sparse Target Labels in Domain-Adaptive Object Detection

Link: https://arxiv.org/abs/2606.30795

Authors: Lijun Zhang, Ruinian Xu, Mudit Agrawal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Source-free domain adaptive, Source-free domain, domain adaptive object, typically through teacher-student, domain adaptive

Abstract:Source-free domain adaptive object detection adapts a source-trained detector to an unlabeled target domain, typically through teacher-student self-training with pseudo-labels. We revisit this setting when a small, uniformly sampled subset of target images is labeled. We introduce Random-Target Supervised Mixing (RTSM), a simple anchor that incorporates these annotations through a supervised detection loss while leaving the original unlabeled adaptation branch unchanged. Across evaluations spanning four SFDA-OD methods, two object detectors, multiple adaptation tasks, and target-label budgets from 1% to 10%, RTSM consistently improves pure SFDA by 1.7 to 18.3 AP50. We then examine whether the same annotations can provide further gains by steering unlabeled self-training. To this end, we evaluate ten sparse-label feedback plugins covering pseudo-label selection, object completion, and optimization control, which yield limited and method-dependent gains over RTSM. These results reveal a bitter lesson for sparse-label SFDA-OD: simple supervision is hard to beat. RTSM therefore provides a simple yet effective anchor for sparse-label SFDA-OD.

162. 【2606.30777】Unveiling Transferability in Trajectory Prediction via Latent Scene Embeddings

Link: https://arxiv.org/abs/2606.30777

Authors: Theodor Westny, David Axelsson, Björn Olofsson, Erik Frisk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fueled major advances, data-driven motion prediction, growing availability, availability of trajectory, advances in data-driven

Comments: Accepted to ECCV 2026

Abstract:The growing availability of trajectory datasets has fueled major advances in data-driven motion prediction. Yet, models trained on one dataset often fail to generalize beyond their training domain as a result of differences in scene layouts, agent behaviors, and sensing conditions. A framework that learns latent representations of datasets and quantifies their similarity using distributional metrics is presented. This large-scale study covers 24 major datasets, including the most widely used motion-prediction benchmarks, and shows that the resulting transferability scores strongly correlate with cross-dataset model performance. The results provide practical guidance for dataset selection, pretraining, and large-scale foundation models for motion prediction, paving the way toward more generalizable and robust predictive systems.

163. 【2606.30754】Streaming Gaussian Encoding for 4D Panoptic Occupancy Tracking

Link: https://arxiv.org/abs/2606.30754

Authors: Maximilian Luz, Thomas Nürnberg, Yakov Miron, Abhinav Valada

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: enabling joint reasoning, panoptic occupancy tracking, holistic scene understanding, panoptic occupancy, multi-view imagery

Comments: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract:Camera-based 4D panoptic occupancy tracking (4D-POT) is a promising paradigm for holistic scene understanding from multi-view imagery, enabling joint reasoning about geometry, semantics, and object identities across time. Recent mask-based pipelines achieve strong performance by propagating instance queries across frames. However, their underlying volumetric representations are typically recomputed at each timestep, limiting geometric temporal consistency, particularly under occlusion and for static scene elements. To address this limitation, we propose a streaming Gaussian encoder that maintains a persistent volumetric scene representation for 4D-POT. Our method models the scene as a fixed-size set of latent Gaussian queries that are propagated via ego-motion compensation and refreshed under a confidence-guided budget constraint. Crucially, we shape Gaussian opacities through depth-based supervision to serve as a proxy for visibility, enabling confidence to accumulate as a temporally aggregated measure of persistent scene support. Together with a warmup-based multi-frame training strategy, this yields representation-level temporal coherence beyond decoder-only tracking. Extensive experiments on Occ3D-extended nuScenes and Waymo establish a new state-of-the-art for camera-based 4D-POT, improving tracking consistency with negligible computational overhead while remaining fully compatible with existing mask-based pipelines. We provide code and models at this https URL.

164. 【2606.30697】LUMOS: A Semantic Operating-System Layer for Accessibility-Grounded AI Agents

Link: https://arxiv.org/abs/2606.30697

Authors: Yogeswar Reddy Thota

Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: expose interfaces optimized, Current operating systems, systems expose interfaces, Current operating, expose interfaces

Comments: The LUMOS repository is available at [this https URL](https://github.com/thotayogeswarreddy/Lumos.git)

Abstract:Current operating systems expose interfaces optimized for human users but not for AI agents. Humans benefit from pixels, icons, windows, visual grouping, mouse movement, and keyboard shortcuts; AI agents instead need compact semantic state, grounded actions, and reliable feedback. As a result, many computer-use agents are forced to interpret screenshots, OCR output, and visual crops, introducing high token costs, visual ambiguity, latency, and coordinate uncertainty. This paper introduces LUMOS (Language Model Unified Machine-Readable Operating-System Semantics), a semantic interaction layer between AI agents and operating systems. LUMOS converts native accessibility metadata and browser UI structures into machine-readable semantic blueprints with stable identifiers, roles, names, values, bounds, and action affordances. It also supports live semantic pointer grounding by querying the UI element under or near the cursor through operating-system automation APIs. An LLM then acts through an accessibility-grounded observe-act loop using constrained visible-UI primitives rather than application-specific scripts. LUMOS does not claim to replace visual agents; instead, it reduces dependence on screenshots when operating systems already provide semantic structure. These results suggest a path toward AI-native operating systems and machine-readable interaction layers.

165. 【2606.30677】DANTE-W: Diffuse Albedo Neural Texturing in the Wild

Link: https://arxiv.org/abs/2606.30677

Authors: Guangyu Wang, Tianheng Lu, Ruqi Huang, Lu Fang

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: techniques blend captured, blend captured multi-view, Classical mesh texturing, multi-view images directly, texturing techniques blend

Abstract:Classical mesh texturing techniques blend captured multi-view images directly and therefore inevitably suffer from baked-in shading and cast shadows that compromise visual fidelity during relighting. To circumvent this issue, we present a neural texturing framework, namely DANTE-W, to enable high-fidelity diffuse albedo texture recovery from unstructured image collections for large-scale, in-the-wild scenes, which integrates seamlessly with traditional 3D reconstruction pipelines. Given a reconstructed mesh and its surface parameterization, our method fuses view-space generative albedo priors into a coherent texture space via an expressive neural representation, while substantially enhancing fine-grained textural details through physically principled neural rendering. To comprehensively evaluate our method, we curate a benchmark dataset featuring diverse, fine-grained textures, comprising both real-world in-the-wild scenes and synthetic objects. Extensive experiments verify the effectiveness of our approach in reconstructing accurate albedo textures and boosting relighting fidelity. Project page: this http URL.

166. 【2606.30673】PolyFlow: Continuous Topology Embedding Flow Matching for Artist-style Mesh Generation

Link: https://arxiv.org/abs/2606.30673

Authors: Chunshi Wang, Haohan Weng, Junliang Ye, Biwen Lei, Yang Li, Zibo Zhao, Zeqiang Lai, Kaiyi Zhang, Yunhan Yang, Zhuo Chen, Chunchao Guo, Yawei Luo

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformers dominate high-quality, Autoregressive Transformers dominate, producing artist-worthy topologies, substantial computational overhead, inherent sequential decoding

Abstract:Autoregressive Transformers dominate high-quality mesh generation by producing artist-worthy topologies, yet their inherent sequential decoding induces substantial computational overhead, making them orders of magnitude slower than parallel generative models. On the other hand, while continuous diffusion and flow-matching methods support efficient parallel synthesis across a variety of domains, they cannot be directly applied to meshes: mesh connectivity is inherently discrete and incompatible with standard continuous noise injection and denoising operations. To resolve this fundamental incompatibility, we introduce a compact topology embedder that projects discrete mesh vertex positions and normals into continuous per-vertex embeddings, where the original discrete adjacency information can be faithfully recovered via spacetime distance thresholding. After pretraining and freezing this embedder, any raw mesh can be fully converted into a continuous per-vertex state space unifying position, normal, and implicit topological attributes. Built upon this novel continuous mesh representation, we present PolyFlow, a Transformer-based flow-matching framework that achieves fully parallel vertex state denoising conditioned on extracted point-cloud features. During inference, our model completes generation rapidly via an ODE solver, and supports explicit, precise control over output mesh resolution by directly specifying the target vertex count. Extensive evaluations on the Toys4K benchmark demonstrate that PolyFlow surpasses state-of-the-art autoregressive baselines in both Chamfer Distance and Hausdorff Distance.

167. 【2606.30647】Cross-Modal Hierarchical Fusion for from Multi-Sensor Ground Observation

Link: https://arxiv.org/abs/2606.30647

Authors: Xinze Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: sparse ground-based instruments, ground-based instruments remains, Dense volumetric reconstruction, Dense volumetric, open problem

Abstract:Dense volumetric reconstruction of cloud microphysical fields from sparse ground-based instruments remains an open problem, largely because the available measurements are heterogeneous in both modality and spatial coverage. We present AtmoFuseNet, a framework that fuses multi-view sky camera imagery with millimeter-wave cloud radar and ceilometer observations to produce 4D (three spatial dimensions plus time) estimates of cloud state and wind. The method operates in three stages: a cross-modal hierarchical aggregation module that combines image feature pyramids with instrument-derived vertical profiles through layer-wise cross-attention; a conditional variational refinement module that maps the resulting volume to physically consistent microphysical fields under differentiable radar and image forward models; and a correlation-based motion estimator that recovers per-voxel 3D wind vectors from consecutive volumetric reconstructions. On collocated observations from a semi-arid site, AtmoFuseNet reaches 0.026 g m^-3 liquid water content MAE and 1.18 m s^-1 wind speed MAE, improving over existing retrieval baselines. Ablation experiments isolate the contribution of each module.

168. 【2606.31521】Distortion-Corrected Diffusion MRI Using Rotated-View EPI and Joint Field-Map/Image Estimation with Gaussian Primitives

Link: https://arxiv.org/abs/2606.31521

Authors: Wenqi Huang, Zhitao Li, Nan Wang, Yimeng Lin, Mengze Gao, Yurui Qian, Sevgi Gokce Kafali, Xiaozhi Cao, Kawin Setsompop, Daniel Rueckert, Congyu Liao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: Echo Planar Imaging, Echo Planar, enabling rapid imaging, geometric distortions caused, Planar Imaging

Abstract:Echo Planar Imaging (EPI) is the standard acquisition technique for diffusion and functional neuroimaging, enabling rapid imaging but suffering from geometric distortions caused by B0 field inhomogeneities. Existing correction methods first reconstruct distorted images using parallel imaging, then estimate the B0 field and correct the distortion in the image domain. In this sequential process, reconstruction artifacts at high acceleration factors and low SNR at high diffusion b-values degrade B0 estimation and limit the overall correction quality. We propose a physics-informed framework that jointly estimates the B0 field and distortion-free image directly from k-space data, without depending on an intermediate parallel-imaging reconstruction for the correction. The image and the B0 field are each represented as a superposition of Gaussian primitives embedded within an MRI physics forward model. The explicit, continuous parameterization captures both smooth regions and tissue boundaries and supports rotated-view EPI acquisitions without interpolation. The diffusion-weighted image is modeled as real and non-negative, with the image phase absorbed into a per-shot phase factor. Rotated views distribute distortions across multiple phase-encoding orientations, improving point spread function isotropy and providing stronger constraints for B0 estimation. On in vivo brain diffusion EPI, the proposed method attains the closest brain-boundary agreement with a distortion-free structural reference, with the largest improvement over sequential methods at high b-value and high acceleration. Extensive visual comparisons further show improved detail fidelity and noise suppression.

169. 【2606.31084】Accelerating Merge with Motion Vector Difference via Filter Difference Analysis for VVenC

Link: https://arxiv.org/abs/2606.31084

Authors: Xinmin Feng, Shengyang Xu, Jianhua Chen, Li Li, Dong Liu, Feng Wu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Versatile Video Coding, key coding tool, Versatile Video, Video Coding, tool in Versatile

Comments: 5 pages, 4 tables, 4 figures

Abstract:Merge with Motion Vector Difference (MMVD) is a key coding tool in Versatile Video Coding for improving motion prediction accuracy. However, its exhaustive search strategy imposes a significant computational burden on the encoder. To address this issue, we propose a novel fast MMVD algorithm for the VVenC encoder based on fractional motion vector filter difference analysis. By approximating the 8-tap interpolation filter with a 2-tap filter, we derive a criterion based on spatial gradients and prediction residuals for estimating the potential gain of MMVD candidates. We further generalize this criterion to accommodate both shifted integer reference samples and 2D separable filtering. To minimize the overhead of the proposed method, we introduce implementation optimizations, including symmetric offset inference and cross-shaped downsampled dot-product computation. Compared with existing fast MMVD algorithms in VVenC, our method reduces the average MMVD search ratio from 21.07% to 11.05% and decreases the efficiency-complexity metric $\eta$ from 11.79 to 7.10 under the fast preset.
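To make the "8-tap versus 2-tap" idea concrete, here is a rough one-dimensional sketch of half-sample interpolation with both filters. It is not the paper's criterion. The 8-tap coefficients are the commonly cited HEVC/VVC half-sample luma taps, the 2-tap filter is assumed to be plain bilinear averaging, and the integer rounding and clipping of a real codec are omitted. All function names are hypothetical.

```python
from typing import Sequence

# Commonly cited half-sample luma taps (they sum to 64); treated here as an assumption.
HALF_PEL_8TAP = (-1, 4, -11, 40, 40, -11, 4, -1)
# Assumed 2-tap approximation: bilinear averaging of the two nearest samples.
HALF_PEL_2TAP = (32, 32)


def interp_half_pel_8tap(samples: Sequence[float], i: int) -> float:
    """Half-sample value between samples[i] and samples[i + 1] using 8 taps."""
    if i < 3 or i + 4 >= len(samples):
        raise IndexError("8-tap window needs samples[i - 3] .. samples[i + 4]")
    window = samples[i - 3 : i + 5]
    return sum(c * s for c, s in zip(HALF_PEL_8TAP, window)) / 64


def interp_half_pel_2tap(samples: Sequence[float], i: int) -> float:
    """Half-sample value between samples[i] and samples[i + 1] using 2 taps."""
    return (HALF_PEL_2TAP[0] * samples[i] + HALF_PEL_2TAP[1] * samples[i + 1]) / 64


def filter_difference(samples: Sequence[float], i: int) -> float:
    """Difference between the 8-tap and 2-tap predictions at the same position."""
    return interp_half_pel_8tap(samples, i) - interp_half_pel_2tap(samples, i)


if __name__ == "__main__":
    ramp = list(range(10))                       # smooth content
    spike = [0, 0, 0, 0, 0, 100, 0, 0, 0, 0]     # sharp detail
    print(filter_difference(ramp, 4))
    print(filter_difference(spike, 4))
```

On the smooth ramp the two filters agree, while on the isolated spike they differ, which hints at why local gradients could indicate when the cheaper filter is an adequate stand-in.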