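This post presents the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.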

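Statistics

633 papers were updated today, including:

  • Natural Language Processing: 80
  • Information Retrieval: 14
  • Computer Vision: 135

Natural Language Processing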

1. 【2605.16250】A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation

Link: https://arxiv.org/abs/2605.16250

Authors: Pavan Manjunath, Thomas Pruefer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Keywords:

Comments:

Abstract: None

2. 【2605.16245】AI-Mediated Communication Can Steer Collective Opinion

Link: https://arxiv.org/abs/2605.16245

Authors: Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: Generative artificial intelligence, Generative artificial, large language models, humans exchange opinions, polish users' posts

Comments:

Abstract:Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals' opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the "Explain this post" feature on X and find evidence of pro-life bias in Grok's outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.
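
The abstract does not spell out the opinion-dynamics model, so the following is only a minimal sketch under stated assumptions: a Friedkin-Johnsen-style update in which a hypothetical AI mediator adds a constant `bias` to every opinion before neighbours perceive it. The network, the `stubbornness` parameter, and the additive form of the bias are all illustrative choices, not the authors' formulation.

```python
def simulate(weights, innate, bias, stubbornness=0.25, steps=200):
    """Friedkin-Johnsen-style dynamics with an assumed AI mediator.

    Each user mixes their innate opinion with the weighted average of the
    opinions they perceive. The mediator shifts every expressed opinion by
    `bias` (an assumption for illustration) before others see it.
    """
    n = len(innate)
    opinions = list(innate)
    for _ in range(steps):
        # Opinions as perceived after AI editing, clipped to [-1, 1].
        perceived = [max(-1.0, min(1.0, x + bias)) for x in opinions]
        opinions = [
            stubbornness * innate[i]
            + (1.0 - stubbornness)
            * sum(weights[i][j] * perceived[j] for j in range(n))
            for i in range(n)
        ]
    return opinions


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4-user ring network; every row and column sums to 1.
WEIGHTS = [
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]
INNATE = [-0.6, -0.2, 0.2, 0.6]  # innate opinions average to zero

baseline = mean(simulate(WEIGHTS, INNATE, bias=0.0))
mediated = mean(simulate(WEIGHTS, INNATE, bias=0.1))
print(f"mean opinion without AI bias: {baseline:+.3f}")
print(f"mean opinion with AI bias:    {mediated:+.3f}")
```

In this toy setting the equilibrium mean satisfies m = (1 - s)(m + b), so a per-message bias of b = 0.1 with stubbornness s = 0.25 moves the mean opinion by 0.3, three times the bias applied to any single message. This illustrates how a small edit-level bias could compound through repeated exchange; it says nothing about the magnitudes reported in the paper.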

3. 【2605.16234】Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find

Link: https://arxiv.org/abs/2605.16234

Authors: Gabriel Garcia

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: conflate distinct tests, distinct tests, conflate distinct, cs.LG, Abstract

Comments: 40 pages, 8 figures, 24 tables. Code and frozen JSON logs are not public during write-up; the authors plan to open [this https URL](https://github.com/Gpgabriel25/ProtocolGapDiagnostic)

Abstract:When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

4. 【2605.16233】FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Link: https://arxiv.org/abs/2605.16233

Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Keywords: agents improve decision-making, gradient updates, Failure-Optimized Reflective Graduation, decision-making through self-generated, LLM agents improve

Comments:

Abstract:Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is a critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

5. 【2605.16232】A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution, Utility Billing, Carbon Analytics, and Quantum-Inspired Optimisation

Link: https://arxiv.org/abs/2605.16232

Authors: Pavan Manjunath, Thomas Pruefer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)

Keywords: generative artificial intelligence, manage physical infrastructure, quantum-inspired combinatorial optimisation, energy utilities manage, utilities manage physical

Comments:

Abstract:The accelerating convergence of smart metering, generative artificial intelligence, and quantum-inspired combinatorial optimisation is reshaping how energy utilities manage physical infrastructure, customer engagement, and environmental accountability.

6. 【2605.16222】Artificial Aphasias in Lesioned Language Models

Link: https://arxiv.org/abs/2605.16222

Authors: Nathan Roll, Jill Kries, Laura Gwilliams, Cory Shain

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: affected brain regions, providing causal links, selective language impairments, brain damage, functional organization

Comments: 49 pages, 13 figures

Abstract:Aphasias, selective language impairments which can arise from brain damage, reveal the functional organization of human language by providing causal links between affected brain regions and specific symptom profiles. Drawing on this literature, we introduce an aphasia-inspired technique to characterize the emergent functional organization of language models (LMs). We "lesion" (zero-out) model parameters and measure the effects of this intervention against clinical aphasia symptoms, as diagnosed by the Text Aphasia Battery (TAB). When applied to 112,426 outputs from five 1B-scale LMs, the full range of evaluated symptoms surfaces, but in distributions largely distinct from those of humans. Our method uncovers broad symptom-profile differences between attention components (query, key, value, output) and feed-forward components (up, gate, down), with weaker evidence for differences among components within the same mechanism. We also find an effect of depth, where lesions in early layers disproportionately cause syntactic and semantic symptoms while late-middle layers yield higher rates of phonological and fluency deficits. Although some LM lesions induce quantitatively more similar profiles to some human aphasia types than others, qualitative differences in symptom patterns between LMs and humans suggest that aphasia syndromes are heavily influenced by the details of learning and processing rather than being a domain-invariant consequence of disrupted language processing.

7. 【2605.16217】Argus: Evidence Assembly for Scalable Deep Research Agents

Link: https://arxiv.org/abs/2605.16217

Authors: Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: information seeking tasks, achieved remarkable progress, complex information seeking, seeking tasks, Deep research

Comments:

Abstract:Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

8. 【2605.16215】Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Link: https://arxiv.org/abs/2605.16215

Authors: Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: decision support systems, Clinical decision support, require scrutable, Fully Open, support systems

Comments: Preprint. 31 pages, 10 figures. Code, models, and data: [this https URL](https://github.com/EPFLiGHT/FullyOpenMeditron)

Abstract:Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

9. 【2605.16207】Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Link: https://arxiv.org/abs/2605.16207

Authors: Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Effective tutoring requires, intelligent tutoring systems, requires distinguishing optimal, tutoring requires distinguishing, Effective tutoring

Comments: 22 pages, 20 figures

Abstract:Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

10. 【2605.16205】Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Link: https://arxiv.org/abs/2605.16205

Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Keywords: Deploying compound LLM, partially observable sequential, sequential environments requires, environments requires navigating, Partially Observable Markov

Comments:

Abstract:Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

11. 【2605.16193】Improving Cross-Cultural Survey Simulation with Calibrated Value Personas

Link: https://arxiv.org/abs/2605.16193

Authors: Axel Abels, Elias Fernandez Domingos, Apurva Shah, Tom Lenaerts

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large language models, cultures remains limited, Large language, language models, remains limited

Comments: Submitted to the Fourth International Workshop on Value Engineering in AI (VALE 2026), held at IJCAI-ECAI 2026

Abstract:Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements observed in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those that are less represented in training data, while also yielding response distributions that closely match human diversity.

12. 【2605.16191】Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search

Link: https://arxiv.org/abs/2605.16191

Authors: Michael P. Brenner, Lizzie Dorfman, John C. Platt

Categories: Computation and Language (cs.CL); Other Condensed Matter (cond-mat.other); Computational Physics (physics.comp-ph)

Keywords: Empirical Research Assistance, present a case, case study, Empirical Research, Research Assistance

Comments: 10 pages, 7 figures

Abstract:We present a case study for how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google's AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome losses limiting flat solar panels at mid-latitudes. These structures operate by presenting favorable angles to the sun throughout the day, and for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations \cite{bernardi2012solar} showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for large scale tree search, where we seek improved solutions and score them for their diurnal yield. The initial tree search leads to nominally more efficient solutions, yet they are caused by algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitations of the discretizations in the optics solver. To counteract this, we develop a workflow where the coding agent iteratively patches the physics engine with constraints to eliminate reward hacking. With reward-hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas, optimizing zenith tracking and avoiding self shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery, for problems whose solutions can be empirically evaluated with a score function.

13. 【2605.16143】Look Before You Leap: Autonomous Exploration for LLM Agents

Link: https://arxiv.org/abs/2605.16143

Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language model, sufficient environment-specific information, language model based, unfamiliar environments due, acquiring sufficient environment-specific

Comments:

Abstract:Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.

14. 【2605.16117】SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

Link: https://arxiv.org/abs/2605.16117

Authors: Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, diverse NLP applications, Large Language, demonstrated strong capabilities, NLP applications

Comments:

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.

15. 【2605.16113】DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.16113

Authors: Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, achieved unprecedented success, unprecedented success due, Large language, exceptional generative capabilities

Comments:

Abstract:Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.

16. 【2605.16107】Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection

Link: https://arxiv.org/abs/2605.16107

Authors: Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, Defu Lian

Categories: Computation and Language (cs.CL)

Keywords: Machine-generated texts, pose risks, disinformation and phishing, Machine-generated, token-level detection score

Comments:

Abstract:Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. Then, we theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, for local relations, we model them through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.

17. 【2605.16077】Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

Link: https://arxiv.org/abs/2605.16077

Authors: Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya, Eiji Aramaki

Categories: Computation and Language (cs.CL)

Keywords: remains challenging due, limited dataset size, Accurate assessment, speech remains challenging, remains challenging

Comments: 11 pages, 6 figures

Abstract:Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
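
The similarity-guided, class-balanced selection described above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' procedure: it assumes each synthetic monologue is ranked by cosine similarity to the same participant's real speech embedding, and that minority classes are topped up to the size of the largest class. The function `select_synthetic`, the tuple layout, and the toy 2-D embeddings are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_synthetic(real, synthetic):
    """Pick synthetic samples that top up each minority class.

    real, synthetic: lists of (participant_id, class_label, embedding).
    For every class smaller than the largest one, candidates are ranked by
    similarity to the source participant's real embedding (an assumption)
    and the closest ones are kept until the class is balanced.
    """
    counts = Counter(label for _, label, _ in real)
    target = max(counts.values())
    real_embedding = {pid: emb for pid, _, emb in real}
    chosen = []
    for label, count in counts.items():
        needed = target - count
        candidates = [
            s for s in synthetic if s[1] == label and s[0] in real_embedding
        ]
        candidates.sort(
            key=lambda s: cosine(s[2], real_embedding[s[0]]), reverse=True
        )
        chosen.extend(candidates[:needed])
    return chosen


# Toy data: three "high" score participants and one "low" score participant.
real = [
    ("p1", "high", [1.0, 0.0]),
    ("p2", "high", [0.9, 0.1]),
    ("p3", "high", [0.8, 0.2]),
    ("p4", "low", [0.0, 1.0]),
]
synthetic = [
    ("p4", "low", [0.1, 0.9]),  # close to p4's real embedding
    ("p4", "low", [0.9, 0.1]),  # far from p4's real embedding
    ("p4", "low", [0.2, 0.8]),  # fairly close
]

chosen = select_synthetic(real, synthetic)
for pid, label, emb in chosen:
    print(pid, label, emb)
```

Replacing the similarity sort with a random shuffle would give the random class-balanced variant that the abstract reports as less stable.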

18. 【2605.16052】Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

Link: https://arxiv.org/abs/2605.16052

Authors: Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, significantly enhanced automated, Recent advances, enhanced automated legal, language models

Comments:

Abstract:Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

19. 【2605.16045】RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

Link: https://arxiv.org/abs/2605.16045

Authors: Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: limited context windows, retrievable external memory, organize user-agent interactions, Memory, Memory systems

Comments: Accepted to ACL 2026 Findings

Abstract:Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.
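To make the "consolidate only on sustained recurrence" idea concrete, here is a minimal sketch. It is not RecMem's implementation: the class name `RecurrenceBuffer`, the cosine threshold, the recurrence count, and the running-mean centroid are all assumptions chosen for illustration, and the embeddings are expected to come from whatever lightweight encoder is in use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RecurrenceBuffer:
    """Holds raw interactions and flags clusters that recur often enough to consolidate."""

    def __init__(self, sim_threshold=0.8, min_recurrence=3):
        self.sim_threshold = sim_threshold    # hypothetical similarity cutoff
        self.min_recurrence = min_recurrence  # hypothetical recurrence count
        self.clusters = []                    # each: {"centroid": [...], "items": [...]}

    def add(self, text, embedding):
        """Store one interaction; return the cluster's texts if it just became due, else None."""
        best, best_sim = None, self.sim_threshold
        for cluster in self.clusters:
            sim = cosine(embedding, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            best = {"centroid": list(embedding), "items": []}
            self.clusters.append(best)
        n = len(best["items"])
        # A running mean keeps the centroid representative of the whole cluster.
        best["centroid"] = [(c * n + e) / (n + 1) for c, e in zip(best["centroid"], embedding)]
        best["items"].append(text)
        if len(best["items"]) == self.min_recurrence:
            return list(best["items"])  # the caller would hand these to an LLM once
        return None
```

In this sketch the expensive LLM call happens only when `add` returns a list, so one-off interactions stay in the cheap embedding-indexed layer.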

20. 【2605.16026】From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

Link: https://arxiv.org/abs/2605.16026

Authors: Yu Pan, Yang Hou, Xiongfei Wu, Liang Zhang, Yves Le Traon, Lei Ma, Jianjun Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: recently shown promising, large language models, speech large language, shown promising performance, built upon speech

Comments: Submitted to IEEE/ACM TASLP. This work extends S2ST-Omni, accepted to Findings of ACL 2026

Abstract:Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.

21. 【2605.16023】Judge Circuits

Link: https://arxiv.org/abs/2605.16023

Authors: Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: False label, model assigns systematically, Edge Attribution Patching, Position-aware Edge Attribution, dominant paradigm

Comments: 32 pages

Abstract:LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

22. 【2605.16011】Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

Link: https://arxiv.org/abs/2605.16011

Authors: Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: learners' learning performance, track learners' learning, individual learners' learning, learners' learning progress, learners' learning

Comments:

Abstract:Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.

23. 【2605.15990】Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory

Link: https://arxiv.org/abs/2605.15990

Authors: Isar Nejadgholi, Masoud Kianpour, Krishnapriya Vishnubhotla, Maryam Molamohamadi

Categories: Computation and Language (cs.CL)

Keywords: Tremendous efforts, systems across cultures, put into evaluating, evaluating the inclusivity, inclusivity and effectiveness

Comments:

Abstract:Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers "Does the model know?", Cultural Sensitivity answers "How does it frame its knowledge?", and Cultural Competence answers "Can it adapt as the interaction evolves?". Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.

24. 【2605.15978】Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports

Link: https://arxiv.org/abs/2605.15978

Authors: Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)

Keywords: Law enforcement reports, Law enforcement, structured fields, fields and written, written narratives

Comments: 13 pages, 8 figures, 9 tables

Abstract:Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives to recover incident details only from the unstructured text and build temporal graphs with time cues and domain axioms. We achieve this by redacting personal identifiers, semantic parsing, predicate mapping to ontology, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the events extracted by the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank-VerbNet-WordNet semantic path. Agreement was 100% on incident initiation, stolen items, and temporal cues, and lower on forced entry interpretation.

25. 【2605.15976】Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

Link: https://arxiv.org/abs/2605.15976

Authors: Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Production machine translation, machine translation relies, translation relies overwhelmingly, reinforcement learning approaches, largely targeted decoder-only

Comments:

Abstract:Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
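For readers unfamiliar with GRPO, the core step is scoring each sampled translation relative to the other samples for the same source sentence. The sketch below shows that normalization with a blended reference-free reward. The two input scores stand in for a LaBSE-style similarity and a COMET-Kiwi-style quality estimate, and the equal weighting `alpha=0.5` and the use of the population standard deviation are assumptions, since the abstract does not say how the two signals are combined.

```python
import statistics


def hybrid_reward(similarity_score, qe_score, alpha=0.5):
    """Blend two reference-free signals; `alpha` is a hypothetical mixing weight."""
    return alpha * similarity_score + (1.0 - alpha) * qe_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same source sentence."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four candidate translations of one source sentence, with made-up scores.
rewards = [hybrid_reward(s, q) for s, q in [(0.9, 0.8), (0.7, 0.6), (0.8, 0.7), (0.5, 0.4)]]
advantages = group_relative_advantages(rewards)
```

Samples with above-average reward get a positive advantage and are reinforced. When all samples score the same, every advantage is zero and the group contributes no gradient, which is one intuition for why reward discriminability matters.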

26. 【2605.15915】SLIP ETHICS: Graduated Intervention for AI Emotional Companions

Link: https://arxiv.org/abs/2605.15915

Authors: Minseo Kim

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: damage supportive alliance, emotional companions face, permissive systems risk, risk user harm, systems risk user

Comments: Accepted to PervasiveHealth 2026. 11 pages, 2 figures, 4 tables. Proc. of the 20th EAI International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth 2026)

Abstract:AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage graduated methodology deriving interventions (none, soft, hard) from structured qualitative indicators -- affect intensity (a) and narrative dynamism (m) -- alongside ETHICS (Emergent Taxonomy for Human-AI Interaction Context Signals), a "signals not labels" taxonomy. An evaluation combining a small-scale production deployment (N=68 entries, 10 users, 10 weeks) with a synthetic persona battery (N=91, 5 behavioral-risk profiles) achieved 0% false positives for the flow persona and showed expected escalation patterns in crisis-oriented personas. However, initial results showed that 8 consecutive days of high-energy elevation produced zero interventions (0/8), exposing a boundary where the "do not pathologize" principle conflicts with safety. A subsequent three-model stress test demonstrated that increased model capability improves detection from 0/8 to 6/8 while preserving 0/10 flow false positives in the largest model. Read as preliminary, these findings position graduated intervention as a design direction for navigating -- not resolving -- the safety-rapport tension in affective computing.

27. 【2605.15913】Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

Link: https://arxiv.org/abs/2605.15913

Authors: Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: offers significant potential, Retrieval-Augmented Generation, offers significant, significant potential, potential to improve

Comments: 16 pages, 2 figures

Abstract:Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
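The phrase "separate blocks that cannot attend to one another" amounts to a block-diagonal causal attention mask. The sketch below builds such a mask from a list of block lengths. The function name and the optional `last_block_sees_all` switch (letting a final query block read every earlier block, as is common in RAG-style setups) are assumptions for illustration, not details taken from the paper.

```python
def block_attention_mask(block_lengths, last_block_sees_all=False):
    """Return a boolean matrix where mask[q][k] is True if query token q may attend to key token k."""
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_of)
    last = len(block_lengths) - 1
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: only keys at or before the query position
            same_block = block_of[q] == block_of[k]
            global_query = last_block_sees_all and block_of[q] == last
            mask[q][k] = same_block or global_query
    return mask


# Two retrieved passages (2 and 3 tokens) followed by a 2-token question.
mask = block_attention_mask([2, 3, 2], last_block_sees_all=True)
```

Because tokens in one passage never look at another passage, the keys and values of each passage depend only on that passage, which is what makes per-block KV cache reuse possible.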

28. 【2605.15886】Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

Link: https://arxiv.org/abs/2605.15886

Authors: Daria Blinova, Gayathri Emuru, Rakesh Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis, Sunita Chandrasekaran, Benjamin E. Bagozzi

Categories: Computation and Language (cs.CL)

Keywords: addressing persistent deficiencies, addressing persistent, paper introduces, persistent deficiencies, authoritarian politics contexts

Comments:

Abstract:This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.

29. 【2605.15864】Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Link: https://arxiv.org/abs/2605.15864

Authors: Chufan Shi, Cheng Yang, Yaokang Wu, Linhao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: produce self-reflective statements, produce self-reflective, check the figure, Vision-Language Models, self-reflective statements

Comments: ICML 2026 Spotlight

Abstract:Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: this https URL

30. 【2605.15848】Conversations in Space: Structuring Non-Linear LLM Interactions on a Canvas

Link: https://arxiv.org/abs/2605.15848

Authors: Rifat Mehreen Amin, Alperen Adatepe, Daniela Fernandes, Daniel Buschek, Andreas Butz

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: large language models, structure limits exploration, Conversational interfaces powered, language models, ideation and analysis

Comments:

Abstract:Conversational interfaces powered by large language models (LLMs) are widely used for ideation and analysis, yet their linear structure limits exploration of alternatives and management of long-running interactions. We present CanvasConvo, a conversational interface concept that transforms linear chat into a branching conversation tree embedded in a spatial canvas. CanvasConvo enables users to explore what-if scenarios by branching directly from conversational content, supporting parallel development of alternative directions. These branches are visualized on a canvas while remaining integrated with a familiar chat interface, allowing users to switch between linear and non-linear interaction. Features such as timeline-based navigation, automatic tagging and summarization, and context-aware controls (e.g., goals, reusable prompts) support structured interaction and continuity. We evaluated CanvasConvo in a 5-7 day field study with 24 participants. Our findings highlight how non-linear conversational structures support exploratory workflows and different interactions in LLM-based work.

31. 【2605.15815】BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

Link: https://arxiv.org/abs/2605.15815

Authors: Sihan Fu, Oucheng Liu, Shiyuan Wang, Jin Shi, Chengkun Wei

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: usable development state, Code agents increasingly, unfamiliar repositories, costly prerequisite, development state

Comments: 19 pages, 9 figures, 6 tables

Abstract:Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial-and-error exploration, yet the resulting knowledge--resolved dependencies, repair strategies--stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi-agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent-consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold-start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at this https URL.

32. 【2605.15794】ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

Link: https://arxiv.org/abs/2605.15794

Authors: Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz, Kamil Guttmann

Categories: Computation and Language (cs.CL)

Keywords: Format-Preserving Multilingual Translation, preserves original layout, original layout metadata, layout metadata proposed, Format-Preserving Multilingual

Comments:

Abstract:We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.

33. 【2605.15763】CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

Link: https://arxiv.org/abs/2605.15763

Authors: Kamil Guttmann, Zofia Fraś, Artur Nowakowski, Krzysztof Jassem

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: raising data privacy, data privacy concerns, machine translation relies, Quality Estimation, relies on massive

Comments:

Abstract:Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.

34. 【2605.15759】DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

Link: https://arxiv.org/abs/2605.15759

Authors: Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language model, Large language, past interactions, leverage information, information from past

Comments:

Abstract:Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at this https URL.
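A memory unit with explicit fields might look something like the dataclass below. The field list (time, location, reason, purpose, keywords) follows the abstract, but the class name `MemoryUnit`, the string-typed fields, and the exact-match filter `filter_by_dimensions` are hypothetical and are not the DimMem schema or retrieval method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryUnit:
    """One atomic, self-contained memory with optional dimension fields."""
    content: str
    time: Optional[str] = None
    location: Optional[str] = None
    reason: Optional[str] = None
    purpose: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


def filter_by_dimensions(memories, **constraints):
    """Keep memories whose fields equal every constraint; `keyword` tests list membership."""
    keyword = constraints.pop("keyword", None)
    kept = []
    for m in memories:
        if keyword is not None and keyword not in m.keywords:
            continue
        if all(getattr(m, name) == value for name, value in constraints.items()):
            kept.append(m)
    return kept


# Made-up example data.
store = [
    MemoryUnit("Booked a dentist appointment", time="2024-05-01", location="Berlin", keywords=["health"]),
    MemoryUnit("Bought new skis", location="Oslo", purpose="winter trip", keywords=["sport"]),
]
berlin_memories = filter_by_dimensions(store, location="Berlin")
```

Filtering on typed fields before any embedding search is one way a structured store can answer "where" or "when" questions without scanning the full dialogue history.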

35. 【2605.15726】Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Link: https://arxiv.org/abs/2605.15726

Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Reinforcement learning, language models, learning with verifiable, paradigm for improving

Comments: 28 pages, 7 figures

Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, and also outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at this https URL.

36. 【2605.15721】Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

Link: https://arxiv.org/abs/2605.15721

Authors: Jiachen Zhu, Zhuoying Ou, Congmin Zheng, Yuxiang Chen, Zeyu Zheng, Rong Shan, Lingyu Yang, Lionel Z. Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, automated context engineering, motivating the development

Comments:

Abstract:Large Language Models (LLMs) are highly sensitive to their input contexts, motivating the development of automated context engineering. However, existing methods predominantly treat this as a global search problem, seeking a single context strategy that maximizes average performance across a dataset. This restrictive assumption overlooks the fact that different inputs often require distinct guidance, leaving substantial instance-level performance gains untapped. In this paper, we propose a paradigm shift by formulating context engineering as a recommendation problem. We introduce Neural Collaborative Context Engineering (NCCE), a framework that transitions optimization from a static global search to dynamic, instance-wise routing. NCCE first bootstraps a diverse catalog of anchor contexts and then employs a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while the newly evaluated contexts continuously refine the NCF model's understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments demonstrate that by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the critical importance of personalization in LLM context engineering.

37. 【2605.15710】SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory

Link: https://arxiv.org/abs/2605.15710

Authors: Huacan Chai, Yukai Wang, Yingxuan Yang, Dan Peng, Yuanyi Song, Zhihui Fu, Weiwen Liu, Jianghao Lin, Jun Wang, Weinan Zhang

Categories: Computation and Language (cs.CL)

Keywords: Existing benchmarks, independently originated sources, Source-distributed Multimodal Memory, multimodal memory, distributed across independently

Abstract: Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at this https URL.

38. 【2605.15701】H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

Link: https://arxiv.org/abs/2605.15701

Authors: Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Model, Large Language, OpenClaw and Manus, ubiquitous in Large, Language Model

Abstract: Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory to improve their performance on the question-answering (QA) task, but they lack a principled mechanism for modeling how memory data evolves over time and for retrieving it effectively, leading to poor memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. In particular, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

39. 【2605.15687】ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.15687

Authors: Jiahui Guang, Yingjie Zhu, Cuiyun Gao, Haiyan Wang, Jing Li, Di Shao, Zhaoquan Gu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: memorize sensitive cross-modal, sensitive cross-modal information, Multimodal large language, making machine unlearning, large language models

Abstract: Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that, on average, ASRU significantly improves unlearning effectiveness (+24.6%) and generation quality (5.8x) while effectively preserving model utility, using only a small amount of retained supervision data.

40. 【2605.15680】Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

Link: https://arxiv.org/abs/2605.15680

Authors: Liqi Zhou, Jiafu Li

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Keywords: Online patient inquiries, Online patient, professional assessment, clinical follow-up, patient inquiries

Comments: 4 figures, 19 tables, 23 pages (including appendix and references)

Abstract: Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. Accordingly, we evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency-recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-$F_1$ 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
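
As a rough illustration of the metrics named in this abstract, the sketch below computes macro-F1 and an under-triage rate for the four-class task. The label ordering, the helper names, and the under-triage definition (a prediction less urgent than the gold label) are assumptions made for illustration; the paper's exact definitions may differ.

```python
# Illustrative sketch only: label order and metric definitions are assumptions.
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
RANK = {label: i for i, label in enumerate(LEVELS)}


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the four triage classes."""
    scores = []
    for label in LEVELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class that is never gold and never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def under_triage_rate(gold, pred):
    """Share of cases routed to a less urgent level than the gold label (assumed definition)."""
    return sum(RANK[p] < RANK[g] for g, p in zip(gold, pred)) / len(gold)


# Hypothetical toy labels, not data from the paper.
gold = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
pred = ["self-care", "self-care", "urgent-clinician-review", "urgent-clinician-review"]
print(macro_f1(gold, pred), under_triage_rate(gold, pred))
```

Reporting the under-triage rate next to macro-F1 separates errors that delay care from errors that only add unnecessary review.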

41. 【2605.15677】VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

Link: https://arxiv.org/abs/2605.15677

Authors: Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai, Kaitao Lin, Liang Chen, Gai Yuhang, Yuyu Luo, Qiang Wang, Xiaowen Chu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical gap remains, controllable diagrammatic tasks, Vision-Language Models, diagrammatic tasks essential, controllable diagrammatic

Comments: Accepted by ICML 2026, 37 pages, 10 figures

Abstract: Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

42. 【2605.15676】Dynamic Chunking for Diffusion Language Models

Link: https://arxiv.org/abs/2605.15676

Authors: Yichen Zhu, Xiaoming Shi, Peng Zhao, Weiyu Chen, Debing Zhang, James Kwok

Categories: Computation and Language (cs.CL)

Keywords: decoupling within-block parallel, within-block parallel denoising, language models factorize, diffusion language models, fixed-size positional blocks

Abstract: Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.

43. 【2605.15635】Evaluating Chinese Ambiguity Understanding in Large Language Models

Link: https://arxiv.org/abs/2605.15635

Authors: Junwen Mo, Yuanzhi Lu, Yifang Xue, Ke Xu, Hideki Nakayama

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, limited attention devoted, robustness of Large, Chinese ambiguity

Abstract:Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen. It is the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potential ambiguous structures. Evaluating LLMs (e.g. Gemma 3, Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection (improved by CoT prompting). Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach for Chinese ambiguity corpus and insights into LLMs' ambiguity handling, laying a foundation for enhancing Chinese ambiguity research in LLMs.
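
The semantic entropy metric mentioned in this abstract can be sketched as the entropy of sampled outputs after they are grouped by meaning. The snippet below is a minimal sketch that assumes the grouping has already been done; the cluster labels and the `semantic_entropy` helper are hypothetical, and the paper's clustering procedure is not described here.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical cluster labels for eight sampled interpretations of each sentence.
unambiguous = ["A"] * 8
ambiguous = ["A"] * 4 + ["B"] * 4
print(semantic_entropy(unambiguous), semantic_entropy(ambiguous))
```

A sentence whose samples all share one meaning scores zero, while samples split across readings score higher. This is the direction of the effect the abstract reports for ambiguous sentences.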

44. 【2605.15613】Toward LLMs Beyond English-Centric Development

Link: https://arxiv.org/abs/2605.15613

Authors: Sho Takase, Ukyo Honda

Categories: Computation and Language (cs.CL)

Keywords: large language models, biased toward English, open-weight large language, analysis of sequences, sequences generated

Abstract:Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.

45. 【2605.15609】PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Link: https://arxiv.org/abs/2605.15609

Authors: Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Categories: Computation and Language (cs.CL)

Keywords: Diffusion large language, Diffusion large, generate text, masked token sequences, text by iteratively

Comments: 16 pages

Abstract:Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.

46. 【2605.15607】Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

Link: https://arxiv.org/abs/2605.15607

Authors: Vinayshekhar Bannihatti Kumar, Disha Makhija, Manoj Ghuhan Arivazhagan, Rashmi Gangadharaiah

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: achieve high pass, remains poorly understood, high pass rates, code generation benchmarks, pretraining remains poorly

Abstract: Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.

47. 【2605.15604】VSPO: Vector-Steered Policy Optimization for Behavioral Control

Link: https://arxiv.org/abs/2605.15604

Authors: Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Modern language models, Modern language, secondary behavioral preferences, accommodating secondary behavioral, primary accuracy objective

Abstract:Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.

48. 【2605.15589】MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

Link: https://arxiv.org/abs/2605.15589

Authors: Weixin Liu, Congning Ni, Shelagh A. Mulvaney, Susannah L. Rose, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

Categories: Computation and Language (cs.CL)

Keywords: mental health domain, capture related biomedical, related biomedical knowledge, clinically salient structured, Large language models

Comments: Accepted to GEM 2026, ACL 2026 Workshop; 9 pages main text plus references and appendices

Abstract:Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.

49. 【2605.15588】Calibrating LLMs with Semantic-level Reward

Link: https://arxiv.org/abs/2605.15588

Authors: Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: requiring well-calibrated uncertainty, medical question answering, legal reasoning, requiring well-calibrated, well-calibrated uncertainty

Abstract: As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31% over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.
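
For readers unfamiliar with ECE, the calibration metric reported in this abstract, the sketch below shows the common equal-width-bin form of expected calibration error. It takes per-answer confidences as given; how CSR derives them is not specified here, and the bin count and function name are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy data: two confident correct answers, two less confident answers with one error.
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
print(round(ece, 3))
```

Lower values mean stated confidence tracks observed accuracy more closely. A model that is confident and wrong inflates the gap in its high-confidence bins.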

50. 【2605.15573】Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

Link: https://arxiv.org/abs/2605.15573

Authors: Nurbek Tastan, Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Samuel Horvath, Karthik Nandakumar

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: multiple Large Language, Large Language Model, Large Language, Language Model agents, multiple Large

Abstract:Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.

51. 【2605.15572】Measuring Maximum Activations in Open Large Language Models

Link: https://arxiv.org/abs/2605.15572

Authors: Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin

Categories: Computation and Language (cs.CL)

Keywords: stable LLM inference, first-order constraint, LLM inference, stable LLM, modern open LLMs

Abstract: The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at this https URL.
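
To make the link between maximum activation and low-bit error concrete, the sketch below applies symmetric per-tensor INT8 quantization with an absmax scale to a toy vector, with and without one very large value. It is an illustration under simplifying assumptions (per-tensor scale, pure Python, made-up numbers, non-zero input) and not the paper's sanity-check pipeline.

```python
def int8_roundtrip(values):
    """Symmetric per-tensor INT8 quantization with an absmax scale, then dequantization."""
    scale = max(abs(v) for v in values) / 127.0  # assumes at least one non-zero value
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


typical = [0.10, -0.20, 0.30, 0.05]
with_outlier = typical + [1000.0]  # hypothetical massive activation

err_typical = mse(typical, int8_roundtrip(typical))
# Compare only the shared entries so the outlier itself does not dominate the error.
err_outlier = mse(typical, int8_roundtrip(with_outlier)[:4])
print(err_typical < err_outlier)
```

Because the scale is set by the largest magnitude, a single outlier stretches the quantization grid and the small values collapse toward zero.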

52. 【2605.15562】GiLT: Augmenting Transformer Language Models with Dependency Graphs

Link: https://arxiv.org/abs/2605.15562

Authors: Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu

Categories: Computation and Language (cs.CL)

Keywords: Transformer Language Model, linguistic structures effectively, structures effectively enhances, augmenting Transformer language, Transformer Language

Abstract:Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at this https URL.

53. 【2605.15557】When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation

Link: https://arxiv.org/abs/2605.15557

Authors: De Shuai Zhang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: attractive for non-autoregressive, update all positions, Continuous diffusion, Continuous, continuous latent

Comments: 17 pages, 1 figure, 6 tables. Technical Report v1. Stage 1 complete; Stage 2 ongoing. Code: https://github.com/saslifat-gif/structured-latent-text-refinement

Abstract:Continuous diffusion and flow models are attractive for non-autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian-start experiments showed that good latent-space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents. With 768-dimensional latents, DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder-aware readout give modest additional gains, while metric learning and OT-style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.

54. 【2605.15532】DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

Link: https://arxiv.org/abs/2605.15532

Authors: Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna, Prithviraj Ammanabrolu, Yejin Choi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enables compact Vision-Language, Distillation enables compact, strong reasoning capabilities, compact Vision-Language Models, obtain strong reasoning

Abstract:Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.

55. 【2605.15529】Process Rewards with Learned Reliability

Link: https://arxiv.org/abs/2605.15529

Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Process Reward Models, single reward score, provide step-level feedback, Process Reward, Reward Models

Comments

Click to view abstract

Abstract:Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy--token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
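
As a rough illustration of the Beta-Binomial likelihood mentioned in the abstract, the sketch below scores an observed count of successful Monte Carlo continuations under a Beta belief. The function names, and the reading of `alpha + beta` as a reliability proxy, are assumptions made for illustration; the abstract does not specify BetaPRM's parameterization or training loss.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes out of n continuations
    when the step-success probability follows Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def belief_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Mean success probability and concentration of the Beta belief.
    Treating concentration as 'reliability' is an assumption of this sketch."""
    return alpha / (alpha + beta), alpha + beta


if __name__ == "__main__":
    # Two hypothetical beliefs with the same mean but different concentration,
    # scored against 6 successful continuations out of 8.
    for a, b in [(3.0, 1.0), (30.0, 10.0)]:
        mean, concentration = belief_summary(a, b)
        print(mean, concentration, beta_binomial_nll(6, 8, a, b))
```

Both hypothetical beliefs share the same mean, so a point-target regression would treat them identically, whereas the likelihood above also depends on how concentrated the belief is.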

56. 【2605.15518】DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

Link: https://arxiv.org/abs/2605.15518

Authors: Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, increasingly critical due, governance of Large, Language Model, Large Language

Comments: ACL 2026 Main

Click to view abstract

Abstract:The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.

57. 【2605.15514】RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Link: https://arxiv.org/abs/2605.15514

Authors: Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Rotary Positional Embeddings, Positional Embeddings, Rotary Positional, Transformer-based long-context language, identify intrinsic limitations

Comments: 35 pages, 11 figures, submitted to NeurIPS 2026

Click to view abstract

Abstract:We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
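
For readers unfamiliar with the mechanism under analysis, here is a minimal pure-Python sketch of standard RoPE: each consecutive pair of coordinates is rotated by an angle proportional to the token position, so the query-key score depends only on the relative offset. The helper names `rope` and `score` are illustrative, and the sketch shows only the textbook construction, not the paper's proofs or experiments.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate coordinate pairs (x0, x1), (x2, x3), ... of an even-length vector.
    Pair j is rotated by the angle pos * base ** (-2j / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 2j, so base ** (-i / d) == base ** (-2j / d)
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q: list[float], k: list[float], q_pos: int, k_pos: int,
          base: float = 10000.0) -> float:
    """Unnormalized attention score between a rotated query and a rotated key."""
    rq, rk = rope(q, q_pos, base), rope(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.0, 0.4, -0.6, 0.9]
    # Same relative offset (2) at two different absolute positions.
    print(score(q, k, 5, 3), score(q, k, 105, 103))
    # Larger offsets and a larger base can be explored the same way.
    print(score(q, k, 4000, 0), score(q, k, 4000, 0, base=1_000_000.0))
```

Varying the offset and the `base` argument gives a small-scale feel for the trade-off the abstract describes, although nothing here reproduces the paper's long-context results.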

58. 【2605.15508】STS: Efficient Sparse Attention with Speculative Token Sparsity

Link: https://arxiv.org/abs/2605.15508

Authors: Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, imposes severe memory, bottlenecks on Large, attention imposes severe

Comments: 14 pages, 12 figures
 
Click to view abstract

Abstract:The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on representative benchmark NarrativeQA, maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
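
The idea of turning draft-model attention scores into a token-and-head-wise mask can be sketched as a per-head top-k selection. This is a toy under stated assumptions: `build_sparsity_mask` and `keep_ratio` are hypothetical names, and the abstract does not say how STS selects tokens or maps draft heads to target heads.

```python
def build_sparsity_mask(draft_scores: list[list[float]],
                        keep_ratio: float = 0.1) -> list[list[bool]]:
    """For each head, keep the tokens with the highest draft attention scores.

    draft_scores[h][t] is a hypothetical attention weight that head h of the
    draft model assigns to context token t. Returns mask[h][t], where True
    means the target model should still attend to token t in head h.
    """
    mask = []
    for head_scores in draft_scores:
        n_tokens = len(head_scores)
        n_keep = max(1, round(n_tokens * keep_ratio))
        ranked = sorted(range(n_tokens), key=lambda t: head_scores[t], reverse=True)
        keep = set(ranked[:n_keep])
        mask.append([t in keep for t in range(n_tokens)])
    return mask


if __name__ == "__main__":
    # Two heads, four context tokens; keep half of the tokens per head.
    scores = [[0.10, 0.50, 0.05, 0.35],
              [0.40, 0.10, 0.30, 0.20]]
    for row in build_sparsity_mask(scores, keep_ratio=0.5):
        print(row)
```

A `keep_ratio` of 0.1 would correspond to roughly 90% sparsity, the operating point quoted in the abstract.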

59. 【2605.15482】FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Link: https://arxiv.org/abs/2605.15482

Authors: Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

Categories: Computation and Language (cs.CL)

Keywords: investment decision support, risk management, investment decision, decision support, financial

Comments: 21 pages, 10 tables, 2 figures

Click to view abstract

Abstract:Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.

60. 【2605.15467】Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction

Link: https://arxiv.org/abs/2605.15467

Authors: A H M Rezaul Karim, Ozlem Uzuner

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: scale remains challenging, Conversational nurse-patient transcripts, remains challenging, Conversational nurse-patient, structured representations

Comments

Click to view abstract

Abstract:Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus, combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2 with corresponding embedding models for RAG. Our best configuration uses GPT-5.2 with full schema, RAG, and a second-pass auditing, achieving 80.36% F1 score. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.

61. 【2605.15464】GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Link: https://arxiv.org/abs/2605.15464

Authors: Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: reinforcement learning, large language models, crucial step, step for unlocking, unlocking the capabilities

Comments

Click to view abstract

Abstract:Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about $46\times$ less data and $68\times$ less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: this https URL.

62. 【2605.15454】Reasoning Models Don't Just Think Longer, They Move Differently

Link: https://arxiv.org/abs/2605.15454

Authors: Anders Gjølbye, Lars Kai Hansen, Sanmi Koyejo

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: spend more tokens, chains of thought, Boolean satisfiability, Reasoning-trained language models, trajectory

Comments: Preprint

Click to view abstract

Abstract:Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.

63. 【2605.15440】Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis

Link: https://arxiv.org/abs/2605.15440

Authors: William Timkey, Brian Dillon, Tal Linzen

Categories: Computation and Language (cs.CL)

Keywords: Surprisal theory posits, predictability in context, offering a potential, theory posits, potential link

Comments

Click to view abstract

Abstract:Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints between humans and LMs. Here we test one such hypothesis, specifically, that LMs may be able to simultaneously consider a greater number of distinct sentence interpretations at once, compared to humans. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
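
To make the manipulation concrete, the toy sketch below computes a word's surprisal from the prefix probability mass of only the top-k parses kept in a beam. The function `beam_surprisal` and all probabilities are invented for illustration; the paper's actual setup uses RNNGs with word-synchronous beam search, which this does not implement.

```python
import math


def beam_surprisal(parses: list[tuple[float, float]], k: int) -> float:
    """Surprisal (in bits) of the next word when only k parses are tracked.

    Each parse is a pair (p_before, p_after): its joint probability with the
    sentence prefix before and after the word is read. The beam keeps the k
    parses that were most probable before the word (a simplifying assumption).
    """
    kept = sorted(parses, key=lambda p: p[0], reverse=True)[:k]
    mass_before = sum(p[0] for p in kept)
    mass_after = sum(p[1] for p in kept)
    return -math.log2(mass_after / mass_before)


if __name__ == "__main__":
    # Hypothetical garden-path configuration at the disambiguating word:
    # the initially preferred parse almost dies, the dispreferred one survives.
    parses = [(0.9, 0.0009),  # preferred reading, nearly ruled out by the word
              (0.1, 0.05)]    # dispreferred reading, compatible with the word
    for k in (1, 2):
        print(k, beam_surprisal(parses, k))
```

With these made-up numbers, the narrower beam yields the larger surprisal, which is the qualitative direction the abstract reports; the abstract's point is that the increase still falls well short of the human effect sizes.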

64. 【2605.15436】Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance

Link: https://arxiv.org/abs/2605.15436

Authors: Mahdi Naser-Moghadasi, Faezeh Ghaderi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: cognitive task categories, twelve cognitive task, distinct large language, examining their performance, paper presents

Comments: 8 pages, accepted at IEEE BigData 2025

Click to view abstract

Abstract:This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.

65. 【2605.15412】From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

Link: https://arxiv.org/abs/2605.15412

Authors: Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Zixuan Xie, Chiming Duan, Minghua He, Philip S. Yu, Ying Li

Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: extract predictive signals, Modern quantitative trading, large-scale financial data, trading increasingly relies, quantitative trading increasingly

Comments

Click to view abstract

Abstract:Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation--evaluation--feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed--time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.

66. 【2605.15404】Capability Conditioned Scaffolding for Professional Human LLM Collaboration

Link: https://arxiv.org/abs/2605.15404

Authors: Sen Yang, Yinglei Ma

Categories: Computation and Language (cs.CL)

Keywords: Large language model, typically adapts outputs, Large language, language model personalization, model personalization typically

Comments

Click to view abstract

Abstract:Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.

67. 【2605.15380】Eskwai for Students: Generative AI Assistant for Legal Education in Ghana

Link: https://arxiv.org/abs/2605.15380

Authors: George Boateng, Philemon Badu, Patrick Agyeman-Budu, Samuel Ansah, Evans Atompoya, Evan Igwilo, Lord Baah, Frederick Abu-Bonsrah, Victor Wumbor-Apin Kumbol

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: law students, legal education, Students, Recent advances, Global South

Comments: 10 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Click to view abstract

Abstract:Recent advances in generative AI have shown their potential to be leveraged for legal education. Yet, work on the development and deployment of such systems for legal education in the Global South is limited. In this work, we developed Eskwai for Students, a generative AI assistant to help law students with their legal education. Eskwai for Students is a retrieval augmented generation (RAG) system that provides answers to a wide range of legal questions for law students grounded in a curated database of over 12K case laws and 1.4K legislation in Ghana. We deployed Eskwai for Students in a longitudinal study of 30 months (2.5 years) used by 3.1K law students in Ghana who made 32K queries. We evaluated the helpfulness of our AI, and provided insight into the kinds of queries law students submit to this generative AI tool, which raises some ethical concerns. This work contributes to an understanding of how law students in the Global South are using generative AI for their studies and the ways it could be leveraged responsibly to advance legal education.

68. 【2605.15376】Adesua: Development and Feasibility Study of an AI WhatsApp Bot for Science Learning in West Africa

Link: https://arxiv.org/abs/2605.15376

Authors: George Boateng, Evans Atompoya, Philemon Badu, Samuel John, Samuel Ansah, Patrick Agyeman-Budu, Victor Wumbor-Apin Kumbol

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Sub-Saharan Africa faces, limiting students' access, Africa faces persistently, persistently high student-teacher, high student-teacher ratios

Comments: 11 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Click to view abstract

Abstract:Sub-Saharan Africa faces persistently high student-teacher ratios and shortages of qualified teachers, limiting students' access to personalized learning support and formative assessment. To address this challenge, we present Adesua, a WhatsApp-based AI Teaching Assistant for science education that extends the Kwame for Science platform. Adesua leverages WhatsApp's widespread adoption in Africa to provide accessible, curriculum-aligned learning support for Junior High School (JHS) and Senior High School (SHS) students across West Africa. The system integrates curated textbooks and 33 years of national examination questions with generative AI to enable conversational question answering and automated assessment with feedback via a WhatsApp bot. Students can ask science questions, take timed or untimed multiple-choice tests by topic or exam year, and receive instant grading and detailed explanations of correct and incorrect responses. A 6-month feasibility deployment in 2025 had 56 active users in Ghana, including students and parents. Quantitative evaluation showed a high perceived usefulness, with a helpfulness score of 93.75% for AI-generated answers, albeit with a small number of ratings (n=16). These preliminary results provide a basis for more extensive future evaluation of a WhatsApp-based AI assistant to assess its potential to offer scalable, low-cost personalized learning support and formative assessment in resource-constrained educational contexts.

69. 【2605.15365】Greedy or not, here I come: Language production under vocabulary constraints in humans and resource-rational models

Link: https://arxiv.org/abs/2605.15365

Authors: Thomas Hikaru Clark, Sihan Chen, Laura Nicolae

Categories: Computation and Language (cs.CL)

Keywords: challenging cognitive phenomenon, cognitive phenomenon, requiring an ideal, constrained lexicon, Sequential Monte Carlo

Comments

Click to view abstract

Abstract:Communicating using only a limited vocabulary is a common but challenging cognitive phenomenon, requiring an ideal communicator to plan carefully to optimize for intelligibility while circumventing a constrained lexicon. In this work, we investigate how humans respond to a broad array of questions under variable vocabulary limitations, consisting of only 250 highly frequent words at the most restrictive. We provide theoretically motivated comparisons to greedy and globally optimal sampling algorithms using Sequential Monte Carlo inference with large language models. Humans generally resemble greedy sampling more than globally optimal sampling, though more skilled humans are more likely to backtrack and revise -- a non-greedy behavior. An observed human pattern of leaning on semantically light words in high-constraint settings falls out of both greedy and globally optimal sampling. We discuss the results and their broader implications for resource-rational cognition, psycholinguistics, L2 communication, and language impairments.

70. 【2605.15362】Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering

Link: https://arxiv.org/abs/2605.15362

Authors: Volodymyr Ovcharov

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: billion citation edges, citation edges extracted, citation structure encodes, Half a billion, million Ukrainian court

Comments: 15 pages, 7 figures, 2 tables, 21 references

Click to view abstract

Abstract:Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 - 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.

71. 【2605.15334】From I/O to Code with Discovery Agent

Link: https://arxiv.org/abs/2605.15334

Authors: Yihong Dong, Jiaru Qian, Haoran Zhang, Peixu Wang, Binhua Li, Zhi Jin, Yongbin Li, Ge Li, Xiaokang Yang, Xue Jiang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: computer science, automatic synthesis, form of specification, specification is regarded, holy grail

Comments

Click to view abstract

Abstract:The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from concrete computational behavior, navigating a vast and underspecified hypothesis space. To address this, we propose DIO-Agent, a discovery agent for IO2Code. Our method frames IO2Code as an evolutionary search over discrete program space, in which an LLM serves as the mutation operator and concrete error signals from execution guide each mutation. To prevent the search from wandering into structurally complex yet incorrect dead ends, we introduce the Transformation Priority Premise as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, progressively escalating from constants to conditionals to iteration only when simpler constructs are insufficient. To facilitate systematic study, we further construct an IO2CodeBench spanning multiple difficulty levels. Extensive experiments show that DIO-Agent consistently outperforms both traditional program-by-example methods and SOTA evolution-agent baselines across all difficulty levels and various LLMs, while substantially surpassing test-time scaling strategies with equivalent sampling budgets.

72. 【2605.15315】Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

Link: https://arxiv.org/abs/2605.15315

Authors: Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He, Feiyang Cai, Ana S. Carreon-Rascon, Xuanzhao Dong, Feng Luo

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: reading repository files, budget reading repository, LLM-powered coding agents, coding agents spend, token budget reading

Abstract:LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.

73. 【2605.15304】DiscoExplorer: An Open Interface for the Study of Multilingual Discourse Relations

Link: https://arxiv.org/abs/2605.15304

Authors: Amir Zeldes

Categories: Computation and Language (cs.CL)

Keywords: Linguistics and Pragmatics, Computational Linguistics, relations connecting propositions, interest in Computational, DISRPT Shared Task

Abstract:The relations connecting propositions in discourse such as cause (A because B) or concession (A although B) are a subject of intense interest in Computational Linguistics and Pragmatics, but challenging to study and compare across languages. Recent progress in standardizing discourse relation inventories across datasets offers the potential to facilitate such studies, but is hindered by the complexity of relevant data and the lack of easily accessible interfaces to analyze it. In this paper we present DiscoExplorer, a new open source web interface, capable of running on local computers, which we use to make datasets from the DISRPT Shared Task on discourse relation classification publicly available, covering 16 different languages. We present the query language, search and visualization facilities for relations and signaling devices such as connectives, as well as some example studies.

74. 【2605.15298】PhysBrain 1.0 Technical Report

Link: https://arxiv.org/abs/2605.15298

Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan, Xiaolin Hu, Zhaolong Shen, Yuzhuo Miao, Haishan Liu, Yuxuan Tian, Yukun Shi, Cong Huang, Kai Chen

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: learning broad physical, provide limited coverage, models have advanced, advanced rapidly, limited coverage

Comments: Project Page: [https://phys-brain.github.io](https://phys-brain.github.io)

Abstract:Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.

75. 【2605.15282】Fluency and Faithfulness in Human and Machine Literary Translation

Link: https://arxiv.org/abs/2605.15282

Authors: Sarah Griebel, Ted Underwood

Categories: Computation and Language (cs.CL)

Keywords: requires balancing target-language, translation requires balancing, balancing target-language fluency, Literary translation requires, Google Translate

Comments: Accepted at NLP4DH 2026

Abstract:Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.

76. 【2605.15222】PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Link: https://arxiv.org/abs/2605.15222

Authors: Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)

Keywords: tasks remains limited, generate functionally correct, remains limited, generate functionally, ability to produce

Abstract: Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation is still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLM-generated code at this https URL.

77. 【2605.15221】Effective Harness Engineering for Algorithm Discovery with Coding Agents

Link: https://arxiv.org/abs/2605.15221

Authors: Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: combining large language, large language models, automated algorithm discovery, AlphaEvolve and FunSearch, FunSearch have demonstrated

Abstract:AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly, generating fewer algorithms while thinking more deeply about each one achieved higher scores. That is, scaling the quality of each individual is more budget-efficient than scaling the number of evolutionary generations. Surprisingly, more capable models produced evaluation hacks at higher rates, making hack detection increasingly necessary as models scale.

78. 【2605.15220】Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Link: https://arxiv.org/abs/2605.15220

Authors: Michael Y. Hu, Apurva Gandhi, Kyunghyun Cho, Tal Linzen, Pratyusha Sharma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Data mixing, Data, Data mixing decides, language model training, combine different sources

Abstract:Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.

79. 【2605.15202】DeepSlide: From Artifacts to Presentation Delivery

Link: https://arxiv.org/abs/2605.15202

Authors: Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: visually plausible deck, slide generators optimize, full presentation process, scholarly communication, plausible deck

Comments: 37 pages, 10 figures, 9 tables

Abstract:Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide--script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide--script synergy with clearer attention guidance.

80. 【2509.22151】MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

Link: https://arxiv.org/abs/2509.22151

Authors: Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: displacement maps, conductivity maps, including geometry, roughness and displacement, albedo and conductivity

Comments: Accepted at ICLR 2026 (poster)

Abstract:Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

Information Retrieval

1. 【2605.16217】Argus: Evidence Assembly for Scalable Deep Research Agents

Link: https://arxiv.org/abs/2605.16217

Authors: Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: information seeking tasks, achieved remarkable progress, complex information seeking, seeking tasks, Deep research

Abstract:Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

2. 【2605.16194】paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

Link: https://arxiv.org/abs/2605.16194

Authors: Arquimedes Canedo

Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: extracting reproducibility steps, LLM agents routinely, http URL, agents routinely serve, LLM agents

Abstract: LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run this http URL this http URL --against this http URL` passes. Repo: this https URL

3. 【2605.16120】MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos

Link: https://arxiv.org/abs/2605.16120

Authors: Anh-Tai Pham-Nguyen, Tung-Duong Le-Duc, Anh-Duy Le, Trung-Hieu Truong-Le

Categories: Information Retrieval (cs.IR)

Keywords: semantically grounded event, grounded event retrieval, online video platforms, video platforms drives, semantically grounded

Comments: Accepted to SOICT 2025

Abstract: The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all results for every query in the final round.

4. 【2605.16007】Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

Link: https://arxiv.org/abs/2605.16007

Authors: Fujun He, Chuyue Ye, Huaxiang Cai, Zetao Lv, Baolong Cui, Wenru Yan, Chao Zhan, Zigang Zhang, Hao Yi, Jie Xiang, Xiabing Li, Yuhang Gai, Ziyang Zhang, Pengfei Zheng, Yunfei Du

Categories: Information Retrieval (cs.IR)

Keywords: prohibitive computational overhead, memory bandwidth limitations, traditional CPU-based implementations, CPU-based implementations face, Neural Processing Units

Abstract: Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks for billion-scale corpora due to prohibitive computational overhead and memory bandwidth limitations. While Neural Processing Units (NPUs) offer orders-of-magnitude higher compute density, existing CPU/GPU-optimized 1-bit RaBitQ quantization implementations cannot be directly ported to NPU architectures due to fundamental hardware mismatches, and homogeneous design paradigms struggle to simultaneously balance accuracy, memory footprint, and performance. This paper presents Ascend-RaBitQ, the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search, built on the core insight that decoupling coarse ranking (NPU) from fine ranking (CPU) allows each stage to leverage its optimal hardware, breaking the long-standing accuracy-memory-performance trade-off. We propose a three-stage heterogeneous pipeline comprising AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device AI CPU Top-k processing, and host CPU fine re-ranking on full-precision vectors. We introduce four NPU architecture-native optimizations: fused AIC-AIV operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing that breaks query boundaries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask Top-k latency. Evaluation on standard datasets shows that Ascend-RaBitQ achieves 3.0× to 62.8× faster index construction than the CPU baseline, up to 4.6× throughput improvement over the fastest CPU IVF-RaBitQ implementation, and over 100× over the mathematically equivalent CPU baseline, while demonstrating encouraging scalability on distributed multi-NPU systems.
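The coarse-then-fine split described in the abstract above can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: it uses naive sign-bit quantization with Hamming distance instead of RaBitQ's actual estimator, runs entirely on the CPU, and has no IVF index. The function names (`sign_bits`, `two_stage_search`) and the parameter `n_candidates` are made up for the example.

```python
import random

def sign_bits(vec):
    """Quantize a vector to 1 bit per dimension (1 if the component is >= 0)."""
    return [1 if x >= 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions where two bit codes differ."""
    return sum(x != y for x, y in zip(a, b))

def sq_l2(a, b):
    """Exact squared Euclidean distance on full-precision vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_stage_search(query, corpus, codes, k, n_candidates):
    q_code = sign_bits(query)
    # Stage 1 (coarse): cheap ranking on the 1-bit codes only.
    coarse = sorted(range(len(corpus)),
                    key=lambda i: hamming(q_code, codes[i]))[:n_candidates]
    # Stage 2 (fine): exact re-ranking of the surviving candidates.
    fine = sorted(coarse, key=lambda i: sq_l2(query, corpus[i]))
    return fine[:k]

random.seed(0)
dim = 32
corpus = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(500)]
codes = [sign_bits(v) for v in corpus]      # built once, at index time
query = [x + 0.01 for x in corpus[42]]      # a near-duplicate of item 42
top = two_stage_search(query, corpus, codes, k=5, n_candidates=50)
```

The design point is that the first stage touches only compact codes for the whole corpus, while the expensive full-precision distance is computed for `n_candidates` vectors rather than all of them.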

5. 【2605.15905】Generative Long-term User Interest Modeling for Click-Through Rate Prediction

Link: https://arxiv.org/abs/2605.15905

Authors: Jiangli Shao, Kaifu Zheng, Hao Fang, Huimu Ye, Zhiwei Liu, Bo Zhang, Shu Han, Xingxing Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Modeling long-term user, enhances click-through rate, Modeling long-term, behaviors enhances click-through, click-through rate

Abstract: Modeling long-term user interests with massive historical user behaviors enhances click-through rate (CTR) prediction performance in advertising and recommendation systems. Typically, a two-stage framework is adopted, where a general search unit (GSU) first retrieves the top-$k$ behaviors relevant to the target item, and an exact search unit (ESU) generates interest features via tailored attention. However, the current target-centered GSU ignores other latent user interests, leading to incomplete and biased interest features. Additionally, the matching-based retrieval process in GSUs depends on the pairwise similarity score between the target item and each historical behavior, which not only becomes time-consuming for online services as user behaviors continue to grow, but also overlooks the interaction information among user behaviors. To address these problems, we propose a Generative Long-term user Interest model named GenLI for CTR prediction. GenLI consists of an interest generation module (IGM), a behavior retrieval module (BRM), and an interest fusion module (IFM). The IGM generates multiple interest distributions to indicate different aspects of real-time user interests; this generation is target-independent and incorporates interaction information among behaviors, ensuring complete and diverse interest features. The BRM selects related behaviors via a simple lookup operation, reducing the time complexity for weighting each behavior to $O(1)$. Finally, the IFM uses delicate gating mechanisms to generate interest features. Based on the generation process, GenLI improves the diversity of user interests and avoids complex matching-based behavioral retrieval, achieving a better balance between accuracy and efficiency for CTR prediction.

6. 【2605.15790】Fairness-Aware Retrieval Optimization for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.15790

Authors: Yingqi Zhao, Vasilis Efthymiou, Jyrki Nummenmaa, Kostas Stefanidis

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: incorporating external knowledge, large language models, improves reliability, external knowledge, generated outputs

Abstract:Retrieval-Augmented Generation (RAG) improves reliability of large language models by incorporating external knowledge, but the retrieval process can introduce bias that propagates to generated outputs. This issue is particularly challenging in top-k settings, where multiple documents jointly influence generation. We propose a fairness-aware retrieval framework that models and controls this bias. Our approach combines controlled bias injection via reranking, a position-aware model of bias propagation, and an optimization formulation that balances relevance and fairness. We further introduce a scalable solution based on Quadratic Fairness via Dual Hyperplane Approximation (FARO), which enables efficient optimization through problem decomposition. Experimental results show that our method effectively mitigates generation bias while preserving relevance. This work provides a principled approach for fairness-aware retrieval in RAG systems.

7. 【2605.15505】X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention

Link: https://arxiv.org/abs/2605.15505

Authors: Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: static information stores, True Lead Rate, False Lead Rate, Lead Rate, static information

Comments: 11 pages, 7 figures, 5 tables

Abstract:In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened [2, 52]. The prevailing approach [17, 31, 34, 36] retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual [5, 57, 61], present in behavioral patterns, absent from any retrieval index. For complex agentic tasks it breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in human attention, the digitally observable interaction signatures of each worker, encoding not just what they did but the sequence in which they did it, along with implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven qualitatively distinct attention filters: Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. On a sales lead identification task, a frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and human attention is its most reliable ground truth.

8. 【2605.15474】Jobs' AI Exposure Should Be Measured from Evidence, Not Model Priors

Link: https://arxiv.org/abs/2605.15474

Authors: Luca Mouchel, Pierre Bouquet, Yossi Sheffi

Categories: Information Retrieval (cs.IR)

Keywords: inferred from LLM, LLM priors, position paper argues, evidence-based methods, argues that job

Abstract: This position paper argues that job exposure to AI should be measured with grounded, evidence-based methods, not inferred from LLM priors alone. Current theoretical exposure measures use zero-shot prompting to classify task-level AI exposure, generating labels with no explicit evidence, no transparent chain of reasoning, and no external validation. The stakes of these measurements are too high to rely on such methods, as they influence policy making, where public and private funds are directed, and how workers understand their future prospects. We therefore argue that AI capability claims should meet three standards: reproducibility, external grounding, and inspectability. We propose a retrieval-augmented framework that assigns AI exposure labels to all 18,796 occupation-task pairs in O*NET 30.2, using open-weight reasoning and instruct models with retrieved news articles and academic paper abstracts as evidence of current AI capabilities. Relative to a zero-shot baseline, the grounded condition is preferred in over 72% of disagreement cases under both automatic and human evaluation, and yields scores that align more closely with observed real-world AI usage. Taken together, these findings suggest that evidence-grounded measurement better captures what current AI systems can plausibly do in practice, rather than what a model asserts without external evidence. Because AI capabilities continue to change, the measurements used to inform policy must evolve with them: theoretical AI exposure scores should be periodically reassessed, not inherited as immutable ground truth.

9. 【2605.15460】Differentially Private Motif-Preserving Multi-modal Hashing

Link: https://arxiv.org/abs/2605.15460

Authors: Zehua Cheng, Wei Dai, Jiahao Sun

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: compact binary codes, enables efficient retrieval, hashing enables efficient, binary codes, enables efficient

Comments: 9 pages

Abstract: Cross-modal hashing enables efficient retrieval by encoding images and text into compact binary codes. State-of-the-art methods rely on semantic similarity graphs derived from user interactions for supervision, yet these graphs encode sensitive behavioral patterns vulnerable to link reconstruction attacks. Existing privacy-preserving approaches fail on graph-structured data: Differentially Private SGD destroys relational motifs by treating samples independently, while graph synthesis methods suffer from unbounded local sensitivity in scale-free networks: hub nodes cause single-edge modifications to alter triangle counts by $\mathcal{O}(N)$, necessitating prohibitive noise injection. We term this phenomenon Hubness Explosion. We propose DMP-MH, a Sanitize-then-Distill framework that decouples privacy from representation learning. Our approach first bounds sensitivity by deterministically clipping node degrees, capping the $L_2$-sensitivity of triangle motifs independently of dataset size. A sanitized synthetic graph is then generated via Noisy Mirror Descent under $(\epsilon,\delta)$-Edge Differential Privacy. Finally, dual-stream hashing networks distill this topology using a holistic structural loss that enforces cross-modal alignment. Evaluated on MIRFlickr-25K and NUS-WIDE under a strict inductive protocol, DMP-MH outperforms private baselines by up to 11.4 mAP points while retaining up to 92.5% of non-private performance.
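To see why degree clipping bounds how much one edge can change the triangle count, consider the small sketch below. The clipping rule here (keep edges in sorted order while both endpoints are below `d_max`) is an assumption chosen for illustration; the abstract does not say which deterministic rule DMP-MH uses, and no noise or privacy mechanism is implemented. The helper names are hypothetical.

```python
from collections import defaultdict

def build_adj(edges):
    """Adjacency sets for an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def clip_degrees(edges, d_max):
    """Keep edges in sorted order while both endpoints have degree < d_max."""
    degree = defaultdict(int)
    kept = []
    for u, v in sorted(edges):
        if degree[u] < d_max and degree[v] < d_max:
            kept.append((u, v))
            degree[u] += 1
            degree[v] += 1
    return kept

def max_triangles_per_edge(edges):
    """Largest number of triangles any single edge participates in.

    Removing edge (u, v) destroys one triangle per common neighbor of u and v,
    so this is the worst-case change in triangle count from one edge."""
    adj = build_adj(edges)
    return max(len(adj[u] & adj[v]) for u, v in edges)

# Two hubs (nodes 0 and 1) that share 18 neighbors: edge (0, 1) sits in 18 triangles.
edges = [(0, i) for i in range(1, 20)] + [(1, i) for i in range(2, 20)]
d_max = 5
clipped = clip_degrees(edges, d_max)

before = max_triangles_per_edge(edges)
after = max_triangles_per_edge(clipped)
max_degree_after = max(len(nbrs) for nbrs in build_adj(clipped).values())
```

In the unclipped graph the worst-case edge grows with the number of nodes attached to the hubs. After clipping, two endpoints of an edge can share at most `d_max - 1` neighbors, so the bound no longer depends on graph size.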

10. 【2605.15362】Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering

Link: https://arxiv.org/abs/2605.15362

Authors: Volodymyr Ovcharov

Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: billion citation edges, citation edges extracted, citation structure encodes, Half a billion, million Ukrainian court

Comments: 15 pages, 7 figures, 2 tables, 21 references

Click to view abstract

Abstract:Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 - 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.

11. 【2605.15299】Fortress: A Case Study in Stabilizing Search Recommendations via Temporal Data Augmentation and Feature Pruning

Link: https://arxiv.org/abs/2605.15299

Authors: Milind Pandurang Jagre, Jia Huang, Dayvid V. R. Oliveira, Zhinan Cheng, Babak Seyed Aghazadeh, Puja Das, Chris Alvino, Jinda Han, Kailash Thiyagarajan

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: search and recommendation, input features introduce, recommendation systems, features, temporal instability

Comments

Click to view abstract

Abstract:In search and recommendation systems, predictive models often suffer from temporal instability when certain input features introduce volatility in output scores. This instability can degrade model reliability and user experience, especially in multi-stage systems where consistent predictions are critical for downstream decision making. We introduce Fortress, a general framework for enhancing model stability and accuracy by identifying and pruning features that contribute to inconsistent prediction scores over time. Fortress leverages historical snapshots (temporally partitioned datasets capturing score fluctuations for the same entity across periods) and follows a four-step process: (1) collect historical snapshots, (2) identify samples with unstable predictions, (3) isolate and remove instability-inducing features, and (4) retrain models using only stable features. While semantic features from LLMs and BERT-based models improve generalization, they often lack full query or entity coverage. Engagement-based features offer strong predictive power but tend to introduce temporal instability. Fortress mitigates this trade-off by suppressing the volatility of engagement signals while retaining their predictive value, leading to more stable and accurate models. We validate Fortress on a query-to-app relevance model in a large-scale app marketplace. Offline experiments demonstrate notable improvements in prediction stability (measured by Coefficient of Variation) and classification performance (measured by PR-AUC).
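The stability metric named in this abstract, the Coefficient of Variation (standard deviation divided by mean), can be illustrated with a short Python sketch. This is not the Fortress code: the snapshot layout, the entity keys, the threshold of 0.2, and the helper names `coefficient_of_variation` and `unstable_entities` are hypothetical choices used only to show what step (2), identifying samples with unstable predictions, could look like.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """CV = population standard deviation / |mean| of one entity's scores."""
    m = mean(scores)
    spread = pstdev(scores)
    if m == 0:
        # Undefined ratio: treat any spread around a zero mean as maximally unstable.
        return float("inf") if spread > 0 else 0.0
    return spread / abs(m)


def unstable_entities(snapshots, threshold):
    """Return the entities whose score CV across snapshots exceeds the threshold."""
    return sorted(
        entity
        for entity, scores in snapshots.items()
        if coefficient_of_variation(scores) > threshold
    )


# Hypothetical scores for the same (query, app) pair across four periods.
snapshots = {
    "query_a|app_1": [0.80, 0.82, 0.79, 0.81],  # stable over time
    "query_b|app_2": [0.10, 0.60, 0.05, 0.70],  # volatile over time
}

flagged = unstable_entities(snapshots, threshold=0.2)
print(flagged)
```

Entities flagged this way would then be the starting point for asking which input features drive the score swings, which is the part of the pipeline the abstract describes only at a high level.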

12. 【2605.15213】An LLM-RAG Approach for Healthy Eating Index-Informed Personalized Food Recommendations

Link: https://arxiv.org/abs/2605.15213

Authors: Yibin Wang, Yanjie Yang, Grace Melo Guerrero, Rodolfo M. Nayga Jr., Azlan Zahid

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: chronic disease risk, disease risk, leading determinant, determinant of chronic, chronic disease

Comments

Click to view abstract

Abstract:Diet quality is a leading determinant of chronic disease risk. Advances in artificial intelligence (AI) have enabled food recommendation systems to adapt suggestions to user preferences and health goals. However, most current systems rely on loosely curated food databases and provide limited connection to a validated index. In this study, we propose a Healthy Eating Index (HEI) informed retrieval-augmented generation (RAG) framework that combines standardized nutrition databases with large language models (LLMs) for personalized food recommendations. Our proposed method anchors retrieval in the National Health and Nutrition Examination Survey (NHANES) and the Food Patterns Equivalents Database (FPED). A food-level embedding space is constructed from FPED-derived textual descriptions. For each entity, the system computes baseline HEI scores, retrieves candidate foods for intake recommendations, and estimates the HEI impact of simple substitutions or additions. A constrained RAG pipeline instantiated with a pretrained OpenAI LLM generates personalized recommendations and sources based on nutrient profiles and HEI contributions. The simulation results showed a mean HEI improvement of 6.45, with the proportion of users with an HEI over 50 increasing from 45.12 to 61.26. Quantile analysis revealed consistent improvements across the HEI distribution. Our findings suggest that the proposed LLM-RAG-based AI systems can support more precise, explainable, and personalized nutrition guidance to improve diet quality.

13. 【2605.15203】Agent4POI: Agentic Context-Conditioned Affordance Reasoning for Multimodal Point-of-Interest Recommendation

Link: https://arxiv.org/abs/2605.15203

Authors: Jinze Wang, Yangchen Zeng, Tiehua Zhang, Lu Zhang, Yuze Liu, Yongchao Liu, Xingjun Ma, Zhu Sun

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: POI recommendation framework, static POI embeddings, POI embeddings pre-computed, embeddings pre-computed independently, recommendation time

Comments

Click to view abstract

Abstract:We introduce Agent4POI, the first POI recommendation framework that generates context-conditioned multimodal representations at recommendation time, rather than relying on static POI embeddings pre-computed independently of context. Existing multimodal systems encode each POI once as a static embedding, a design that precludes reasoning about why the same cafe affords solo work on Monday but group celebration on Friday evening. We formally prove that no pre-computed encoder can satisfy context-sensitive ranking under standard bilinear scoring, motivating inference-time item-side representation. Agent4POI inverts this computation: given a situational context, a four-phase LLM agent generates dynamic, context-specific affordance queries (Phase 1) and executes a five-step cross-modal chain-of-thought over image, review, and metadata evidence (Phase 2). The resulting uncertainty-aware affordance representation is grounded in Gibsonian affordance theory. These cross-modal verdicts form a structured, uncertainty-adjusted affordance representation (Phase 3), which is aligned with user preferences via a semantic caching system for low-latency ranking (Phase 4). On three POI benchmarks and three evaluation configurations (standard, cold-start, context-shift), Agent4POI achieves a 23.2% relative gain over the strongest baseline and degrades by only 7.5% under context-shift versus 16-17% for the strongest baselines. In cold-start scenarios, Agent4POI outperforms the best content-based baseline by up to 2.4x, whereas ID-based methods fail to generalize.

14. 【2605.15202】DeepSlide: From Artifacts to Presentation Delivery

Link: https://arxiv.org/abs/2605.15202

Authors: Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: visually plausible deck, slide generators optimize, full presentation process, scholarly communication, plausible deck

Comments: 37 pages, 10 figures, 9 tables

Click to view abstract

Abstract:Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide-script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide-script synergy with clearer attention guidance.

Computer Vision

1. 【2605.16258】IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

Link: https://arxiv.org/abs/2605.16258

Authors: Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: unposed multi-view images, Implicit Visual Geometry, Visual Geometry Transformer, Reconstructing coherent, computer vision

Comments: Code: [this https URL](https://github.com/wzzheng/IVGT/)

Click to view abstract

Abstract:Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D positions, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

2. 【2605.16241】Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Link: https://arxiv.org/abs/2605.16241

Authors: Jin Shi, Brady Zhang, Yishun Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real-time closed-loop control, recently shown impressive, shown impressive performance, cost remain major, remain major obstacles

Comments

Click to view abstract

Abstract:Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $\pi_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
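As a quick sanity check on the efficiency figures quoted in this abstract, the arithmetic can be reproduced in a few lines of Python. The only assumption is that "OpenVLA-7B" is read as a nominal 7 billion parameters; the exact teacher parameter count is not given here, and the implied teacher control rate is derived from the quoted numbers rather than reported in the abstract.

```python
# Figures quoted in the abstract.
student_params = 158e6   # 158M-parameter student
student_hz = 12.5        # student control rate on an RTX 4090
speedup = 3.28           # reported inference speedup over the teacher

# Assumption: the teacher is taken as a nominal 7e9 parameters ("7B").
teacher_params = 7e9

size_reduction = teacher_params / student_params   # roughly 44x
implied_teacher_hz = student_hz / speedup          # roughly 3.8 Hz

print(f"size reduction: {size_reduction:.1f}x")
print(f"implied teacher rate: {implied_teacher_hz:.2f} Hz")
```

Under that assumption the ratio rounds to the stated 44x, and the reported speedup implies the teacher runs at a little under 4 Hz on the same hardware.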

3. 【2605.16223】Evaluating Design Video Generation: Metrics for Compositional Fidelity

Link: https://arxiv.org/abs/2605.16223

Authors: Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative video models, design animation tasks, Generative video, models are increasingly, animation tasks

Comments

Click to view abstract

Abstract:Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field.

4. 【2605.16179】MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.16179

Authors: Piyush Tiwary, Utkarsh Ahuja, Depanshu Sani, Aishwarya Jayagopal, Sagar Gubbi, Subhashini Venugopalan, Alok Talekar, Vaibhav Rajan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, high intra-class variance, labeled training data, fragmented plots

Comments

Click to view abstract

Abstract:Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.

5. 【2605.16171】Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment

Link: https://arxiv.org/abs/2605.16171

Authors: Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-shot Generalist Anomaly, Generalist Anomaly Detection, Few-shot Generalist, Anomaly Detection requires, rapidly changing categories

Comments

Click to view abstract

Abstract:Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, we design Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP's residual space. The framework is developed from a residual perspective into three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at this https URL.

6. 【2605.16165】Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Link: https://arxiv.org/abs/2605.16165

Authors: Yishun Lu, Wes Armour

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Autoregressive next-token training, Autoregressive next-token, creates strong modality, strong modality competition, limits large-batch scaling

Comments

Click to view abstract

Abstract:Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.

7. 【2605.16147】Registers Matter for Pixel-Space Diffusion Transformers

Link: https://arxiv.org/abs/2605.16147

Authors: Nikita Starodubcev, Ilia Sudakov, Ilya Drobyshevskiy, Artem Babenko, Dmitry Baranchuk

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, problem effectively mitigated, high-norm patch-token outliers, problem effectively, effectively mitigated

Comments

Click to view abstract

Abstract:Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.

8. 【2605.16137】STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

Link: https://arxiv.org/abs/2605.16137

Authors: Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, Yanwei Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Generating simulation-ready tabletop, promising research direction, field of Embodied, Generating simulation-ready, simulation-ready tabletop scenes

Comments: ICML 2026

Click to view abstract

Abstract:Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

9. 【2605.16127】WeatherOcc3D: VLM-Assisted Adverse Weather Aware 3D Semantic Occupancy Prediction

Link: https://arxiv.org/abs/2605.16127

Authors: A. Enes Doruk, Abdelaziz Hussein, Hasan F. Ates

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: occupancy prediction typically, prediction typically enhances, typically enhances robustness, semantic occupancy prediction, occupancy prediction

Comments

Click to view abstract

Abstract:While multi-modal 3D semantic occupancy prediction typically enhances robustness by fusing camera and LiDAR inputs, its effectiveness is fundamentally constrained by environmental variability. Specifically, camera sensors suffer from severe low-light degradation, while LiDAR sensors encounter significant backscatter noise during heavy precipitation. These adverse conditions create a modality trust problem, as static fusion strategies fail to adaptively re-weight inputs when a specific sensor becomes unreliable. To address this, we propose a VLM-assisted framework leveraging the pre-trained CLIP latent space to guide multi-sensor integration via linguistic environmental cues. We utilize a parameter-efficient adapter to align weather-specific text embeddings with sensor features, coupled with a gating strategy that decomposes environmental uncertainty into two factors: visibility and illumination. This enables the model to dynamically modulate the fusion ratio, prioritizing semantic camera features in clear daylight and shifting to geometric LiDAR priors during rainy nights. Evaluations on the nuScenes dataset demonstrate the versatility of our approach, as implementing our proposed framework on the OccMamba and M-CONet architectures achieves mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their traditional baselines.

10. 【2605.16122】GenShield: Unified Detection and Artifact Correction for AI-Generated Images

Link: https://arxiv.org/abs/2605.16122

Authors: Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao, Shouhong Ding, Jian Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Diffusion-based image synthesis, raising urgent concerns, Diffusion-based image, AIGI detection, made AI-generated images

Comments

Click to view abstract

Abstract:Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore realistic appearance remains largely underexplored. Moreover, little existing work has established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step "diagnose-then-repair" correction with an explicit stopping criterion. A high-quality dataset with large-scale "artifact-restored" pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at this https URL.

11. 【2605.16090】A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Link: https://arxiv.org/abs/2605.16090

Authors: Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, JianFeng Ma

Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, Large vision-language, prompt injection, prompt injection attack, prompt

Comments

Click to view abstract

Abstract:Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model's interpretation of that single input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack, CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (for multimodal information integration and with $10^7$ parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, contrary to prior experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than at the end. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.

12. 【2605.16081】MIND: Decoupling Model-Induced Label Noise via Latent Manifold Disentanglement

Link: https://arxiv.org/abs/2605.16081

Authors: Dayong Ren

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: dominates data-hungry applications, automatic annotations driven, Models dominates data-hungry, Foundation Models dominates, data-hungry applications

Comments: Accepted, to appear in ICML2026

Click to view abstract

Abstract:The paradigm of learning from automatic annotations driven by pre-trained experts and Foundation Models dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.

13. 【2605.16080】ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

Link: https://arxiv.org/abs/2605.16080

Authors: Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: poses growing challenges, generalizable image forgery, generalizable image, AI-generated images, poses growing

Notes: Accepted by CVPR 2026

Abstract:The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning-text representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.

14. 【2605.16079】VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Link: https://arxiv.org/abs/2605.16079

Authors: Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large Vision-Language Models, shown significant progress, Large Vision-Language, requiring precise spatiotemporal, precise spatiotemporal localization

Notes: Project Page: [this https URL](https://gaotiexinqu.github.io/VideoSeeker/)

Abstract:Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

15. 【2605.16076】AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification

Link: https://arxiv.org/abs/2605.16076

Authors: Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Plant disease detection, extension workers eyeball, workers eyeball leaf, eyeball leaf samples, manual in Bangladesh

Notes:

Abstract:Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96-97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23%, a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization, work we have not yet completed.
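
The softmax-averaging step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the logits are made-up values for three hypothetical models over three classes, and the equal weighting is the only detail taken from the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average the per-model softmax outputs with equal weights.

    Returns (predicted class index, averaged probabilities).
    """
    probs = [softmax(l) for l in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

def relative_error_reduction(single_acc, ensemble_acc):
    """Fraction of the single-model error removed by the ensemble."""
    return 1.0 - (1.0 - ensemble_acc) / (1.0 - single_acc)

# Hypothetical logits: model 1 prefers class 0, models 2 and 3 prefer class 1.
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [0.2, 2.2, 0.1]]
label, avg = ensemble_predict(logits)
```

With a 97% single model and the reported 99.23% ensemble, `relative_error_reduction(0.97, 0.9923)` gives about 0.74; the exact figure depends on the per-model accuracies, which the abstract states only as a range.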

16. 【2605.16065】Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting

Link: https://arxiv.org/abs/2605.16065

Authors: Raushan Joshi, Jean-Yves Guillemaut

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Gaussian Splatting, reconstruction but lacks, Model High Quality, lacks robust segmentation, Splatting

Notes: Accepted at IEEE International Conference on Image Processing 2026, 6 pages

Abstract:3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.

17. 【2605.16022】EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting

Link: https://arxiv.org/abs/2605.16022

Authors: Changjing Liu, Yiming Huang, Long Bai, Beilei Cui, Hongliang Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: minimally invasive surgery, enhancing downstream tasks, high-fidelity dynamic endoscopic, advancing surgical outcomes, robot-assisted minimally invasive

Notes: Early Accepted by MICCAI 2026

Abstract:In robot-assisted minimally invasive surgery, high-fidelity dynamic endoscopic scene reconstruction and simulation are crucial to enhancing downstream tasks and advancing surgical outcomes. However, existing methods primarily focus on visual reconstruction, lacking physics-based descriptions of the scene required for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and physical simulation of endoscopic scenes through Multi-modal Large Language Models (MLLMs)-guided Gaussian Splatting. Our approach utilizes 4D Gaussian Splatting (4DGS) integrated with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To achieve automatic inference of physical properties, we introduce an object-wise material field that initializes material parameters via MLLM and refines them through a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on both open-source and in-house datasets, our framework achieves superior simulation fidelity and physical accuracy compared to state-of-the-art methods, underscoring its potential to advance robot-assisted surgical applications.

18. 【2605.16008】End-to-end plaque counting and virus titration from laboratory plate images with deep learning

Link: https://arxiv.org/abs/2605.16008

Authors: Eugenia Moris, Alicia Costábile, Sebastián Rey, Irene Ferreiro, Joaquín Hurtado, Lizandra Lissette Luciano, Matías Villagrán, Aisha Espino Vázquez, Jomari Ramos, Isadora Monteiro, María Victoria de Santiago, Pilar Moreno, Gonzalo Moratorio, José Ignacio Orlando

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gold standard readout, plaque assay images, Plaque assays remain, inter-operator variability, plaque assay

Notes:

Abstract:Plaque assays remain the gold standard readout of virus infectivity; however, plaque counting from plate images is labor-intensive and prone to inter-operator variability. We present an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The proposed approach combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells across heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and enumerates plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3, together with public Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform that allows users to review results and organize experiments. On held-out plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized across two plate formats (6-well and 12-well) and showed strong agreement with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were further compared with annotations from four independent experts, demonstrating high concordance. The proposed system will be open sourced and publicly released upon acceptance of this manuscript to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.
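
The PFU/mL step mentioned in the abstract follows the usual plaque-assay formula: plaque count divided by the product of dilution and plated volume. The sketch below assumes that standard formula; the function name, parameters, and example numbers are hypothetical and not taken from the paper's pipeline.

```python
def pfu_per_ml(plaque_count, dilution, inoculum_volume_ml):
    """Titer in plaque-forming units per milliliter.

    plaque_count: plaques counted in one well
    dilution: dilution of the sample plated in that well (e.g. 1e-5)
    inoculum_volume_ml: volume of diluted sample added to the well, in mL
    """
    if dilution <= 0 or inoculum_volume_ml <= 0:
        raise ValueError("dilution and volume must be positive")
    return plaque_count / (dilution * inoculum_volume_ml)

# Hypothetical well: 42 plaques at a 1e-5 dilution with 0.1 mL plated.
titer = pfu_per_ml(42, 1e-5, 0.1)  # about 4.2e7 PFU/mL
```

In practice a lab would typically average wells in the countable range across replicates before reporting a titer; that policy is outside this sketch.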

19. 【2605.16003】Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

Link: https://arxiv.org/abs/2605.16003

Authors: Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li, Guoxin Fan, Xiaokun Liu, Zhulin An, Libo Huang, Yongjun Xu, Chuanguang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autoregressive video diffusion, diffusion models enable, models enable open-ended, video diffusion models, enable open-ended generation

Notes:

Abstract:Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at this https URL.

20. 【2605.15997】Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

Link: https://arxiv.org/abs/2605.15997

Authors: Yuyuan Liu, Can Peng, Yingyu Yang, Qianye Yang, Cheng Ouyang, J. Alison Noble

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent progress, progress in deep, deep learning, learning has significantly, significantly advanced

Notes: 8 pages, 4 figures, submitted to IEEE Transactions on Medical Imaging (TMI)

Abstract:Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflow analysis that requires multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasoning. To progressively enhance localisation accuracy and semantic clarity, we further design a "closer-look" mechanism that allows the model to perform progressive coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, with gains of up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be available.

21. 【2605.15980】Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Link: https://arxiv.org/abs/2605.15980

Authors: Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Group Relative Policy, Group Relative, Relative Policy Optimization, aligning video diffusion, typically demands hundreds

Notes:

Abstract:Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding-window subsampling of training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.

22. 【2605.15967】Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

Link: https://arxiv.org/abs/2605.15967

Authors: Fabio Rovai

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)

Keywords: typed RDF triples, structured intervention vocabulary, represent agent state, study event-graph substrates, typed RDF

Notes: 10 pages, 3 figures, 2 tables

Abstract:We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.

23. 【2605.15964】WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

Link: https://arxiv.org/abs/2605.15964

Authors: Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: follow natural-language instructions, aerial VLN, Aerial vision-language navigation, follow natural-language, natural-language instructions

Notes:

Abstract:Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at this https URL.

24. 【2605.15961】Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Link: https://arxiv.org/abs/2605.15961

Authors: Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: CLIP demonstrate remarkable, Large-scale pre-trained vision-language, demonstrate remarkable zero-shot, CLIP demonstrate, remarkable zero-shot performance

Notes:

Abstract:Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: this https URL.

25. 【2605.15951】From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Link: https://arxiv.org/abs/2605.15951

Authors: Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Finetuning Large Vision-Language, Finetuning Large, Large Vision-Language Models, Large Vision-Language, promising approach

Notes: 8 pages, 5 figures, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract:Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse rewards, often criterion-induced, yield minimal learning signal when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at this https URL.
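
The failure mode the abstract points to (no learning signal when every candidate fails) is visible in the standard group-relative advantage used by GRPO. The sketch below shows only that baseline computation, assuming population standard deviation and a small `eps`; it does not implement the paper's revision or shaping steps, and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; some implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

# Hard case: every sampled response fails, so all rewards are equal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed case: one response succeeds and receives a positive advantage.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 0.0])
```

When all rewards in a group are identical, every advantage is zero and the policy gradient from that group vanishes, which is the gap the shaping signals are meant to fill.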

26. 【2605.15942】Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

Link: https://arxiv.org/abs/2605.15942

Authors: Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multiple semantic units, entangle multiple semantic, Decomposed Vision-Language Alignment, models often struggle, struggle to generalize

Notes:

Abstract:Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
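
One plausible reading of "aggregated in log-space" is a sum of log-similarities over the concept token and the attribute tokens, which equals the log of their product. The sketch below assumes similarities already lie in (0, 1] and clamps them with a hypothetical `eps`; the paper's actual scoring function may differ.

```python
import math

def log_space_score(similarities, eps=1e-8):
    """Sum of log-similarities; one weak token match lowers the whole score."""
    return sum(math.log(max(s, eps)) for s in similarities)

# Hypothetical per-token similarities: [concept, attribute 1, attribute 2].
full_match = log_space_score([0.9, 0.8, 0.85])
one_miss = log_space_score([0.9, 0.8, 0.05])
```

Summing logs keeps the score numerically stable when many tokens are multiplied, and each token's contribution can be read off as a separate additive term.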

27. 【2605.15923】Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction

Link: https://arxiv.org/abs/2605.15923

Authors: Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Dariu Gavrila, Holger Caesar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern image encoders, Modern image, decoupling semantic meaning, fully realized, image encoders achieve

Notes:

Abstract:Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to sampling resolution and scale changes, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust, structural invariants. The resulting encoder achieves significant performance gains during resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3× lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45% smaller model size and an average reduction of 40% in input tokens.

28. 【2605.15921】AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression

Link: https://arxiv.org/abs/2605.15921

Authors: Dingming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to eliminate, plausibly inpainting, inpainting the affected, Object removal aims, background content

Notes: Accepted by ICML 2026

Abstract:Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.

29. 【2605.15916】LoCO: Low-rank Compositional Rotation Fine-tuning

Link: https://arxiv.org/abs/2605.15916

Authors: An Nguyen, Jaesik Choi, Anh Tong

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: adapting large-scale foundation, Parameter-efficient fine-tuning, natural language processing, large-scale foundation models, critical technique

Notes: IJCAI 2026

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptations achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while preserving orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.
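
A short derivation may help show why a low-rank skew-symmetric matrix yields an orthogonal transformation. The parametrization below (the factors $A$ and $B$, the matrix exponential, and left-multiplying the pretrained weight $W$) is an illustrative assumption; the abstract does not spell out LoCO's exact construction or its approximation scheme.

$$
S = AB^\top - BA^\top,\quad A, B \in \mathbb{R}^{d \times r} \;\Rightarrow\; S^\top = -S,\quad \operatorname{rank}(S) \le 2r
$$

$$
Q = \exp(S) \;\Rightarrow\; Q^\top Q = \exp(S^\top)\exp(S) = \exp(-S)\exp(S) = I
$$

$$
Q_{\text{chain}} = Q_k \cdots Q_1,\quad Q_{\text{chain}}^\top Q_{\text{chain}} = I,\quad W' = Q_{\text{chain}} W
$$

Because each $Q_i$ is orthogonal, so is their product, and $W'$ keeps the norms and pairwise inner products of the columns of $W$. This is one concrete sense in which an orthogonal update preserves the geometric structure of pretrained representations.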

30. 【2605.15908】RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

Link: https://arxiv.org/abs/2605.15908

Authors: Yanhao Ge, Shanyan Guan, Weihao Wang, Ying Tai, Mingyu You

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Natural images, generative models synthesize, limiting resolution-flexible generation, discrete grids, limiting resolution-flexible

Notes:

Abstract:Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.

31. 【2605.15906】A Causally Grounded Taxonomy for Image Degradation Robustness Evaluation

Link: https://arxiv.org/abs/2605.15906

Authors: Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: altering visual appearance, affecting downstream vision, downstream vision tasks, occur during acquisition, altering visual

Notes:

Abstract:Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several communities, including synthetic corruption benchmarks for robustness evaluation, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these areas address closely related phenomena, they often use incompatible grouping schemes and backend specific severity definitions, making results difficult to compare across datasets, degradation sources, and tasks. We propose a causally grounded framework for organizing and interpreting image degradations across these settings. Instead of introducing new degradations or redefining existing benchmarks, we provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system), and its resulting perceptual effect. This dual axis abstraction yields a compact taxonomy spanning algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full reference image quality metrics: PSNR, SSIM, and LPIPS. This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework through COCO Degradation, a taxonomy aligned benchmark for evaluating object detector robustness under diverse imaging conditions.
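
The severity measurement idea can be sketched with PSNR alone. The example below is a minimal sketch, not the paper's measurement layer: the noise backend `add_gaussian_noise`, the severity-to-sigma map `native_sigmas`, and the random stand-in image are all hypothetical, and SSIM and LPIPS are left out because they need external libraries.

```python
import math
import random

def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)

def add_gaussian_noise(pixels, sigma, rng):
    """Hypothetical degradation backend: additive Gaussian noise clipped to [0, 255]."""
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

rng = random.Random(0)
# Stand-in for a flattened grayscale reference image.
reference = [rng.uniform(0, 255) for _ in range(10_000)]
# Assumed mapping from a backend's native severity level to its noise parameter.
native_sigmas = {1: 5.0, 2: 10.0, 3: 20.0, 4: 40.0, 5: 80.0}

# Measure each native severity level against the clean reference.
severity_psnr = {
    level: psnr(reference, add_gaussian_noise(reference, sigma, rng))
    for level, sigma in native_sigmas.items()
}
```

Reporting `severity_psnr` next to the native levels makes two backends comparable even when their level numbers mean different things.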

32. 【2605.15894】Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning

Link: https://arxiv.org/abs/2605.15894

Authors: Ranjith Chodavarapu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: health risk management, human health risk, Rapid and accurate, air quality modeling, emergency response

Comments:

Click to view abstract

Abstract:Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
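
For readers unfamiliar with evidential heads, the sketch below shows how vacuity falls out of Dirichlet concentration parameters in the commonly used formulation, where each parameter equals the non-negative evidence plus one. This is an assumption about the setup, not the paper's code; the evidence values are made up, and dissonance is omitted for brevity.

```python
def dirichlet_summary(evidence):
    """Return expected class probabilities and vacuity for non-negative evidence."""
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]    # Dirichlet concentration parameters
    strength = sum(alpha)                  # total Dirichlet strength S
    probs = [a / strength for a in alpha]  # expected class probabilities
    vacuity = k / strength                 # 1 with no evidence, shrinks as evidence grows
    return probs, vacuity

# Hypothetical evidence for the Light, Moderate, and Heavy classes.
confident_probs, confident_vacuity = dirichlet_summary([0.5, 1.0, 40.0])
uncertain_probs, uncertain_vacuity = dirichlet_summary([0.2, 0.3, 0.1])
```

A patch with little total evidence gets high vacuity, which is the quantity a selective-prediction rule could threshold on.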

33. 【2605.15880】FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization

Link: https://arxiv.org/abs/2605.15880

Authors: Tingting Liu, Yuan Liu, Guiping Chen, Xiubao Sui, Qian Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Thermal infrared imaging, Thermal infrared, making it important, all-weather perception, imaging is robust

Comments:

Click to view abstract

Abstract:Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all-weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible-light models. Existing infrared colorization methods mainly rely on single-band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single-band frameworks remain limited in modeling spatial-spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral-information-guided GAN framework. Within FSCM, a frequency-enhanced spatial-spectral state-space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state-space modeling captures global spatial-spectral dependencies; the frequency enhancement module (FEM) combines multi-level wavelet decomposition and Fourier gating to recover structural contours, directional high-frequency details, and global frequency responses; and the dual-stream hybrid gating module (DGM) integrates deformation-aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation-guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.

34. 【2605.15876】Unlocking Dense Metric Depth Estimation in VLMs

Link: https://arxiv.org/abs/2605.15876

Authors: Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: grounding and captioning, remain limited, Vision-Language Models, vision models, external vision models

Comments: Project Page: [this https URL](https://depthvlm.github.io/)

Click to view abstract

Abstract:Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.

35. 【2605.15868】SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval

Link: https://arxiv.org/abs/2605.15868

Authors: Wenjie Yang, Hang Yu, Yuyu Guo, Peng Di

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: contexts are interchangeable, multimodal retrieval works, address the critical, critical yet underexplored, underexplored challenge

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn an intersection mask for each image-text pair, allowing us to align the intersection while preserving the semantics of the differences. In the second stage, the learned mask is further used to construct positive and hard-negative samples by masking different parts of the image or text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.

36. 【2605.15864】Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Link: https://arxiv.org/abs/2605.15864

Authors: Chufan Shi, Cheng Yang, Yaokang Wu, Linhao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: produce self-reflective statements, produce self-reflective, check the figure, Vision-Language Models, self-reflective statements

Comments: ICML 2026 Spotlight

Click to view abstract

Abstract:Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: this https URL

37. 【2605.15860】On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry

Link: https://arxiv.org/abs/2605.15860

Authors: Michał Król, Michał Salamonowicz, Władysław Skarbek, Michał Tomaszewski

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate geometric calibration, low spatial resolution, low-cost thermal sensors, Accurate geometric, multimodal building envelope

Comments: 27 pages, 12 figures, 3 tables

Click to view abstract

Abstract:Accurate geometric calibration of RGB-thermal infrared (TIR) stereo camera systems is essential for multimodal building envelope analysis, yet remains challenging when low-cost thermal sensors with very low spatial resolution are employed. This paper presents a practical stereo calibration framework for an RGB camera (2028 x 1520 px) paired with a TIR camera operating at only 80 x 62 px - a pixel-count ratio of approximately 1:625. An active OLED screen dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB) on a single physical surface, providing controlled and repeatable thermal contrast. A dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation achieves reliable checkerboard detection at 80 x 62 px without per-frame parameter tuning. A baseline-constrained bundle adjustment enforces physically consistent rig geometry under the planar-calibration-object degeneracy, yielding a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.
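
The figures quoted in the abstract can be checked with a few lines of arithmetic. The snippet below only restates numbers from the abstract; the variable names are illustrative.

```python
# Sensor resolutions quoted in the abstract.
rgb_pixels = 2028 * 1520
tir_pixels = 80 * 62

# How many RGB pixels correspond to one TIR pixel by count.
pixel_ratio = rgb_pixels / tir_pixels

# Relative deviation of the estimated stereo baseline from its nominal value.
estimated_baseline_mm = 32.7
nominal_baseline_mm = 30.0
baseline_deviation = (estimated_baseline_mm - nominal_baseline_mm) / nominal_baseline_mm
```

The exact pixel-count ratio comes out slightly below 625, so the abstract's "approximately 1:625" is a rounded figure, and the baseline estimate is about 9% above nominal.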

38. 【2605.15855】Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

Link: https://arxiv.org/abs/2605.15855

Authors: Renye Yan, Jikang Cheng, Shikun Sun, Yi Sun, You Wu, Wei Peng, Zongwei Wang, Ling Liang, Junliang Xing, Yimao Cai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion models' reconstruction, models' reconstruction objectives, reconstruction objectives limit, objectives limit alignment, strong image-generation performance

Comments:

Click to view abstract

Abstract:Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.

39. 【2605.15852】GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

Link: https://arxiv.org/abs/2605.15852

Authors: Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe memory bottleneck, monocular video sequences, video sequences requires, sequences requires maintaining, long monocular video

Comments:

Click to view abstract

Abstract:Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at this https URL.
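
The general shape of budgeted token eviction with protected tokens can be sketched as follows. This is a generic top-k illustration under assumed inputs, not GHOST itself: the importance scores are invented numbers, whereas the paper derives them from the model's 3D geometry outputs, and the function name `evict_tokens` is hypothetical.

```python
def evict_tokens(scores, budget, protected=frozenset()):
    """Return the sorted indices of tokens kept in the cache.

    Protected tokens are always kept; the remaining budget goes to the
    highest-scoring unprotected tokens.
    """
    keep = {i for i in protected if i < len(scores)}
    candidates = sorted(
        (i for i in range(len(scores)) if i not in keep),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep.update(candidates[:max(0, budget - len(keep))])
    return sorted(keep)

# Hypothetical per-token importance scores; token 3 stands in for a special token.
scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.2]
kept = evict_tokens(scores, budget=3, protected={3})
```

Here the low-scoring token 3 survives because it is protected, and the two remaining slots go to tokens 0 and 4.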

40. 【2605.15843】WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes

Link: https://arxiv.org/abs/2605.15843

Authors: Jichen Hu, Jiawei Guo, Jiazhong Cen, Chen Yang, Sikuang Li, Wei Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modeling systems based, typically static monolithic, static monolithic assets, world modeling systems, generative scene synthesis

Comments: Project page: [this https URL](https://sjtu-deepvisionlab.github.io/WorldAct)

Click to view abstract

Abstract:Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.

41. 【2605.15835】Community-aware evaluation and threshold calibration for open-set plankton image recognition

Link: https://arxiv.org/abs/2605.15835

Authors: Xi Chen (1), Eryuan Huang (2), Yingjun Xiao (3), Gang Fang (4) ((1) School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China, (2) School of Environment, South China Normal University, Guangzhou, China, (3) School of Artificial Intelligence, Guangzhou University, Guangzhou, China, (4) Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China)

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated plankton image, inevitably encounter unseen, aquatic ecosystem monitoring, encounter unseen taxa, deployed classifiers inevitably

Comments: Manuscript. 14 figures/tables in total

Click to view abstract

Abstract:Automated plankton image recognition is increasingly used in aquatic ecosystem monitoring, but deployed classifiers inevitably encounter unseen taxa and non-target particles. Open-set recognition methods are usually evaluated with sample-level metrics such as AUROC, AUPR, and FPR@95% unknown-recall operating points, whereas ecological monitoring depends on community-level estimates of taxon abundance and diversity. This study examines the mismatch between these objectives using controlled pseudo-communities and three datasets spanning marine zooplankton imaged by ZooScan, marine phytoplankton imaged by IFCB, and freshwater plankton imaged by an in-situ camera. We define Open-Set Community Distortion (OSCD), a Bray-Curtis-style error over known taxa plus an unknown bin, with directional components distinguishing known-taxon overestimation from underestimation. Closed-set classifiers achieved high known-class accuracy, but unknown samples were often absorbed with high confidence and in structured ways. Sample-level OOD metrics were not sufficient to select ecological operating points: for MSP, FPR@95% unknown-recall thresholds produced large test-community OSCD on all three datasets mainly because true known taxa were over-rejected into the unknown bin. Community-aware threshold calibration reduced MSP OSCD relative to fixed 95% known recall on SYKE-ZooScan 2024 and SYKE-IFCB 2022; on ZooLake the fixed-recall baseline was already close to the community-aware threshold, and the best community-level method was a prototype-distance variant rather than MSP. The benefit of community-aware calibration therefore depends on validation-community representativeness and the gap between fixed recall and the community optimum. These results show that open-set plankton recognition should be evaluated as an ecological measurement problem, not only as a sample-level detection task.
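
A Bray-Curtis-style community error with an unknown bin might look like the sketch below. The abstract does not give the exact OSCD formula, so this uses the standard Bray-Curtis dissimilarity, and the split into over- and underestimation over known taxa is an assumed decomposition. The taxon names and counts are invented.

```python
def community_distortion(true_counts, pred_counts, unknown_bin="unknown"):
    """Return (Bray-Curtis dissimilarity, known-taxon over part, known-taxon under part)."""
    bins = sorted(set(true_counts) | set(pred_counts))
    total = sum(true_counts.get(b, 0) + pred_counts.get(b, 0) for b in bins)
    diffs = {b: pred_counts.get(b, 0) - true_counts.get(b, 0) for b in bins}
    bray_curtis = sum(abs(d) for d in diffs.values()) / total
    # Directional parts consider known taxa only (assumed convention).
    over = sum(d for b, d in diffs.items() if d > 0 and b != unknown_bin) / total
    under = sum(-d for b, d in diffs.items() if d < 0 and b != unknown_bin) / total
    return bray_curtis, over, under

# Hypothetical community: the classifier over-rejects known taxa into the unknown bin.
true_counts = {"copepod": 60, "diatom": 30, "unknown": 10}
pred_counts = {"copepod": 50, "diatom": 28, "unknown": 22}
bc, over, under = community_distortion(true_counts, pred_counts)
```

In this toy case all of the known-taxon error is underestimation, which mirrors the over-rejection pattern the abstract reports for fixed unknown-recall thresholds.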

42. 【2605.15828】Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

Link: https://arxiv.org/abs/2605.15828

Authors: Yipu Zhang, Jintao Cheng, Weilun Feng, Jiehao Luo, Chuanguang Yang, Zhulin An, Yongjun Xu, Wei Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Geometry Grounded, Geometry Grounded Transformer, multiple visual geometry, Visual Geometry, visual geometry tasks

Comments:

Click to view abstract

Abstract:Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under the 4-bit quantization.
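
The diagonal Fisher information used as a sensitivity signal can be illustrated on a tiny model. The sketch below computes the empirical diagonal Fisher (mean squared per-sample gradient of the log-likelihood) for logistic regression; the model, weights, and samples are assumptions for illustration and have nothing to do with VGGT or the FGQ calibration procedure.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def diagonal_fisher(weights, samples):
    """Empirical diagonal Fisher: mean squared per-sample log-likelihood gradient."""
    fisher = [0.0] * len(weights)
    for x, y in samples:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        grad = [(y - p) * xi for xi in x]  # d log p(y|x) / d w for logistic regression
        fisher = [f + g * g for f, g in zip(fisher, grad)]
    return [f / len(samples) for f in fisher]

# Hypothetical weights and labelled samples; the third feature is always zero.
weights = [0.5, -0.25, 0.0]
samples = [([1.0, 2.0, 0.0], 1), ([0.5, -1.0, 0.0], 0), ([2.0, 0.5, 0.0], 1)]
fisher = diagonal_fisher(weights, samples)

# Rank parameters from most to least sensitive.
sensitivity_order = sorted(range(len(fisher)), key=lambda j: fisher[j], reverse=True)
```

Parameters with larger entries in `fisher` change the loss more when perturbed, so a quantizer would want to protect them first; the unused third weight ends up last in `sensitivity_order`.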

43. 【2605.15824】FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Link: https://arxiv.org/abs/2605.15824

Authors: Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown significant commercial, Human-centric video customization, Human-centric video, shown significant, significant commercial

Comments: Project Page: [this https URL](https://quanjiansong.github.io/projects/FashionChameleon/)

Click to view abstract

Abstract:Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.

44. 【2605.15816】StippleDiffusion: Capacity-Constrained Stippling using Controlled Diffusion

Link: https://arxiv.org/abs/2605.15816

Authors: Ofir Gilad, Aleksander Plocharski, Przemyslaw Musialski, Andrei Sharf

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: per-density iterative optimizers, iterative optimizers, traditionally produced, produced by per-density, per-density iterative

Comments: 12 pages, 10 figures

Click to view abstract

Abstract:Stipple patterns, point sets whose local density tracks a target image, are traditionally produced by per-density iterative optimizers, which are slow, non-differentiable, and must be re-run from scratch for each new target. Learned alternatives have so far addressed only unconditional point generation; capacity-constrained, image-conditioned stippling has remained out of reach. We present the first diffusion-based sampler that simultaneously satisfies a learned local point-distribution prior and a continuous, image-defined capacity constraint at inference. The method is a ControlNet branch built on top of an optimal-transport-grid point-set diffusion baseline, conditioned on the target density map and a high-resolution image. Two design choices make the combination tractable: training and inference are restricted to the late-stage denoising regime, initialized from a density-weighted rejection sample, and the standard zero-convolution injection is replaced with a sigmoid-gated 1x1 projection that preserves the base model's blue-noise structure under hard density signals. A single trained checkpoint accepts arbitrary target densities at inference, generalizes to point budgets that were not seen during training, and produces stipples in time nearly independent of the output point count. On the Icons-50 benchmark, our learned sampler reaches parity with per-density-optimized baselines on every reported metric while remaining differentiable end-to-end.
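
The density-weighted rejection sample mentioned as the initialization can be sketched in a few lines. This is a textbook rejection sampler over a coarse density grid, written under the assumption that the density map is a small non-negative 2D array; it does not reproduce the paper's diffusion model or ControlNet branch, and the 2x2 map is invented.

```python
import random

def rejection_sample_points(density, num_points, rng):
    """Sample (x, y) points in the unit square with probability proportional to density."""
    rows, cols = len(density), len(density[0])
    peak = max(max(row) for row in density)
    points = []
    while len(points) < num_points:
        x, y = rng.random(), rng.random()
        cell = density[int(y * rows)][int(x * cols)]
        if rng.random() * peak < cell:  # accept with probability cell / peak
            points.append((x, y))
    return points

# Hypothetical 2x2 density map: the cell with x >= 0.5 and y >= 0.5 is ten times denser.
density = [[0.1, 0.1],
           [0.1, 1.0]]
points = rejection_sample_points(density, 2000, random.Random(0))
dense_fraction = sum(1 for x, y in points if x >= 0.5 and y >= 0.5) / len(points)
```

With this map the dense cell holds 1.0 / 1.3 of the total mass, so `dense_fraction` should land near 0.77.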

45. 【2605.15803】Embedding-perturbed Exploration Preference Optimization for Flow Models

Link: https://arxiv.org/abs/2605.15803

Authors: Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: established Reinforcement Learning, aligning generative models, Recent advancements, established Reinforcement, Reinforcement Learning

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
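
The variance-decay problem described in the abstract is easy to see in the group-normalized advantage commonly associated with GRPO-style training. The sketch below assumes that normalization (reward minus group mean, divided by group standard deviation) and uses invented reward values; it does not implement the proposed embedding perturbation.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sample group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for a diverse group and for a collapsed group.
diverse = group_advantages([0.1, 0.4, 0.7, 0.9])
collapsed = group_advantages([0.8, 0.8, 0.8, 0.8])
```

When every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is the stalled learning signal the abstract refers to.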

46. 【2605.15796】Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion

Link: https://arxiv.org/abs/2605.15796

Authors: Xiongjun Guan, Jianjiang Feng, Jie Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: avoiding contact-induced deformation, local ridge structure, global finger geometry, preserve global finger, contact-induced deformation

Comments:

Click to view abstract

Abstract:Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D--3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities.

47. 【2605.15792】Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

Link: https://arxiv.org/abs/2605.15792

Authors: Yujun Tong, Dongliang Chang, Zijin Yin, Xintong Liu, Yuanchen Fang, Zhanyu Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generation mutually enhance, long-standing goal, mutually enhance, visual generation mutually, visual

Comments: Accepted by CVPR 2026 Findings

Click to view abstract

Abstract:The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.

48. 【2605.15764】GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

链接https://arxiv.org/abs/2605.15764

作者:Junho Kim,Xu Cao,Houze Yang,Bikram Boote,Ana Jojic,Fiona Ryan,Bolin Lai,Sangmin Lee,James M. Rehg

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:current multimodal large, multimodal large language, Understanding social interactions, subtle non-verbal cues, large language models

备注: Project page: [this https URL](https://social-reaoning.github.io/grasp/)

点击查看摘要

Abstract:Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question--answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze--gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

49. 【2605.15760】Learn2Splat: Extending the Horizon of Learned 3DGS Optimization

链接https://arxiv.org/abs/2605.15760

作者:Naama Pearl,Stefano Esposito,Haofei Xu,Amit Peleg,Patricia Gschossmann,Lorenzo Porzi,Peter Kontschieder,Gerard Pons-Moll,Andreas Geiger

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, commonly performed, Adam, SGD, standard optimizers

备注

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at this https URL .

50. 【2605.15755】Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

链接https://arxiv.org/abs/2605.15755

作者:Cheng Zhang,Yuer Liu,Zhiyu Zhou,Hongxia Xie,Wen-Huang Cheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal large language, large language models, Multimodal large, fluent artwork emotion, visible formal attributes

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at this https URL
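
The abstract reports agreement with human-marked salient attributes under Dice and Tversky metrics. As a rough illustration only, the sketch below computes both scores on sets of attribute names. The attribute names, the `alpha`/`beta` weights, and the set-based formulation are assumptions made for this example and are not taken from the paper.

```python
def dice(predicted: set, human: set) -> float:
    """Dice coefficient between two attribute sets."""
    if not predicted and not human:
        return 1.0  # two empty selections agree perfectly
    return 2 * len(predicted & human) / (len(predicted) + len(human))


def tversky(predicted: set, human: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky index: alpha weights false positives, beta weights false negatives."""
    tp = len(predicted & human)   # attributes both sides marked salient
    fp = len(predicted - human)   # predicted salient, not marked by humans
    fn = len(human - predicted)   # marked by humans, missed by the model
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom


# Hypothetical attribute selections for one artwork.
predicted = {"color", "brushstroke", "composition", "line"}
human = {"color", "composition", "light"}

print(round(dice(predicted, human), 4))
print(round(tversky(predicted, human, alpha=0.7, beta=0.3), 4))
```

With `alpha = beta = 0.5` the Tversky index reduces to Dice; raising `alpha` penalizes over-selection more heavily, which is the failure mode the abstract calls attribute flooding.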

51. 【2605.15753】Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

链接https://arxiv.org/abs/2605.15753

作者:Xinggang Hu,Chenyangguang Zhang,Alexandros Delitzas,Xiangkui Zhang,Marc Pollefeys,Francis Engelmann,Xiangyang Ji

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:interactive elements, robotic manipulation, offer a versatile, versatile and flexible, flexible representation

备注

点击查看摘要

Abstract:Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture and lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, including a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

52. 【2605.15741】HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

链接https://arxiv.org/abs/2605.15741

作者:Yu He,Lichen Ma,Zipeng Guo,Xinyuan Shan,Jingling Fu,Dong Chen,Junshi Huang,Yan Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Variational Autoencoders, Pixel-space diffusion models, Pixel-space diffusion, bottleneck of Variational, diffusion models bypass

备注

点击查看摘要

Abstract:Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Rather than injecting semantics through AdaLN, HyperDiT utilizes Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucination and artifacts. Extensive experiments demonstrate that HyperDiT achieves state-of-the-art (SoTA) FID of $\mathbf{1.56}$ on ImageNet $256\times256$ directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.

53. 【2605.15737】BARRIER: Bounded Activation Regions for Robust Information Erasure

链接https://arxiv.org/abs/2605.15737

作者:Jan Miksa,Patryk Krukowski,Przemysław Spurek,Dawid Damian Rymarczyk,Marcin Sendera

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:critical bottleneck, Machine unlearning, reached a critical, Machine, Bounded Activation Regions

备注

点击查看摘要

Abstract:Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at this https URL.
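
The abstract describes enclosing a target region of activation space in a bounding hypercube and bounding the model response with Interval Arithmetic. The sketch below shows only the generic interval-arithmetic building block: exact output bounds of an affine function over an axis-aligned box, plus a box-membership test. The SVD projection step is omitted, and the function names, weights, and box limits are hypothetical values chosen for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def affine_interval_bounds(weights: List[float], bias: float,
                           lower: List[float], upper: List[float]) -> Tuple[float, float]:
    """Bounds of w.x + b over the box lower <= x <= upper (interval arithmetic)."""
    lo = bias
    hi = bias
    for w, l, u in zip(weights, lower, upper):
        if w >= 0:
            lo += w * l   # a non-negative weight is minimized at the lower corner
            hi += w * u
        else:
            lo += w * u   # a negative weight is minimized at the upper corner
            hi += w * l
    return lo, hi


def inside_box(point: List[float], lower: List[float], upper: List[float]) -> bool:
    """True if the point lies inside the axis-aligned box (the 'forget interval' here)."""
    return all(l <= x <= u for x, l, u in zip(point, lower, upper))


# Hypothetical 2-D box in a projected activation space.
lower, upper = [0.0, 0.0], [1.0, 2.0]
bounds = affine_interval_bounds([2.0, -1.0], 0.5, lower, upper)
print(bounds)
print(inside_box([0.5, 1.0], lower, upper), inside_box([1.5, 1.0], lower, upper))
```

For a single affine map these bounds are tight; through stacked nonlinear layers interval arithmetic generally yields looser but still sound bounds.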

54. 【2605.15736】BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

链接https://arxiv.org/abs/2605.15736

作者:Huanyang Tong,Kai Liu,Fangjun Kuang,Huiling Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Existing adaptation frameworks, shown remarkable promise, Language Models, Gated Cross-Modal Fusion, Biomedical Vision

备注: CVPR2026 Workshop

点击查看摘要

Abstract:Biomedical Vision--Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: this https URL. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

55. 【2605.15735】UAM: A Dual-Stream Perspective on Forgetting in VLA Training

链接https://arxiv.org/abs/2605.15735

作者:Jianke Zhang,Yuanfei Luo,Yucheng Hu,Xiaoyu Chen,Yanjiang Guo,Ziyang Liu,Hongbin Xu,Tian Lan,Jianyu Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:typically built, built by fine-tuning, language model, Dorsal Expert, language

备注

点击查看摘要

Abstract:Vision--language--action (VLA) models are typically built by fine-tuning a pretrained vision--language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over $95\%$ of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object--target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

56. 【2605.15733】Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

链接https://arxiv.org/abs/2605.15733

作者:Tianqiu Zhang,Muyang Lyu,Xiao Liu,Si Wu

类目:Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Humans abstract experiences, Humans abstract, facilitate pattern inference, experiences into structured, structured representations

备注: Project page: [this https URL](https://hpc-mec-worldmodel.github.io/)

点击查看摘要

Abstract:Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

57. 【2605.15728】DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

链接https://arxiv.org/abs/2605.15728

作者:Yifan Gao,Lu Zou,Zhangjin Huang,Guoping Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multi-category joint learning, joint learning problem, shared model parameters, fully shared model, model parameters

备注

点击查看摘要

Abstract:Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on these diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.
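
The abstract does not specify its gradient-based diagnostics. One common way to quantify gradient conflict in a shared module, assumed here purely for illustration, is the cosine similarity between per-category gradients, counting a pair as conflicting when the cosine is negative. The category names and gradient vectors below are made up.

```python
import math
from typing import Dict, List


def cosine(g1: List[float], g2: List[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0 or n2 == 0:
        return 0.0  # treat a zero gradient as neutral
    return dot / (n1 * n2)


def conflict_rate(grads: Dict[str, List[float]]) -> float:
    """Fraction of category pairs whose gradients point in opposing directions."""
    names = sorted(grads)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    conflicts = sum(1 for a, b in pairs if cosine(grads[a], grads[b]) < 0)
    return conflicts / len(pairs)


# Hypothetical per-category gradients for one shared module.
grads = {
    "bottle": [1.0, 0.0],
    "mug": [0.8, 0.6],
    "laptop": [-1.0, 0.1],
}
print(round(conflict_rate(grads), 4))
```

A high conflict rate in a given module would suggest that categories are competing for its parameters, which is the kind of contention the abstract proposes to isolate with group-specific branches.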

58. 【2605.15725】DiLA: Disentangled Latent Action World Models

链接https://arxiv.org/abs/2605.15725

作者:Tianqiu Zhang,Muyang Lyu,Yufan Zhang,Fang Fang,Si Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:inferring abstract actions, Latent Action, inferring abstract, consecutive frames, Action

备注: Project Page: [this http URL](http://disentangled-latent-action-world-models.github.io)

点击查看摘要

Abstract:Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

59. 【2605.15723】GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

链接https://arxiv.org/abs/2605.15723

作者:Xu Wang,Xunkai Li,Yinlin Zhu,Rong-Hua Li,Guoren Wang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:CLIP-style dual encoders, entities largely unused, isolated image-text pairs, dual encoders, leaving the relational

备注

点击查看摘要

Abstract:Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
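
To make the over-smoothing concern concrete, the sketch below runs a few steps of plain neighborhood averaging on a scalar graph signal and records the whole trajectory. This is generic graph signal smoothing, not GOMA's learned modality-aware operators; the toy graph, the mixing weight `alpha`, and the assumption that every node has at least one neighbor are all choices made for this example.

```python
from typing import Dict, List


def smooth_step(signal: List[float], neighbors: Dict[int, List[int]], alpha: float) -> List[float]:
    """One smoothing step: blend each node's value with the mean of its neighbors."""
    out = []
    for node, value in enumerate(signal):
        nbr_mean = sum(signal[j] for j in neighbors[node]) / len(neighbors[node])
        out.append((1 - alpha) * value + alpha * nbr_mean)
    return out


def smoothing_trajectory(signal: List[float], neighbors: Dict[int, List[int]],
                         alpha: float, steps: int) -> List[List[float]]:
    """Return the signal after 0, 1, ..., steps smoothing steps."""
    trajectory = [list(signal)]
    for _ in range(steps):
        trajectory.append(smooth_step(trajectory[-1], neighbors, alpha))
    return trajectory


# Toy path graph 0 - 1 - 2 with a signal that separates the two end nodes.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
trajectory = smoothing_trajectory([1.0, 0.0, -1.0], neighbors, alpha=0.5, steps=3)
spreads = [max(x) - min(x) for x in trajectory]
print(trajectory[1])
print(spreads)
```

The spread between nodes halves at every step in this toy case, so a deep enough propagation makes all nodes indistinguishable. Keeping the full trajectory is what allows a per-node choice of smoothing depth, which is the idea the abstract describes as adaptive readout.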

60. 【2605.15722】Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

链接https://arxiv.org/abs/2605.15722

作者:Jeonghwa Lim,Minje Park,Sunghoon Joo

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:meaningful waveform features, waveform features, cardiovascular diagnostics, Accurate delineation, crucial for cardiovascular

备注: 11 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Accurate electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios.

61. 【2605.15720】Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment

链接https://arxiv.org/abs/2605.15720

作者:Yuchen Li,Zhen Zhao,Yi Liu,Luping Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:making annotation costly, requires pixel-level masks, pixel-level masks aligned, requires pixel-level, making annotation

备注

点击查看摘要

Abstract:Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.

62. 【2605.15711】EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

链接https://arxiv.org/abs/2605.15711

作者:Xuanyu Ge,Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated remarkable capabilities, demonstrated remarkable, remarkable capabilities, Large Vision-Language Models, Large Vision-Language

备注: 20 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.
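
The two quantities named in the abstract, Tsallis entropy of an attention distribution and a Z-score anchored to reference values, can be sketched as follows. The entropic index `q = 2`, the toy attention weights, and the reference entropies are assumptions for illustration; the abstract does not state which layers are aggregated, which `q` is used, or what decision threshold is applied.

```python
import math
from statistics import mean, stdev
from typing import List


def tsallis_entropy(weights: List[float], q: float = 2.0) -> float:
    """Tsallis entropy S_q = (1 - sum(p_i^q)) / (q - 1) of normalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    if abs(q - 1.0) < 1e-12:
        # As q -> 1 the Tsallis entropy tends to the Shannon entropy.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def reference_z_score(value: float, reference_values: List[float]) -> float:
    """Z-score of a value relative to the mean and sample std of reference values."""
    return (value - mean(reference_values)) / stdev(reference_values)


# Toy attention over four visual tokens: spread out vs. collapsed onto one token.
uniform_entropy = tsallis_entropy([0.25, 0.25, 0.25, 0.25])
peaked_entropy = tsallis_entropy([1.0, 0.0, 0.0, 0.0])
print(uniform_entropy, peaked_entropy)

# Hypothetical entropies from reference (clean) models vs. a suspect model.
z = reference_z_score(0.40, [0.70, 0.72, 0.74])
print(round(z, 4))
```

A suspect model whose attention entropy sits many reference standard deviations away from the clean models would be flagged; where to place that cutoff is a design choice not covered by this sketch.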

63. 【2605.15708】3D Segmentation Using Viewpoint-Dependent Spatial Relationships

链接https://arxiv.org/abs/2605.15708

作者:Ayaka Nanri(1),Klara Reichard(2,3),Mert Kiray(2,4,5),Federico Tombari(2,6),Benjamin Busam(2,4,5),Asako Kanezaki(1,7,8) ((1) Institute of Science Tokyo, (2) Technical University of Munich, (3) BMW Group, (4) Munich Center for Machine Learning (MCML), (5) Obsphera, (6) Google, (7) Tohoku University, (8) RIKEN)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:improved natural language, greatly improved natural, Recent advances, scene understanding, natural language

备注

点击查看摘要

Abstract:Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.
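Why "left" and "front" need an observer can be shown with a small sketch that labels relations from a camera pose. It assumes a z-up, right-handed world frame and a camera described only by a position and a yaw angle; the function name and these conventions are illustrative choices, not the dataset's annotation code.

```python
import math


def observer_relations(cam_pos, cam_yaw, target, anchor, eps=1e-6):
    """Label `target` relative to `anchor` as seen from a camera.

    cam_pos, target, anchor: (x, y, z) points in a z-up world frame.
    cam_yaw: viewing direction as a rotation about the z axis, in radians.
    """
    forward = (math.cos(cam_yaw), math.sin(cam_yaw))
    right = (math.sin(cam_yaw), -math.cos(cam_yaw))

    def to_camera(p):
        dx, dy = p[0] - cam_pos[0], p[1] - cam_pos[1]
        lateral = dx * right[0] + dy * right[1]      # positive means to the right
        depth = dx * forward[0] + dy * forward[1]    # positive means farther away
        return lateral, depth

    t_lat, t_depth = to_camera(target)
    a_lat, a_depth = to_camera(anchor)

    relations = []
    if abs(t_lat - a_lat) > eps:
        relations.append("right" if t_lat > a_lat else "left")
    if abs(t_depth - a_depth) > eps:
        relations.append("front" if t_depth < a_depth else "behind")
    # Vertical relations do not depend on the viewpoint.
    dz = target[2] - anchor[2]
    if abs(dz) > eps:
        relations.append("above" if dz > 0 else "under")
    return relations


target, anchor = (2.0, 1.0, 0.0), (3.0, -1.0, 0.0)
print(observer_relations((0.0, 0.0, 0.0), 0.0, target, anchor))
print(observer_relations((5.0, 0.0, 0.0), math.pi, target, anchor))
```

The same object pair receives opposite left/right and front/behind labels from the two viewpoints, which is the ambiguity the abstract describes.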

64. 【2605.15689】How to Choose Your Teacher for Fine Grained Image Recognition

链接https://arxiv.org/abs/2605.15689

作者:Oswin Gosal,Edwin Arkel Rios,Augusto Christian Surya,Fernando Mikael,Bo-Cheng Lai,Min-Chun Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Fine-grained image recognition, image recognition classifies, recognition classifies subcategories, Fine-grained image, image recognition

备注: Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: this https URL.

65. 【2605.15684】ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

链接https://arxiv.org/abs/2605.15684

作者:Kunpeng Du,Haizhen Xie,Sen Lu,Lei Yu,Binglei Bao,Huaao Tang,Chuntao Liu,Hao Wu,Yang Zhao,Zhicai Huang,Heyuan Gao,Zhijun Tu,Jie Hu,Xinghao Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion Transformer, Transformer, Stable, Diffusion, high-fidelity image generation

备注

点击查看摘要

Abstract:The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16% average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8 of the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.

66. 【2605.15682】DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

链接https://arxiv.org/abs/2605.15682

作者:Qingji Dong,Hang Dong,Mingqin Chen,Rui Zhang,Yitong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large-scale pre-trained diffusion, powerful generative priors, real-world image Super-Resolution, Large-scale pre-trained, extensively adopted

备注

点击查看摘要

Abstract:Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from the LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also fail to generate detailed textures in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at this https URL.

67. 【2605.15681】DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer

链接https://arxiv.org/abs/2605.15681

作者:Nisha Huang,Yizhou Lin,Jie Guo,Xiu Li,Tong-Yee Lee,Zitong Yu

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:additional computational costs, feature misalignment, fine-tuning or complex

备注

点击查看摘要

Abstract:Recent diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks, but face challenges such as text dependency, additional computational costs, and feature misalignment. To address these limitations, we propose DealMaTe, which uses depth, normal, and lighting images for material transfer. DealMaTe is a simplified diffusion framework that eliminates text guidance and reference networks. We design a lightweight 3D information injection method, Multi-Dim 3D Shader LoRA, which, without modifying the base model weights, enables compatible control conditions and achieves harmonious and stable results. Additionally, we optimize the attention mechanism with Shader Causal Mutual Attention and key-value (KV) caching to reduce inference latency caused by multiple conditions, improve computational efficiency, and achieve high-quality material transfer results with low architectural complexity. Extensive experiments covering a wide variety of objects and lighting conditions consistently demonstrate that DealMaTe achieves remarkable high-fidelity material transfer under arbitrary input materials. The code is available at this https URL.

68. 【2605.15677】VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

链接https://arxiv.org/abs/2605.15677

作者:Xiaoyan Su,Peijie Dong,Zhenheng Tang,Song Tang,Yuyao Zhai,Kaitao Lin,Liang Chen,Gai Yuhang,Yuyu Luo,Qiang Wang,Xiaowen Chu

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:critical gap remains, controllable diagrammatic tasks, Vision-Language Models, diagrammatic tasks essential, controllable diagrammatic

备注: Accepted by ICML2026, 37 pages, 10 figures

点击查看摘要

Abstract:Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

69. 【2605.15672】VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

链接https://arxiv.org/abs/2605.15672

作者:Hyesoo Hong,Minsoo Kim,Wonje Jeung,Sangyeon Yoon,Dongjae Jeon,Albert No

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieve strong performance, lack robust control, basic visual operations, Vision-language models, achieve strong

备注

点击查看摘要

Abstract:Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

70. 【2605.15666】ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark

链接https://arxiv.org/abs/2605.15666

作者:Haozhe Si,Yuxuan Wan,Yuqing Wang,Minh Do,Han Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling material-level understanding, dense spectral information, Earth surface, enabling material-level, dense spectral

备注

点击查看摘要

Abstract:Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences (≥ 3 observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.

71. 【2605.15661】VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

链接https://arxiv.org/abs/2605.15661

作者:Yan Luo,Ahmadou Aidara,Jingyi Lu,Jeremy Moebel,Kai Han,Mengyu Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:entire ODE trajectory, standard practice holds, strongly text semantics, text semantics move, ODE trajectory

备注

点击查看摘要

Abstract:Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at this https URL.
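The abstract describes the adaptive scale only in words, so the following is a rough sketch of one way such a factor could look, not the paper's formula. The functional form `1 + alpha * progress * cosine`, the clamp bounds, and every parameter name are assumptions; `progress` is taken to run from 0 (pure noise) to 1 (final image).

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two flattened velocity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def adaptive_scale(base_scale, progress, v_a, v_b, alpha=0.5, f_min=0.5, f_max=1.5):
    """Multiply the nominal guidance scale by a bounded, alignment-aware factor.

    With alpha = 0 the factor is exactly 1, so fixed CFG is recovered.
    """
    signal = min(max(progress, 0.0), 1.0)            # temporal signal-level term (assumed linear)
    factor = 1.0 + alpha * signal * cosine_similarity(v_a, v_b)
    factor = min(max(factor, f_min), f_max)          # keep the factor bounded
    return base_scale * factor


def guided_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance combination of two velocities."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


v_uncond, v_cond = [1.0, 0.0], [1.0, 0.2]
scale = adaptive_scale(7.5, progress=0.8, v_a=v_uncond, v_b=v_cond)
print(scale, guided_velocity(v_uncond, v_cond, scale))
```

Both velocities are already available in an ordinary CFG step, which is consistent with the abstract's claim that no extra forward passes are needed.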

72. 【2605.15660】MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

链接https://arxiv.org/abs/2605.15660

作者:Nisha Huang,Henglin Liu,Yizhou Lin,Kaer Huang,Chubin Chen,Jie Guo,Tong-Yee Lee,Xiu Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:extra computational costs, including text dependency, Recent diffusion-based methods, face challenges including, challenges including text

备注

点击查看摘要

Abstract:Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.

73. 【2605.15640】Learning Disentangled Representations for Generalized Multi-view Clustering

链接https://arxiv.org/abs/2605.15640

作者:Xin Zou,Ruimeng Liu,Chang Tang,Zhenglai Li,Xinwang Liu,Kunlun He,Wanqing Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gained significant attention, leverage complementary information, diverse views, gained significant, significant attention

备注: accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)

点击查看摘要

Abstract:Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal Figures. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: this https URL.

74. 【2605.15621】LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

链接https://arxiv.org/abs/2605.15621

作者:Hongyu Lu,Feng Zhang,Wenwei Jin,Huanling Hu,Tianjun Shi,Shikai Jiang,Yao Hu,Jiawei Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:strong multimodal understanding, inference cost grows, cost grows rapidly, achieve strong multimodal, Large vision-language models

备注: The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations

点击查看摘要

Abstract:Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
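The scoring idea in this abstract (estimate a dominant low-rank subspace with PCA, then keep the tokens with the largest projection residual) can be sketched with NumPy. This is an illustration under assumptions, not the LRCP implementation: the function name, the use of a single SVD for PCA, and the `rank` and `keep` parameters are all hypothetical.

```python
import numpy as np


def select_tokens_by_residual(tokens, rank, keep):
    """Keep the `keep` tokens least explained by the top-`rank` PCA subspace.

    tokens: array of shape (num_tokens, dim).
    Returns the kept token indices (in original order) and all residual scores.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                                   # (rank, dim)
    projection = centered @ basis.T @ basis             # part explained by the subspace
    residual = np.linalg.norm(centered - projection, axis=1)
    kept = np.argsort(-residual)[:keep]
    return np.sort(kept), residual


# Toy data: 20 tokens on a line (the low-rank "background") plus one off-line token.
line = np.stack([np.arange(-10, 10, dtype=float), np.zeros(20), np.zeros(20)], axis=1)
outlier = np.array([[0.0, 3.0, 0.0]])
kept, residual = select_tokens_by_residual(np.vstack([line, outlier]), rank=1, keep=1)
print(kept)
```

In the toy data the off-line token has by far the largest residual, so it is the one retained, while tokens that the subspace already explains are pruned.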

75. 【2605.15618】Latent Video Prediction Learns Better World Models

链接https://arxiv.org/abs/2605.15618

作者:Ali J Alrasheed,Aryan Yazdan Parast,Basim Azam,James Bailey,Naveed Akhtar

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:remains largely confined, Self-supervised video models, evaluation remains largely, Self-supervised video, accuracy score

备注

点击查看摘要

Abstract:Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.

76. 【2605.15615】Neutral-Reference Prompting for Vision-Language Models

链接https://arxiv.org/abs/2605.15615

作者:Senmao Tian,Xiang Wei,Shunli Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Efficient transfer learning, Efficient transfer, Base-New Trade-off, commonly suffers, unseen classes

备注: Accepted at ICML 2026

点击查看摘要

Abstract:Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

77. 【2605.15599】Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study

链接https://arxiv.org/abs/2605.15599

作者:Alexander Hackett,Srikanth Thudumu,Ginny Fisher,Mahule Roy,Aisha Sartaj,Jason Fisher

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:selecting pretrained encoders, labeling is expensive, common in expert, principled guidance, guidance for selecting

备注: Presented at the 13th Workshop on Fine-Grained Visual Categorization (FGVC13) at CVPR 2026

点击查看摘要

Abstract:Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.

78. 【2605.15597】CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

链接https://arxiv.org/abs/2605.15597

作者:Jiale Liu,Jungang Li,Jieming Yu,Xinglin Yu,Zihao Dongfang,Zongjian Ding,Kaifeng Ding,Yi Yang,Lidong Chen,Yang Zou,Shunwen Bai,Jiahuan Zhang,Haoran Huang,Shan Huang,Yudong Gao,Mingjun Cheng

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:panoramic training interface, visual learning relies, point clouds, existing scans, training interface

备注: 35 pages including appendix. Code and dataset: https://github.com/Strange-animalss/CM-EVS

点击查看摘要

Abstract:Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and calibrated pose; COVER-produced indoor frames include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
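The greedy pattern described here (score each candidate view by incremental coverage, subtract a conflict penalty, pick the best, repeat) can be sketched as below. It is a simplified stand-in for COVER, not the method itself: surface geometry is reduced to sets of cell IDs, each candidate carries a fixed precomputed conflict cost rather than one derived from depth warping, and all names are hypothetical.

```python
def greedy_view_selection(candidates, budget, conflict_weight=1.0):
    """Greedily pick views that add the most new coverage after a conflict penalty.

    candidates: dict mapping view name -> (set of covered cell IDs, conflict cost).
    Stops when the budget is reached or no candidate has a positive score.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)

    def score(name):
        cells, conflict = remaining[name]
        return len(cells - covered) - conflict_weight * conflict

    while remaining and len(chosen) < budget:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break
        cells, _ = remaining.pop(best)
        covered |= cells
        chosen.append(best)
    return chosen, covered


views = {
    "a": ({1, 2, 3, 4}, 0.0),
    "b": ({3, 4, 5}, 0.0),
    "c": ({5, 6, 7, 8}, 5.0),   # covers a lot, but with a heavy conflict cost
    "d": ({1, 2}, 0.0),
}
print(greedy_view_selection(views, budget=3))
```

Greedy selection of this kind is the usual setting for coverage-style approximation guarantees, which is presumably what the abstract's remark about bounded proxy error refers to.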

79. 【2605.15592】Efficient Image Synthesis with Sphere Latent Encoder

链接https://arxiv.org/abs/2605.15592

作者:Tung Do,Thuan Hoang Nguyen,Hao Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:rapid progress, consistency and meanflow-based, reducing the number, number of sampling, meanflow-based methods significantly

备注: Technical report

点击查看摘要

Abstract:Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.

80. 【2605.15586】Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

链接https://arxiv.org/abs/2605.15586

作者:Tan-Ha Mai,Chao-Kai Chiang,Han-Hwa Shih,Gang Niu,Masashi Sugiyama,Hsuan-Tien Lin

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:weakly supervised paradigm, Complementary-label learning, weakly supervised, supervised paradigm, paradigm where instances

备注: 33 pages, 16 figures, 18 tables

点击查看摘要

Abstract:Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.

81. 【2605.15585】See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Link: https://arxiv.org/abs/2605.15585

Authors: Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including element overlap, Large language models, broken animation continuity, generate executable code, Large language

Comments: 21 pages, 4 figures

Abstract:Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.

82. 【2605.15584】AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

Link: https://arxiv.org/abs/2605.15584

Authors: Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao, Zhenan Sun, Qi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: zero-shot transfer capabilities, demonstrated remarkable zero-shot, remarkable zero-shot transfer, Vision-language models, transfer capabilities

Comments:

Abstract:Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
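
Moving a feature toward an anchor on the unit hypersphere can be sketched as spherical linear interpolation along the great-circle arc. This is an illustration under assumptions rather than the AGC method: the abstract does not say how the anchor is selected or how the adaptive step size is computed, so `geodesic_step` and its `step` parameter are hypothetical and the step is passed in by the caller.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def geodesic_step(feature, anchor, step):
    """Move `feature` toward `anchor` along the great circle by a fraction `step` in [0, 1]."""
    f, a = normalize(feature), normalize(anchor)
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(f, a))))
    theta = math.acos(cos_theta)
    # Aligned vectors need no correction; antipodal vectors have no unique geodesic.
    if theta < 1e-8 or math.pi - theta < 1e-8:
        return f
    sin_theta = math.sin(theta)
    w_f = math.sin((1.0 - step) * theta) / sin_theta
    w_a = math.sin(step * theta) / sin_theta
    return normalize([w_f * x + w_a * y for x, y in zip(f, a)])


# Halfway between two orthogonal unit vectors.
print(geodesic_step([1.0, 0.0], [0.0, 1.0], 0.5))
```

A step of 0 leaves the input feature unchanged and a step of 1 replaces it with the anchor, so the step size directly expresses the trade-off between preserving the clean feature and trusting the anchor.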

83. 【2605.15583】Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling

Link: https://arxiv.org/abs/2605.15583

Authors: Ryohei Goto, Takuya Fujihashi, Shunsuke Saruwatari, Fumio Okura

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human pose, single view, pose, human, multi-view ancestral sampling

Comments: International Conference on Automatic Face and Gesture Recognition (FG 2026), Oral

Abstract:We propose a method for estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we propose conditional multi-view ancestral sampling (cMAS), which optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: this https URL.

84. 【2605.15582】LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance

Link: https://arxiv.org/abs/2605.15582

Authors: Jiaxuan Zhao, Ali Bereyhi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern deep learning, Modern deep, represent task-relevant semantic, change detection, explicitly represent task-relevant

Comments: Accepted to IGARSS 2026. Code is available at: [this https URL](https://github.com/zjxyoyo/LDGuid)

Abstract:Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on the LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid to incorporate domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.

85. 【2605.15574】MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Link: https://arxiv.org/abs/2605.15574

Authors: Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Longitudinal chest X-ray, short-horizon image pairs, VQA benchmarks focus, existing medical VQA, interpretation requires reasoning

Comments: 33 pages

Abstract:Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at this https URL

86. 【2605.15561】RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

Link: https://arxiv.org/abs/2605.15561

Authors: Jiayan Yang, Zhuoyu Wu, Wenqi Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: facilitate medical visual, visual question answering, medical visual question, jointly interpreting images, facilitate medical

Comments: under revision

Abstract:Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.

87. 【2605.15546】3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds

Link: https://arxiv.org/abs/2605.15546

Authors: Bingwen Qiu, Yuan Liu, Junqi Bai, Tong Jiang, Ben Liang, Fangzhou Chen, Xiubao Sui, Qian Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cloud object detection, object detection lies, point cloud object, Hybrid Mamba Transformer, fundamental challenge

Comments:

Abstract:A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades the detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to utilize the SSM's linear complexity and advantages in long-sequence modeling to effectively capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, effectively alleviating the tension between receptive field enlargement and geometric preservation in long-range detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments conducted on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at this https URL.

88. 【2605.15536】SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Link: https://arxiv.org/abs/2605.15536

Authors: Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Previous imitation learning, contact-rich operation phases, imitation learning policies, learning policies predict, policies predict future

Comments:

Abstract:Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15–40% while matching or improving success rates across various policy backbones. Project page: this https URL.
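
The action relabeling rule described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the key/skip partition is already given as a boolean mask (the abstract attributes that step to MSK, which is not reproduced here); the function name `relabel_actions` is hypothetical, and leaving trailing skip steps unchanged when no key segment follows is an assumption, since the abstract does not cover that case.

```python
def relabel_actions(actions, is_key):
    """Return behavior-cloning targets after skip-segment relabeling.

    actions: per-timestep demonstration actions.
    is_key:  per-timestep booleans, True inside key segments.
    """
    if len(actions) != len(is_key):
        raise ValueError("actions and is_key must have the same length")
    targets = list(actions)
    next_entry = None  # index of the first step of the next key segment
    # Scan backward so that, on reaching a skip step, `next_entry` already
    # points at the entrance of the key segment that follows it.
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            next_entry = t
        elif next_entry is not None:
            targets[t] = actions[next_entry]
        # Assumption: skip steps with no later key segment keep their own action.
    return targets


demo_actions = ["a0", "a1", "a2", "a3", "a4", "a5"]
demo_is_key = [False, False, True, True, False, False]
print(relabel_actions(demo_actions, demo_is_key))
# → ['a2', 'a2', 'a2', 'a3', 'a4', 'a5']
```

Both skip steps before the key segment receive the target `a2`, the action at the segment's entrance, which is what allows a policy trained on these targets to jump to the next key segment in one decision.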

89. 【2605.15535】Learning Dynamic Structural Specialization for Underwater Salient Object Detection

Link: https://arxiv.org/abs/2605.15535

Authors: Lin Hong, Chenhui Wang, Linan Deng, Yuning Cui, Yu Zhang, Xin Wang, Bojian Zhang, Wenqi Ren, Xingchen Yang, Fumin Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision-guided robotic applications, attracted increasing attention, visual scene understanding, underwater visual scene, salient object detection

Comments: 15 pages

Abstract:Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.

90. 【2605.15533】Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

Link: https://arxiv.org/abs/2605.15533

Authors: Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Video editing poses, significant challenge, poses a significant, Noise Initialization Strategy, Video editing

Comments: Accepted by ICIP 2026

Abstract:Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We also introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.

91. 【2605.15523】Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

Link: https://arxiv.org/abs/2605.15523

Authors: Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, Ting Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving surrounding background, Scene text editing, surrounding background style, aims to modify, preserving surrounding

Comments: ICML 2026

Abstract:Scene text editing aims to modify text in a target region of an image while preserving surrounding background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features in the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by pre-trained glyph encoders limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), it achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: this https URL.

92. 【2605.15519】DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

Link: https://arxiv.org/abs/2605.15519

Authors: Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: direct aerial, exploration and pinpoint, Visual active search, leverages visual cues, modeling framework

Comments: 26 pages, 12 figures, Accepted to AAMAS 2026

Abstract:Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

93. 【2605.15497】AnyAct: Towards Human Reenactment of Character Motion From Video

Link: https://arxiv.org/abs/2605.15497

Authors: Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: study the problem, problem of directly, directly deriving, motion, human

Comments: 12 pages

Abstract:We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

94. 【2605.15496】LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields

Link: https://arxiv.org/abs/2605.15496

Authors: Dongjae Lee, Wooseong Yang, Yifu Tao, Maurice Fallon, Ayoung Kim

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: distance fields offer, Neural distance fields, making them attractive, distance fields, fields offer

Comments: accepted at RA-L 2026

Abstract:Neural distance fields offer a compact and continuous representation of 3D geometry, making them attractive for incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this issue, but existing methods typically rely on passive replay buffers and uniform sampling, which can waste memory on redundant observations and under-train poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves both replay retention and replay allocation during online updates. LAPS combines reliability-based active pooling to retain reliable historical samples under limited memory with uncertainty-guided active sampling to focus optimization on under-constrained regions. Experiments on synthetic and real-world benchmarks show that LAPS consistently improves reconstruction completeness while maintaining competitive geometric accuracy. On Oxford Spires, it improves recall by 4.66 pp and F1-score by 3.79 pp over PIN-SLAM on the Blenheim Palace 05 sequence. We release our open source implementation at: this https URL.

95. 【2605.15492】FLASH: Efficient Visuomotor Policy via Sparse Sampling

Link: https://arxiv.org/abs/2605.15492

Authors: Jiaqi Bai, Jindou Jia, Yuxuan Hu, Gen Li, Xiangyu Chen, Tuo An, Kuangji Zuo, Jianfei Yang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time robotic control, visuomotor policy learning, iterative denoising incurs, denoising incurs high, Generative models

Comments: 19 pages, 10 figures

Abstract:Generative models such as diffusion and flow matching have become dominant paradigms for visuomotor policy learning, yet their reliance on iterative denoising incurs high inference latency incompatible with real-time robotic control. We present Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, FLASH enables a single inference to cover a significantly extended action horizon. To further accelerate generation, FLASH initiates the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. Moreover, analytic polynomial differentiation directly provides desired velocity feed-forward signals to the torque controller without numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks demonstrate that FLASH achieves state-of-the-art success rates (≥ 92% across all tasks), a per-episode inference time of 31.40 ms (up to 175× faster than diffusion policies and 18× faster than prior flow matching policies), up to 4× faster training convergence than ACT, and a 5× to 7× reduction in controller tracking error compared to discrete-action baselines.
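
The idea of reading a velocity feed-forward signal directly from a Legendre-series trajectory can be sketched with the standard recurrences. This is an illustration only: the coefficient layout, the mapping of time onto [-1, 1], and the function name `legendre_position_velocity` are assumptions, and the sketch covers a single scalar joint rather than the paper's policy.

```python
def legendre_position_velocity(coeffs, t, horizon):
    """Evaluate q(t) = sum_n c_n P_n(x) and dq/dt, with x = 2 * t / horizon - 1.

    coeffs:  non-empty list of Legendre coefficients c_0, c_1, ...
    t:       time in [0, horizon].
    horizon: trajectory duration (assumed mapping of [0, horizon] onto [-1, 1]).
    """
    x = 2.0 * t / horizon - 1.0
    p_prev, p_curr = 1.0, x    # P_0(x), P_1(x)
    d_prev, d_curr = 0.0, 1.0  # P_0'(x), P_1'(x)
    pos = coeffs[0] * p_prev
    vel = 0.0
    if len(coeffs) > 1:
        pos += coeffs[1] * p_curr
        vel += coeffs[1] * d_curr
    for n in range(1, len(coeffs) - 1):
        # Bonnet recurrence: (n + 1) P_{n+1} = (2n + 1) x P_n - n P_{n-1}
        p_next = ((2 * n + 1) * x * p_curr - n * p_prev) / (n + 1)
        # Derivative identity: P_{n+1}' = P_{n-1}' + (2n + 1) P_n
        d_next = d_prev + (2 * n + 1) * p_curr
        pos += coeffs[n + 1] * p_next
        vel += coeffs[n + 1] * d_next
        p_prev, p_curr = p_curr, p_next
        d_prev, d_curr = d_curr, d_next
    # Chain rule: dx/dt = 2 / horizon.
    return pos, vel * 2.0 / horizon


# Pure P_2 trajectory, q = (3 x^2 - 1) / 2, sampled at x = 0.5.
print(legendre_position_velocity([0.0, 0.0, 1.0], t=1.5, horizon=2.0))
# → (-0.125, 1.5)
```

Because the velocity comes from the same coefficients as the position, no finite differencing between discrete actions is needed, which is the property the abstract credits for the feed-forward signal.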

96. 【2605.15487】Learning Normalized Energy Models for Linear Inverse Problems

Link: https://arxiv.org/abs/2605.15487

Authors: Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: provide powerful prior, powerful prior probability, existing implementations suffer, Generative diffusion models, introduce sampling biases

Comments: ICML 2026

Abstract:Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) the prior density is represented implicitly, and (ii) they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine-tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.

97. 【2605.15484】When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Link: https://arxiv.org/abs/2605.15484

Authors: Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: favorable accuracy-compute trade-offs, networks promise favorable, promise favorable accuracy-compute, practical vision deployments, efficiency gains

Comments: 24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: [this https URL](https://github.com/libophd/sparse-moe-vision-rho)

Abstract:Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-k routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing (k ≥ 2) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-k (holding architecture, initialization, and ρ fixed) reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: this https URL.
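
Sparse top-k routing with a hard per-expert capacity can be sketched as follows. This is a generic illustration, not the paper's implementation: the first-come-first-served handling of overflow, the function names, and the toy scores are all assumptions, and `routed_fraction` simply restates the abstract's ρ as routed FLOPs divided by total FLOPs.

```python
def route_top_k(scores, k, capacity):
    """Assign each token to its k highest-scoring experts, subject to capacity.

    scores[i][e] is the router score of token i for expert e.
    Returns (assignments, dropped): assignments[e] lists the tokens expert e
    processes; dropped lists (token, expert) pairs rejected because the
    expert was already full (assumed first-come-first-served policy).
    """
    num_experts = len(scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for token, row in enumerate(scores):
        top = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:k]
        for expert in top:
            if len(assignments[expert]) < capacity:
                assignments[expert].append(token)
            else:
                dropped.append((token, expert))
    return assignments, dropped


def routed_fraction(routed_flops, total_flops):
    """Fraction of total FLOPs spent in routed (expert) layers, the abstract's rho."""
    return routed_flops / total_flops


toy_scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.0, 0.3],
]
print(route_top_k(toy_scores, k=1, capacity=2))
# → ([[0, 1], [], []], [(2, 0)])
print(route_top_k(toy_scores, k=2, capacity=2))
# → ([[0, 1], [0, 1], [2]], [(2, 0)])
```

In the toy example all three tokens prefer expert 0, so with k = 1 the other experts stay idle and one token is dropped, while k = 2 spreads the load across all experts.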

98. 【2605.15477】EgoExo-WM: Unlocking Exo Video for Ego World Models

Link: https://arxiv.org/abs/2605.15477

Authors: Danny Tran, Roberto Martín-Martín, Kristen Grauman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherent partial observability, humans' physical actions, Egocentric world models, Egocentric world, predict and plan

Comments:

Abstract:Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

99. 【2605.15475】A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation

Link: https://arxiv.org/abs/2605.15475

Authors: Haijian Lai, Bowen Liu, Man Xu, Chan-Tong Lam, João Macedo, Benjamin Ng, Sio-Kei Im

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Fully Connected Weighted, transposed Fully Connected, Connected Weighted, Fully Connected, empowered transposed Fully

Comments: Accepted for publication in IEEE Transactions on Multimedia

Abstract:We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation to embed point clouds into a metric space. While the original t-FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability remained unclear. In this work, we analyze the properties that make the empowered and original t-FCW effective and design a network that uses the empowered t-FCW exclusively as feature extractors. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t-FCW. Our analysis reveals that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can function both as a lightweight standalone baseline and as a complementary plug-in to existing deep models.

100. 【2605.15466】Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

Link: https://arxiv.org/abs/2605.15466

Authors: Santosh Kumar Paidi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning predictive world, Joint Embedding Predictive, Learning predictive, Embedding Predictive Architectures, artificial intelligence

Comments: 12 pages, 4 figures

Abstract:Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.

101. 【2605.15458】Video Models Can Reason with Verifiable Rewards

Link: https://arxiv.org/abs/2605.15458

Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made rapid progress, remain primarily optimized, made rapid, rapid progress, remain primarily

Comments: Website: [this https URL](https://darthzhu.github.io/VideoRLVR-page/)

Abstract:Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.

102. 【2605.15450】RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects

Link: https://arxiv.org/abs/2605.15450

Authors: Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Concealed Object Segmentation, transparent object detection, camouflaged object detection, including camouflaged object, object detection

Comments:

Abstract:Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ *heterogeneous* decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: **homogeneous image decomposition** via Retinex theory, which factorizes an image into illumination and reflectance components within the *same* spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does *not* necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the **Discriminability Gap Theorem**. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose **RIDE** comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
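For readers unfamiliar with Retinex theory, the classical image-formation model it rests on can be written as below. This is the textbook formulation only; the symbols $I$, $R$, $L$ are our notation, and the paper's learned, task-driven decomposition may differ from it.

$$
I(x) = R(x) \odot L(x), \qquad \log I(x) = \log R(x) + \log L(x)
$$

Here $I$ is the observed image, $R$ the reflectance, $L$ the illumination, and $\odot$ the per-pixel product. Both factors live on the same pixel grid as $I$, which is the sense in which the abstract calls the decomposition homogeneous.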

103. 【2605.15430】Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones

Link: https://arxiv.org/abs/2605.15430

Authors: Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: ideal perch location, vision-guided autonomous tree-perching, autonomous tree-perching drones, locate an ideal, ideal perch

Comments: Work in progress version accepted to the Recent Advances in Robotic Perception for Forestry

Abstract:This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.

104. 【2605.15424】Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

Link: https://arxiv.org/abs/2605.15424

Authors: Po-Chien Luan, Wuyang Li, Yang Gao, Alexandre Alahi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Human trajectory forecasting, crowded environments, Human trajectory, crucial for safe, safe navigation

Comments:

Abstract:Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: this https URL

105. 【2605.15423】MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

Link: https://arxiv.org/abs/2605.15423

Authors: Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Modern smart vision, process video streams, video object detection, Modern smart, smart vision sensors

Comments:

Abstract:Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at this https URL
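The abstract mentions aggregating confidence scores across frames with probability union rules but does not spell out the rule. As a rough illustration only, the sketch below assumes the per-frame detections of a track are independent, so the union probability for a class is 1 - prod(1 - p_i). The function names `union_confidence` and `rescore_track` are hypothetical and are not taken from the paper or its code.

```python
from math import prod


def union_confidence(scores):
    """Probability that at least one of several independent detections is correct.

    Assumption (ours, not the paper's): per-frame confidences are independent,
    so the union is 1 - prod(1 - p_i).
    """
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence out of range: {p}")
    return 1.0 - prod(1.0 - p for p in scores)


def rescore_track(per_frame_scores):
    """Pick one class label for a whole track.

    per_frame_scores: list of {class_name: confidence} dicts, one per frame
    in which the tracked object was detected.
    Returns (best_class, aggregated_confidence).
    """
    by_class = {}
    for frame in per_frame_scores:
        for name, p in frame.items():
            by_class.setdefault(name, []).append(p)
    aggregated = {name: union_confidence(ps) for name, ps in by_class.items()}
    best = max(aggregated, key=aggregated.get)
    return best, aggregated[best]
```

Under this assumption, a class that is detected repeatedly with moderate confidence can outrank a class seen once with a slightly higher score, which is one way a tracker could correct per-frame misclassifications.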

106. 【2605.15421】U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration

Link: https://arxiv.org/abs/2605.15421

Authors: Michael Smith, Frank P. Ferrie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: explore in depth, under-studied topics, uncertainty estimation, uncertainty, uncertainty estimates

Comments: Accepted to CVPR Findings Track 2026

Abstract:In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.

107. 【2605.15398】3DEditSafe: Defending 3D Editing Pipelines from Unsafe Generation

Link: https://arxiv.org/abs/2605.15398

Authors: Nicole Meng, Zheyuan Liu, Meng Jiang, Yingjie Lao

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, Recent advances, achieved high-fidelity, manipulation from text, unsafe

Comments:

Abstract:Recent advances in 3D generative editing, particularly pipelines based on 3D Gaussian Splatting (3DGS), have achieved high-fidelity, multi-view-consistent scene manipulation from text prompts. However, we find that these pipelines also introduce new safety risks when unsafe prompts produce edits that are propagated and optimized across views. In this work, we study unsafe generation in 3D editing pipelines and show that such behavior can lead to coherent, undesirable Not-Safe-For-Work (NSFW) content in the final 3D representation. To address this, we propose 3DEditSafe, a safety-regularized 3D editing framework that constrains unsafe semantic propagation during optimization. 3DEditSafe combines generation-stage safety guidance with rendered-view 3D safety regularization, safe semantic projection, residue suppression, and mask-aware preservation to steer optimization away from unsafe editing directions. We evaluate our approach on EditSplat scenes using an object-compatible unsafe prompt benchmark and show that 2D safety guidance alone is not consistently sufficient to prevent unsafe 3D edits. 3DEditSafe reduces unsafe semantic alignment and view-level attack success rates, while revealing a safety-quality tradeoff in which stronger unsafe suppression can introduce artifacts or reduce unsafe-prompt fidelity. To our knowledge, this work is the first attempt to study and defend against unsafe generation in text-driven 3D editing pipelines, highlighting the need for safety mechanisms that operate directly on optimized 3D representations.

108. 【2605.15397】ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest

Link: https://arxiv.org/abs/2605.15397

Authors: Kangning Cui, Surendra Bohara, Suraj Prasai, Zishan Shao, Wei Tang, Martin Pillaca, Edwin Flores, Zhen Yang, Gregory Larsen, Evan Dethier, David Lutz, Jean-Michel Morel, Miles Silman, Victor Pauca, Fan Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-term ecosystem disruption, fine spatial scales, Illegal gold mining, Amazon rainforest, water contamination

Comments: 70 pages, 35 figures, 28 tables

Abstract:Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.

109. 【2605.15391】PanoWorld: Geometry-Consistent Panoramic Video World Modeling

Link: https://arxiv.org/abs/2605.15391

Authors: Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: panoramic video, video world model, generates geometry-consistent, single image, video

Comments:

Abstract:We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at this https URL.

110. 【2605.15383】MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays

Link: https://arxiv.org/abs/2605.15383

Authors: Emre Hayir, Lorin Crawford, Alex X. Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Microscopy images, respond to perturbations, drug screening, rich information, applications like drug

Comments:

Abstract:Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at this https URL.

111. 【2605.15375】ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing

Link: https://arxiv.org/abs/2605.15375

Authors: Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Remote sensing change, Remote sensing, aims to localise, sensing change detection, Remote

Comments:

Abstract:Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is a generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimation that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: this https URL
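The sampling-based ensembling described in the abstract can be pictured with a small sketch. The version below is only one plausible reading: it assumes each sample is a binary mask, aggregates by per-pixel mean followed by thresholding, and takes agreement as the fraction of samples on the majority side. The function name `ensemble_masks` and the `threshold` parameter are our own placeholders; the paper may aggregate differently.

```python
def ensemble_masks(samples, threshold=0.5):
    """Aggregate several sampled binary change masks.

    samples: non-empty list of masks, each a list of rows of 0/1 values,
             all with the same shape.
    Returns (mask, agreement):
      mask      - per-pixel vote, 1 where the mean prediction >= threshold
      agreement - per-pixel fraction of samples on the majority side,
                  in [0.5, 1.0]; values near 0.5 flag ambiguous pixels
    """
    if not samples:
        raise ValueError("need at least one sampled mask")
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    mean = [
        [sum(s[i][j] for s in samples) / n for j in range(width)]
        for i in range(height)
    ]
    mask = [[1 if v >= threshold else 0 for v in row] for row in mean]
    agreement = [[max(v, 1.0 - v) for v in row] for row in mean]
    return mask, agreement
```

With three samples that all mark a pixel as changed, the agreement there is 1.0. A pixel marked by only one of the three gets a "no change" vote with agreement 2/3, which is the kind of region the abstract describes as ambiguous.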

112. 【2605.15368】Discretizing Group-Convolutional Neural Networks for 3D Geometry in Feature Space

Link: https://arxiv.org/abs/2605.15368

Authors: Daniel Franzen, Jean Philip Filling, Michael Wand

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: Group-convolutional neural networks, Group-convolutional neural, neural networks, deep learning, linear layer

Comments: 11 pages, 7 figures, 2 tables

Abstract:Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: In each linear layer, GCNNs sample a transformation group $G$ densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to $G$. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.

113. 【2605.15342】Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Link: https://arxiv.org/abs/2605.15342

Authors: Arsha Nagrani, Jasper Uijilings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: core component, embodied agents, reasoning, Abstract, models

Comments:

Abstract:Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at this https URL.

114. 【2605.15326】Multimodal Object Detection Under Sparse Forest-Canopy Occlusion

Link: https://arxiv.org/abs/2605.15326

Authors: Nitik Jain, Mangal Kothari

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: difficult remote-sensing challenge, remote-sensing challenge due, Airborne Optical Sectioning, forest canopy remains, beneath forest canopy

Comments:

Abstract:Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible-thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible-thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of approximately 0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.

115. 【2605.15325】COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection

Link: https://arxiv.org/abs/2605.15325

Authors: Darryl Cherian Jacob, Xinyu Liu, Kai Wang, Pan He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: providing interpretable predictions, Vision-language models, interpretable predictions, providing interpretable, video anomaly detection

Comments: Manuscript currently under review for publication

Abstract:Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at this https URL

116. 【2605.15320】FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

链接https://arxiv.org/abs/2605.15320

作者:Thuan Hoang Nguyen,Jiahao Luo,Yinyu Nie,Hao Li,Gordon Guocheng Qian,Jian Wang

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:limits scalability, traditionally relied, relied on per-subject, requires hours, hours of computation

备注: Project Page: [this https URL](https://ffavatar.github.io)

点击查看摘要

Abstract:Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.

117. 【2605.15312】Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA

链接https://arxiv.org/abs/2605.15312

作者:Sieun Park,Yuanmo He

类目:Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)

关键词:labels remain underexplored, Large-scale facial datasets, cultural biases embedded, Large-scale facial, remain underexplored

备注: 15 pages, 8 figures

点击查看摘要

Abstract:Large-scale facial datasets like CelebA are widely used in computer vision, yet the cultural biases embedded in their labels remain underexplored. Fairness research has distinguished representational from allocational harms, but audits of computer vision datasets have mostly examined categorical labels, leaving open how such harms appear in learned features and model attention. This paper examines CelebA at three levels: dataset structure, learned feature weights, and spatial attention, focusing on how gendered double standards of ageing and beauty are encoded in the data and reproduced in model behaviour. First, hierarchical clustering of 202,599 images shows that the 39 attributes organise into latent trait bundles aligned with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Female faces, though more often rated attractive overall, incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for females. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing. Older males attain the highest accuracy but the lowest average precision, indicating categorical exclusion of groups outside the dataset's evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and exclusion of older men from the scheme entirely. Fairness metrics focused on performance disparities mask both, underscoring the need to address representational harm in fairness research.

118. 【2605.15309】One Pass Is Not Enough: Recursive Latent Refinement for Generative Models

链接https://arxiv.org/abs/2605.15309

作者:Mehdi Esmaeilzadeh,Alexia Jolicoeur-Martineau,Chirag Vashist,Ke Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remarkable progress, FID, mode coverage, competitive FID, image generation

备注

点击查看摘要

Abstract:Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to being saturated. Yet a model can still exhibit mode collapse while achieving a low FID, since a handful of sharp, near-duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID, and that because FID is already saturated, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce RTM, which replaces the single-pass latent mapping in style-based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state-of-the-art approaches while maintaining competitive FID, with improvements across CIFAR-10, CelebA-HQ at 256x256, and nine few-shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2-ADA on CIFAR-10 and AFHQ-v1 at 512x512, demonstrating that the benefit is not specific to IMLE. Unlike flow-matching baselines that achieve competitive FID at the expense of coverage, recursive refinement improves both quality and diversity simultaneously.
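
As a reading aid for the precision and recall argument above, here is a minimal sketch of one common estimator, the k-nearest-neighbour manifold estimate. The abstract does not say which estimator the authors use, so the estimator choice, the function names, and the idea of running it on raw 2D points (rather than on learned image features) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kth_nn_radii(points: Sequence[Point], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries: Sequence[Point], support: Sequence[Point], k: int) -> float:
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = kth_nn_radii(support, k)
    inside = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return inside / len(queries)


def precision_recall(real: Sequence[Point], fake: Sequence[Point], k: int = 3) -> Tuple[float, float]:
    """Precision: generated samples on the real manifold. Recall: real samples on the generated manifold."""
    precision = coverage(fake, real, k)
    recall = coverage(real, fake, k)
    return precision, recall
```

With two real clusters and generated points that cover only one of them, this sketch reports perfect precision but only half recall, which is the mode-collapse pattern the abstract says a single fidelity-driven score can hide.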

119. 【2605.15307】Sound Sparks Motion: Audio and Text Tuning for Video Editing

链接https://arxiv.org/abs/2605.15307

作者:AmirHossein Naghi Razlighi,Aryan Mikaeili,Ali Mahdavi-Amiri,Daniel Cohen-Or,Yiorgos Chrysanthou

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

关键词:Motion-centric video editing, editing remains difficult, large generative video, Motion-centric video, produce specific

备注: Project Page: [this https URL](https://amirhossein-razlighi.github.io/Sound_Sparks_Motion)

点击查看摘要

Abstract:Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: this https URL

120. 【2605.15300】Deep Pre-Alignment for VLMs

链接https://arxiv.org/abs/2605.15300

作者:Tianyu Yu,Kechen Fang,Zihao Wan,Kaidong Zhang,Yicheng Zhang,Jun Song,Bo Zheng,Yuan Yao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, directly map outputs, Vision Language, directly map, lightweight projector

备注: Accepted by ICML 2026. Project Website: [this https URL](https://github.com/THUMAI-Lab/Deep-Pre-Alignment)

点击查看摘要

Abstract:Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth (Zhang et al., 2024; Artzy and Schwartz, 2024) on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.

121. 【2605.15298】PhysBrain 1.0 Technical Report

链接https://arxiv.org/abs/2605.15298

作者:Shijie Lian,Bin Yu,Xiaopeng Lin,Changti Wu,Hang Yuan,Xiaolin Hu,Zhaolong Shen,Yuzhuo Miao,Haishan Liu,Yuxuan Tian,Yukun Shi,Cong Huang,Kai Chen

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:learning broad physical, provide limited coverage, models have advanced, advanced rapidly, limited coverage

备注: Project Page: [this https URL](https://phys-brain.github.io)

点击查看摘要

Abstract:Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.

122. 【2605.15256】ReactiveGWM: Steering NPC in Reactive Game World Models

链接https://arxiv.org/abs/2605.15256

作者:Zeqing Wang,Danze Chen,Zhaohu Xing,Zizhao Tong,Yinhan Zhang,Xingyi Yang,Yeying Jin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current game world, models simulate environments, NPC, Current game, player-centric perspective

备注: The code is available at [this https URL](https://inv-wzq.github.io/ReactiveGWM/)

点击查看摘要

Abstract:Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactivities. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grain player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.

123. 【2605.15231】Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

链接https://arxiv.org/abs/2605.15231

作者:Haoran Li,Tobias Lehrer,Yingxue Zhao,Haosu Zhou,Philipp Stocker,Tobias Pfaff,Nan Li

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:finite element crash, Nonlinear finite element, element crash simulations, iterative design optimisation, computationally expensive

备注: 48 pages, 15 figures, journal paper to be submitted

点击查看摘要

Abstract:Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

124. 【2605.06475】Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features

链接https://arxiv.org/abs/2605.06475

作者:Ranjith Chodavarapu

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:dating historical manuscript, historical manuscript pages, introduce a probabilistic, historical manuscript, visual features

备注

点击查看摘要

Abstract:We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP=92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models) at 5x lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman ρ=0.729), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with ρ=0.905 between uncertainty and page-level error.
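
To make the uncertainty decomposition above concrete, the following is a minimal sketch of how a Normal-Inverse-Gamma head is usually read out in evidential regression, plus the PICP calibration metric the abstract reports. The closed forms are the standard ones for an NIG distribution; the `NIGOutput` container, the function names, and the example numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class NIGOutput:
    """Normal-Inverse-Gamma parameters for one patch (hypothetical container)."""
    gamma: float  # predicted year, the mean of the Normal
    nu: float     # virtual evidence for the mean, must be > 0
    alpha: float  # Inverse-Gamma shape, must be > 1 for finite variances
    beta: float   # Inverse-Gamma scale, must be > 0


def decompose(p: NIGOutput) -> Tuple[float, float, float]:
    """Return (prediction, aleatoric variance, epistemic variance)."""
    if p.nu <= 0 or p.alpha <= 1 or p.beta <= 0:
        raise ValueError("need nu > 0, alpha > 1, beta > 0")
    aleatoric = p.beta / (p.alpha - 1.0)  # E[sigma^2]: noise inherent in the data
    epistemic = aleatoric / p.nu          # Var[mu]: uncertainty about the mean itself
    return p.gamma, aleatoric, epistemic


def picp(lower: Sequence[float], upper: Sequence[float], targets: Sequence[float]) -> float:
    """Prediction interval coverage probability: share of targets inside their interval."""
    hits = sum(lo <= t <= hi for lo, hi, t in zip(lower, upper, targets))
    return hits / len(targets)
```

Both variances come from a single set of predicted parameters, which is why no repeated sampling (as in MC Dropout) or multiple models (as in Deep Ensembles) is needed at inference time.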

125. 【2509.22151】MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

链接https://arxiv.org/abs/2509.22151

作者:Jonas Belouadi,Tamy Boubekeur,Adrien Kaiser

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:displacement maps, conductivity maps, including geometry, roughness and displacement, albedo and conductivity

备注: Accepted at ICLR 2026 (poster)

点击查看摘要

Abstract:Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

126. 【2508.17034】DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

链接https://arxiv.org/abs/2508.17034

作者:Jiayi Li,Yuxin Yao,Qiuhang Lu,Juyong Zhang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:partially overlapping data, real-time processing pose, processing pose major, pose major challenges, partially overlapping

备注: Accepted to CVPR 2026, Project page: [this https URL](https://ustc3dv.github.io/DualReg/)

点击查看摘要

Abstract:Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: this https URL.

127. 【2605.15895】Layer Selection in Feature-Based Losses Affects Image Quality and Microstructural Consistency in Deep Learning Super-Resolution of Brain Diffusion MRI

链接https://arxiv.org/abs/2605.15895

作者:David Lohr,Rene Werner

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:prohibitive scan times, motivating computational super-resolution, scan times, motivating computational, Clinical application

备注

点击查看摘要

Abstract:Clinical application of high-resolution diffusion MRI is hindered by hardware limitations and prohibitive scan times, motivating computational super-resolution. This study investigates the efficacy of a feature-based loss function in preserving diffusion signal consistency in deep learning super-resolution. Using 7T data from the human connectome project to generate pairs of low- and high-resolution diffusion weighted images (DWI), we trained UNets for 2D super-resolution. Ablation and isolation studies evaluated different VGG16-layers for feature-based losses against an image-based L1 baseline. Deeper layers and combinations thereof resulted in grid-like artifacts in super-resolution DWIs, which persisted in diffusion parameters like quantitative and fractional anisotropy. No such artifacts were present when using the shallowest layer. Downstream analysis for this layer showed great consistency with the ground truth, even for 9-fold super-resolution. Image SNR and used VGG16-layer depths modulated artifact appearance and severity, mandating careful selection of contributing layers for application in and beyond diffusion MRI.

128. 【2605.15707】Evaluation of Anatomical Shape Priors in Deep Learning-Based Cardiac Multi-Compartment Segmentation

链接https://arxiv.org/abs/2605.15707

作者:Michael Hudler,Franz Thaler,Martin Urschler

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Whole-heart multi-compartment, enforce anatomical plausibility, explicitly enforce anatomical, clinically important, explicitly enforce

备注: Published in the Proceedings of the Third Austrian Symposium on AI, Robotics, and Vision (AIRoV 2026), pp. 23-27

点击查看摘要

Abstract:Whole-heart multi-compartment CT segmentation is clinically important, but standard CNNs do not explicitly enforce anatomical plausibility. Based on statistics derived from the training data, we evaluate whether lightweight explicit shape priors, implemented as shape-aware losses and spatial label distribution heatmap-guided U-Net variants, improve 3D cardiac segmentation on MM-WHS CT and WHS++. Across all experiments, a standard 3D U-Net surprisingly remained a very strong baseline, with handcrafted priors yielding at best marginal and inconsistent changes and often degrading performance. These results suggest that the baseline already captures substantial implicit anatomical regularities and that future gains will likely require more expressive learned priors rather than simple handcrafted anatomical shape constraints.

129. 【2605.15673】Highly Detailed and Generalizable Broadleaf Tree Crown Instance Segmentation from UAV Imagery

链接https://arxiv.org/abs/2605.15673

作者:Mitsutaka Nakada(1),Takahiko Ikebata(1),Kengo Ikebata(1),Yuji Mizuno(2),Yusuke Onoda(3),Ryuichi Takeshige(3 and 4),Kyaw Kyaw Htoo(3),Kanehiro Kitayama(3 and 5),Robert Ong(6),Masanori Onishi(1 and 3) ((1) DeepForest Technologies Co., Ltd., (2) YM Lab., (3) Graduate School of Agriculture, Kyoto University, (4) Graduate School of Science, Osaka Metropolitan University, (5) Faculty of Tropical Forestry, Universiti Malaysia Sabah, (6) Forest Research Centre)

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:unmanned aerial vehicles, delineating individual tree, aerial imagery acquired, highly detailed instance, individual tree crowns

备注: 12 pages, 5 figures, 3 Tables

点击查看摘要

Abstract:We present a highly detailed instance segmentation model for delineating individual tree crowns in natural broadleaf forests using aerial imagery acquired by unmanned aerial vehicles (UAVs). Tree crown delineation in broadleaf forests is more challenging than in other forest types due to diversity of crown shapes and the lack of clearly defined treetops. To address this issue, we developed a deep-learning-based crown segmentation model trained on high-quality annotated crown outlines. We manually delineated 18,507 crown polygons from orthomosaic images collected across seven forests in Japan by skilled annotators, and developed a model based on Mask2Former with multiple backbone architectures. The best model achieved high segmentation performance in structurally complex broadleaf forests using only RGB imagery. This performance was maintained when applied to geographically distinct forests within Japan, as well as to biologically distinct tropical rainforests in Borneo. These results demonstrate that using a large number of high-quality annotated datasets is critical for achieving detailed and generalizable crown segmentation across diverse forest ecosystems. The developed model has been integrated into DF Scanner Pro, a software that supports practical forest monitoring using UAVs, and this implementation is expected to enable a wide range of users to analyze tree-level information in broadleaf forest from UAVs.

130. 【2605.15671】Degradation-Aware Blur-Segmentation of Brain Tumor

链接https://arxiv.org/abs/2605.15671

作者:Yuchun Wang,Xiaosong Li,Gefei Liang,Yang Liu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:radiotherapy target delineation, multimodal MRI segmentation, surgical planning, post-treatment assessment, pivotal step

备注

点击查看摘要

Abstract:Multimodal 3D MRI brain tumor segmentation is a pivotal step in radiotherapy target delineation, surgical planning and post-treatment assessment. Existing methods often assume artifact-free MRI images. However, inevitable patient motion during scanning introduces artifacts and blur that degrade boundary and texture features, leading to poor segmentation performance. To bridge this gap, we introduce Degradation-Aware Blur-Segmentation Net (DABSeg), a synchronous deblurring 3D multimodal MRI segmentation network that unifies blur removal and accurate segmentation. Specifically, we propose a feature-domain motion-deblurring stem to compensate for blur and rebalance intensity. Concurrently, the backbone network embeds a blur-aware cross-modal cross-attention module and multi-scale residual aggregation to yield effective modality complementarity. Notably, we optimize a joint loss that combines weighted Dice with a clear-reference reconstruction term, where imbalanced weights are applied to small targets to boost learning intensity and predictive stability for small lesions and border regions. Systematic comparisons and ablation experiments on the BraTS2020 dataset under both clear and degraded conditions consistently demonstrate that DABSeg surpasses state-of-the-art methods in tumor Dice score and boundary precision. These results validate the effectiveness of degradation-aware cross-task collaborative learning in improving the robustness and clinical utility of multi-modal 3D brain tumor segmentation under realistic degraded conditions. The source code is available at this https URL
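
The joint loss described above can be sketched as a class-weighted soft Dice term plus a reconstruction term against a clear reference. This is only an illustration under stated assumptions: the abstract does not give the exact form, so the L1 reconstruction penalty, the balancing factor `lam`, the flat-list data layout, and all function names are hypothetical.

```python
from typing import Sequence


def weighted_dice_loss(
    probs: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Soft Dice loss averaged over classes with per-class weights.

    probs and targets hold one flattened volume per class.
    A larger weight makes errors on that class (e.g. a small lesion) cost more.
    """
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        intersection = sum(pi * ti for pi, ti in zip(p, t))
        dice = (2.0 * intersection + eps) / (sum(p) + sum(t) + eps)
        total += w * (1.0 - dice)
    return total / sum(weights)


def joint_loss(
    probs: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    weights: Sequence[float],
    restored: Sequence[float],
    clear_ref: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Weighted Dice plus lam times the mean absolute error to the clear reference (assumed L1)."""
    recon = sum(abs(r - c) for r, c in zip(restored, clear_ref)) / len(clear_ref)
    return weighted_dice_loss(probs, targets, weights) + lam * recon
```

In this sketch, giving the small class a weight of 3 against 1 for the large class means that missing the small target entirely drives the Dice term to about 0.75 even when the large class is segmented perfectly.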

131. 【2605.15579】TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling

链接https://arxiv.org/abs/2605.15579

作者:Xinmin Feng,Li Li,Dong Liu,Feng Wu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:requiring joint optimization, fit diverse display, bandwidth constraints, requiring joint, effective frame-rate rescaling

备注: Accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at this https URL.
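
The invertibility idea above can be illustrated with a fixed temporal Haar transform, which splits each pair of frames into a low-frequency average (the low-frame-rate video) and a high-frequency difference. This is a simplified stand-in, not the paper's Multi-Input Multi-Output Temporal Wavelet Transform; the function names and the representation of frames as flat lists of pixel values are assumptions.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def haar_forward(frames: Sequence[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Split 2N frames into N low-frequency and N high-frequency frames."""
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(x + y) / 2.0 for x, y in zip(a, b)])   # temporal average
        high.append([(x - y) / 2.0 for x, y in zip(a, b)])  # temporal detail
    return low, high


def haar_inverse(low: Sequence[Frame], high: Sequence[Frame]) -> List[Frame]:
    """Recover the original 2N frames from the two sub-bands."""
    frames: List[Frame] = []
    for l, h in zip(low, high):
        frames.append([x + y for x, y in zip(l, h)])  # first frame of the pair
        frames.append([x - y for x, y in zip(l, h)])  # second frame of the pair
    return frames
```

Reconstruction is exact only when the high-frequency band is available. In the rescaling setting only the low band is transmitted, which is why the abstract pairs the transform with a module that reconstructs the missing high-frequency information.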

132. 【2605.15558】xt-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction

链接https://arxiv.org/abs/2605.15558

作者:Hao Yang,Xianping Ma,Peifeng Ma,Man-On Pun

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:high communication costs, land cover analysis, urban mapping, remote sensing imagery, environmental monitoring

备注: 15 pages, 8 figures, submitted to ISPRS JPRS

点击查看摘要

Abstract:High-resolution remote sensing imagery is critical for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, resulting in redundant and inefficient delivery. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images accompanied by compact textual descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to approximately 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model is introduced, which leverages cross-modal learning to recover fine spatial details and maintain semantic coherence. Experimental results on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets demonstrate that the proposed framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient and information-preserving image transfer for remote sensing applications. The implementation will be made publicly available on GitHub at this https URL.

133. 【2605.15456】DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems

Link: https://arxiv.org/abs/2605.15456

Authors: Romario Gualdrón-Hurtado, Roman Jacome, Leon Suarez, Henry Arguello

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Keywords: Solving imaging inverse, designing proper prior, proper prior models, imaging inverse problems, underlying signal

Comments: 17 pages, 8 figures, 8 tables

Abstract:Solving imaging inverse problems has usually been addressed by designing proper prior models of the underlying signal. However, minimizing the data fidelity term poses significant challenges due to the ill-conditioned sensing matrix caused by physical constraints in the acquisition system. Thus, preconditioning techniques have been adopted in classical optimization theory to address ill-conditioned data-fidelity minimization by transforming the algorithm gradient step to achieve faster convergence and better numerical stability. We extend the preconditioning concept beyond convergence acceleration and use it to improve reconstruction quality. We introduce DIPA: Distilled Preconditioned Algorithms, where a preconditioning operator (PO) is optimized using teacher-guided distillation criteria. Unlike standard knowledge distillation (KD) for model compression, the teacher and student differ by the sensing operators available during reconstruction: the teacher uses a simulated, better-conditioned, and more informative sensing matrix, whereas the student uses the physically feasible sensing matrix. We design different distillation loss functions to transfer different properties of the teacher algorithm to the preconditioned student. The PO can be linear (L-DIPA), allowing interpretability, or non-linear (N-DIPA), parametrized by a neural network, offering better scalability. We validate the proposed PO design across several imaging modalities, including magnetic resonance imaging, compressed sensing, and super-resolution imaging.
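
The classical role of preconditioning mentioned in the abstract can be sketched on a tiny least-squares problem. The example below is an assumption-laden toy, not DIPA itself: the preconditioner is the hand-computed inverse of AᵀA rather than a learned, distilled operator, and every name (`gradient_descent`, `precond`, and so on) is hypothetical.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def gradient_descent(A, y, P, step, iters):
    """Minimise 0.5 * ||A x - y||^2 with the update x <- x - step * P * A^T (A x - y)."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [pred - obs for pred, obs in zip(matvec(A, x), y)]
        gradient = matvec(At, residual)
        x = [xi - step * di for xi, di in zip(x, matvec(P, gradient))]
    return x


def max_error(x, reference):
    return max(abs(a - b) for a, b in zip(x, reference))


# Ill-conditioned toy sensing matrix: the second coordinate is barely observed.
A = [[1.0, 0.0], [0.0, 0.05]]
x_true = [2.0, -3.0]
y = matvec(A, x_true)

identity = [[1.0, 0.0], [0.0, 1.0]]           # no preconditioning
precond = [[1.0, 0.0], [0.0, 1.0 / 0.05**2]]  # assumed ideal choice: inverse of A^T A

x_plain = gradient_descent(A, y, identity, step=1.0, iters=50)
x_pre = gradient_descent(A, y, precond, step=1.0, iters=50)
```

After 50 iterations the plain iterate has barely moved along the weakly observed coordinate, while the preconditioned one has already reached `x_true`. This is the convergence-acceleration role that the abstract says DIPA extends toward reconstruction quality.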

134. 【2605.15392】Frequency-domain Event-based Imaging for Selective Surveillance

Link: https://arxiv.org/abs/2605.15392

Authors: Megan Birch, James Rick, Adrish Kar, Jason Zutty, Joseph L. Greene

Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling motion extraction, Event-based cameras, high dynamic range, attractive sensing modality, dynamic range

Comments: 14 pages, 11 figures

Abstract:Event-based cameras (EBCs) are an attractive sensing modality for surveillance due to their reporting of pixel-level radiance changes with microsecond resolution and high dynamic range, enabling motion extraction while suppressing background. Their asynchronous, sparse output, however, necessitates algorithms that identify targets in event-space without processing full frames. We introduce Frequency Rate Information for Event Space (FRIES), a neuromorphic processing framework that detects periodicity in events, such as rotor rotation and mechanical vibrations, to discriminate and monitor man-made objects. FRIES first applies a time gate to suppress background and noise, then aggregates events into a pixel-wise activity (e.g., density) map and clusters pixels into regions-of-interest (ROIs). A localized spectral analysis is applied to each ROI to extract dominant frequencies used to distinguish structured object signatures from unstructured background and noise. Discriminated targets are visualized using a Resonant Time Surface (RTS), a frequency-selective method that weights events by their phase coherence with the extracted frequencies, rewarding in-sync content and suppressing out-of-sync clutter. We demonstrate FRIES and RTS in a controlled indoor experiment to recover the rotational frequency of a mechanical chopper and drone rotors against a moving background. We further test these methods on outdoor data to detect a hovering drone against a realistic treeline. These preliminary results establish frequency-domain event processing as a promising front-end for selective surveillance in neuromorphic pipelines and a complementary surveillance modality, leveraging the high temporal resolution to enable spectral discrimination.
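
The idea of scoring events by phase coherence at a candidate frequency can be sketched on synthetic timestamps. This is a simplified stand-in and not the FRIES or RTS implementation: it skips the time gate, activity map, and ROI clustering, and the names `phase_coherence` and `dominant_frequency`, the 50 Hz rotor, and the event counts are all made-up assumptions.

```python
import cmath
import math
import random


def phase_coherence(timestamps, freq_hz):
    """Magnitude (0..1) of the mean unit phasor of the events at one frequency."""
    total = sum(cmath.exp(-2j * math.pi * freq_hz * t) for t in timestamps)
    return abs(total) / len(timestamps)


def dominant_frequency(timestamps, candidates_hz):
    """Candidate frequency at which the events are most phase-coherent."""
    return max(candidates_hz, key=lambda f: phase_coherence(timestamps, f))


# Synthetic one-second recording from a single region of interest (assumed data).
rng = random.Random(0)
rotor_hz = 50.0
# A periodic source fires a short burst of three events once per rotation.
rotor_events = [k / rotor_hz + offset
                for k in range(50)
                for offset in (0.0, 0.002, 0.004)]
# Unstructured background: events at uniformly random times.
background_events = [rng.uniform(0.0, 1.0) for _ in range(60)]
events = sorted(rotor_events + background_events)

# Scan integer candidate frequencies from 10 Hz to 200 Hz.
estimated_hz = dominant_frequency(events, range(10, 201))
```

Periodic events line up in phase at their own frequency, so their phasors add constructively. Background events at random times mostly cancel, and that contrast is what lets a frequency-selective weighting favour in-sync content.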

135. 【2605.15241】From Full and Partial Intraoral Scans to Crown Proposal: A Classification-Guided Restoration Assistance Pipeline

Link: https://arxiv.org/abs/2605.15241

Authors: Rabin Kunwar, Dikshya Parajuli, Rujal Acharya, Romik Gosai, Prince Panta, Kundan Siwakoti, Shuvangi Adhikari, Saugat Kafley, Louis Digiorgio, Amit Regmi, Akio Tanaka, Masahiko Inada, Yuriko Komagamine, Kennta Kashiwazaki, Manabu Kanazawa

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: CAM workflows, Single-unit crown restoration, designing crowns directly, clinical dentistry, common procedures

Comments:

Abstract:Single-unit crown restoration is among the most common procedures in clinical dentistry, with CAD/CAM workflows now designing crowns directly from intraoral scans. Partial scans are often preferred over full-arch scans for single-unit cases due to fewer stitching errors, yet most segmentation networks trained on full arches fail on partial scans, while end-to-end generative crown methods often produce over-smoothed surfaces that lose occlusal detail. We propose an end-to-end pipeline that takes a raw intraoral scan and target FDI tooth number as input and outputs an initial, patient-specific crown proposal for clinician refinement. The pipeline has three phases: (I) data preparation and pose standardization; (II) segmentation routed by scan type; and (III) crown proposal generation via context-aware retrieval and Blender-based fitting. We address partial-scan segmentation through a classify-then-align strategy: a DGCNN classifier categorizes the scan into one of five anatomical types, then coarse-to-fine RANSAC+ICP registration standardizes the jaw coordinate frame, followed by graph-cut optimization to refine tooth-gingival boundaries. Trained on 1,958 partial scans, the pipeline achieves macro-average DSC 0.9249, Recall 0.8919, and Precision 0.9615 across 17 semantic classes; a fine-tuned full-arch model reaches DSC 0.9347. The prepared tooth and its mesial and distal neighbors achieve DSC 0.9468-0.9569 with sub-millimeter Centroid Errors (0.2666-0.2774 mm). These centroids anchor a retrieval module using DGCNN embeddings and cosine similarity over neighboring and opposing teeth, followed by spline-guided alignment and Blender Python API refinement. The pipeline produces a preliminary crown shell in 2.5-3.5 minutes, offering a practical alternative to end-to-end generative approaches.
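
The macro-averaged DSC, Recall, and Precision reported above are per-class scores averaged with equal weight per class. A minimal sketch of that computation on made-up labels follows. The function names, the three toy classes, and the convention of scoring an empty denominator as 1.0 are assumptions, not details from the paper.

```python
def per_class_scores(pred, truth, label):
    """Dice (DSC), precision, and recall for one class over paired label lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    fp = sum(1 for p, t in zip(pred, truth) if p == label and t != label)
    fn = sum(1 for p, t in zip(pred, truth) if p != label and t == label)
    # Assumed convention: a class absent from both lists scores 1.0.
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, precision, recall


def macro_average(pred, truth, labels):
    """Unweighted mean of the per-class (dice, precision, recall) triples."""
    scores = [per_class_scores(pred, truth, label) for label in labels]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Toy per-vertex labels for three classes (0, 1, 2); real scans use many more points.
truth = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0]

macro_dice, macro_precision, macro_recall = macro_average(pred, truth, [0, 1, 2])
```

Macro averaging gives a small class the same weight as a large one, so a single poorly segmented tooth lowers the score even when most vertices are labelled correctly.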