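This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
961 papers were updated today, including:
- Natural language processing: 129 papers
- Information retrieval: 33 papers
- Computer vision: 188 papers
Natural Language Processing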
1. 【2605.12493】LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
Link: https://arxiv.org/abs/2605.12493
Authors: Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang
Categories: Computation and Language (cs.CL)
Keywords: recalling interface affordances, recurring failure modes, interface affordances, failure modes, memory
Comments: Work in Progress
Abstract:Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
2. 【2605.12487】Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Link: https://arxiv.org/abs/2605.12487
Authors: Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, Assaf Toledo
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: query refinement paradigm, LLM-guided query refinement, challenging zero-shot search, explore the effectiveness, paradigm for extending
Comments:
Abstract:We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at this https URL.
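To make the idea of refining a query embedding from document-level feedback concrete, here is a minimal sketch. It uses a classic Rocchio-style relevance-feedback update as a stand-in; the abstract does not state the authors' actual update rule, so the function `refine_query`, the weights `beta` and `gamma`, and the boolean `labels` (imagined here as an LLM's relevance judgments on a few documents) are all assumptions for illustration.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def refine_query(query, docs, labels, beta=0.5, gamma=0.25):
    """Rocchio-style update (an assumed stand-in, not the paper's rule):
    move the query toward documents judged relevant and away from the rest."""
    dim = len(query)
    relevant = [d for d, ok in zip(docs, labels) if ok]
    irrelevant = [d for d, ok in zip(docs, labels) if not ok]
    pos = mean_vector(relevant, dim)
    neg = mean_vector(irrelevant, dim)
    return [q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


# Toy 2-d embeddings; labels play the role of hypothetical LLM feedback.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0]]
labels = [True, False]
refined = refine_query(query, docs, labels)
```

In this toy case the refined query shifts toward the document marked relevant, which is the qualitative behavior the abstract describes.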
3. 【2605.12477】MEME: Multi-entity Evolving Memory Evaluation
Link: https://arxiv.org/abs/2605.12477
Authors: Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: LLM-based agents increasingly, agents increasingly operate, increasingly operate, operate in persistent, persistent environments
Comments:
Abstract:LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: this https URL.
4. 【2605.12476】Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Link: https://arxiv.org/abs/2605.12476
Authors: Sagi Ahrac, Noya Hochwald, Mor Geva
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: language models efficiently, models enable scaling, scaling language models, enable scaling language, models efficiently
Comments:
Abstract:Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router-expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a 1B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router-expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
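The online K-Means router described in the abstract (a running average per expert, assignment by cosine similarity) can be sketched in a few lines. This is only an illustration under assumptions: the class name `OnlineKMeansRouter`, top-1 routing, and treating the initial centroid as one observation are choices made here, not details given in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class OnlineKMeansRouter:
    """Hypothetical top-1 router: each expert keeps a running mean of its tokens."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        # Assumption: the initial centroid counts as one observation.
        self.counts = [1] * len(centroids)

    def route(self, hidden):
        # Assign the token to the expert whose centroid is most similar.
        expert = max(range(len(self.centroids)),
                     key=lambda e: cosine(hidden, self.centroids[e]))
        # Incremental running-average update of that expert's centroid.
        self.counts[expert] += 1
        n = self.counts[expert]
        self.centroids[expert] = [c + (h - c) / n
                                  for c, h in zip(self.centroids[expert], hidden)]
        return expert


router = OnlineKMeansRouter([[1.0, 0.0], [0.0, 1.0]])
assignments = [router.route(h) for h in ([0.9, 0.1], [0.2, 0.8], [1.0, 0.3])]
```

The router has no learned parameters: its state is just the per-expert means, which is what "parameter-free" refers to in the abstract.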
5. 【2605.12471】KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Link: https://arxiv.org/abs/2605.12471
Authors: Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: training-free long-context inference, long-context inference protocol, treats the key-value, protocol that treats, left fold
Comments: 12 pages, 3 figures, 6 tables
Abstract:We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
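The left-fold structure described in the abstract can be illustrated with `functools.reduce`, Python's counterpart of `foldl`. The sketch below only shows the control flow: `process_chunk` is a toy stand-in for a real forward pass, and the `(position, token)` pairs are hypothetical placeholders for actual key/value tensors.

```python
from functools import reduce


def process_chunk(cache, chunk):
    """Toy stand-in for one forward pass: the model would attend to `cache`
    as a prefix; here we only append placeholder key/value entries."""
    start = len(cache)
    new_kv = [(start + i, token) for i, token in enumerate(chunk)]
    return cache + new_kv  # enlarged cache is passed to the next step


def kv_fold(tokens, chunk_size):
    """Left fold over sequence chunks with the KV cache as the accumulator."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return reduce(process_chunk, chunks, [])


cache = kv_fold(list("abcdefg"), chunk_size=3)
```

The same one-step update is applied to every chunk, and the cache grows with the sequence, in contrast with streaming methods that bound memory by discarding entries.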
6. 【2605.12466】Solve the Loop: Attractor Models for Language and Reasoning
Link: https://arxiv.org/abs/2605.12466
Authors: Jacob Fein-Ashley, Paria Rashidinejad
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Keywords: refining latent representations, iteratively refining latent, Attractor Models, Looped Transformers offer, purely feed-forward computation
Comments:
Abstract:Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
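The "refine by solving for the fixed point, with iterations chosen by convergence" idea can be sketched with plain fixed-point iteration. This is a toy illustration only: the affine map `refine` is an assumed contraction standing in for the attractor module, `solve_fixed_point` is a hypothetical helper, and implicit differentiation is not shown.

```python
def solve_fixed_point(step, z0, tol=1e-8, max_iters=100):
    """Iterate z <- step(z) until successive iterates differ by less than tol.
    Returns the final iterate and the number of iterations used."""
    z = z0
    for n in range(1, max_iters + 1):
        z_next = step(z)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, n
        z = z_next
    return z, max_iters


x = [1.0, -2.0]  # toy input


def refine(z):
    # Assumed contraction map; its fixed point is z* = 2 * x.
    return [0.5 * zi + xi for zi, xi in zip(z, x)]


# A poor initial proposal needs more solver iterations ...
z_star, iters_far = solve_fixed_point(refine, [0.0, 0.0])
# ... than a proposal that already sits near equilibrium.
_, iters_near = solve_fixed_point(refine, [1.999, -3.999])
```

The second call hints at why a proposal that starts near equilibrium makes the solver cheap, which is loosely related to the "equilibrium internalization" effect the abstract reports.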
7. 【2605.12460】Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
Link: https://arxiv.org/abs/2605.12460
Authors: Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: computer use applications, continued improvements, capability have unlocked, unlocked their widespread, drivers of autonomous
Comments: Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/
Abstract:The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
8. 【2605.12456】TextSeal: A Localized LLM Watermark for Provenance Distillation Protection
Link: https://arxiv.org/abs/2605.12456
Authors: Tom Sander, Hongyan Chang, Tomáš Souček, Tuan Tran, Valeriu Lacatusu, Sylvestre-Alvise Rebuffi, Alexandre Mourachko, Surya Parimi, Christophe Ropers, Rashel Moritz, Vanessa Stark, Hady Elsahar, Pierre Fernandez
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, Building on Gumbel-max, Gumbel-max sampling, large language
Comments:
Abstract:We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance, while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.
9. 【2605.12452】The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
Link: https://arxiv.org/abs/2605.12452
Authors: Gunjan, Sidahmed Benabderrahmane, Talal Rahwan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: Large Language Models, Large Language, Language Models, text at scale, raising concerns
Comments:
Abstract:Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.
10. 【2605.12446】ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models
Link: https://arxiv.org/abs/2605.12446
Authors: Chen Li, Xiaoling Hu, Songzhu Zheng, Jiawei Zhou, Chao Chen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large language models, Large language, making reliable confidence, confidence, making reliable
Comments: 18 pages, 2 figures
Abstract:Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question--answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.
11. 【2605.12438】A Causal Language Modeling Detour Improves Encoder Continued Pretraining
Link: https://arxiv.org/abs/2605.12438
Authors: Rian Touchent, Eric de la Clergerie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Masked Language Modeling, Causal Language Modeling, Language Modeling, Masked Language, training with Masked
Comments:
Abstract:When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
12. 【2605.12426】Geometric Factual Recall in Transformers
Link: https://arxiv.org/abs/2605.12426
Authors: Shauli Ravfogel, Gilad Yehudai, Joan Bruna, Alberto Bietti
Categories: Computation and Language (cs.CL)
Keywords: memorize factual associations, transformer language models, models memorize factual, language models memorize, factual associations
Comments: Preprint
Abstract:How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting (chains of relational queries such as "Who is the mother of the wife of x?"), providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
13. 【2605.12422】Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals
Link: https://arxiv.org/abs/2605.12422
Authors: Yo Ehara
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: large language models, substantial human effort, requires substantial human, Automatic generation, language models
Comments: Accepted to Educational Data Mining (EDM) 2026 (Poster/Demo Track)
Abstract:Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.
14. 【2605.12419】ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
Link: https://arxiv.org/abs/2605.12419
Authors: Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Naijing Zhang, Alicia Tsai, Li Wei, Lukasz Heldt, Lichan Hong, Ed Chi, Xinyang Yi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: language-based reasoning abilities, large language model, language-based reasoning, reasoning abilities, rapid advancements
Comments:
Abstract:Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
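The distance-triggered weight averaging described in the abstract can be sketched as follows. This is a rough illustration under assumptions: the L2 distance, the single averaging step, and the names `regulate`, `max_distance`, and `alpha` are choices made here; the abstract does not specify the distance metric, the averaging coefficient, or how often the check runs.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flattened weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def regulate(weights, origin, max_distance, alpha=0.5):
    """If the fine-tuned weights drifted past max_distance from the original
    weights, pull them back by averaging with the origin (assumed rule)."""
    if l2_distance(weights, origin) <= max_distance:
        return weights  # within the allowed radius: leave untouched
    return [alpha * w + (1 - alpha) * o for w, o in zip(weights, origin)]


origin = [0.0, 0.0]
drifted = regulate([3.0, 4.0], origin, max_distance=2.0)    # distance 5.0, averaged
unchanged = regulate([1.0, 0.0], origin, max_distance=2.0)  # distance 1.0, kept
```

With `alpha=0.5` a single step halves the distance to the origin, so it may still exceed the threshold; whether the actual method repeats the step or chooses the coefficient adaptively is not stated in the abstract.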
15. 【2605.12412】Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
Link: https://arxiv.org/abs/2605.12412
Authors: Eric Bigelow, Raphaël Sarfati, Daniel Wurgaft, Owen Lewis, Thomas McGrath, Jack Merullo, Atticus Geiger, Ekdeep Singh Lubana
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, Large, Language
Comments:
Abstract:Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.
16. 【2605.12411】Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
Link: https://arxiv.org/abs/2605.12411
Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Keywords: buyer bot facing, procurement assistant negotiating, unknown seller, negotiate and transact, transact in natural
Comments:
Abstract:AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.
17. 【2605.12398】Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
Link: https://arxiv.org/abs/2605.12398
Authors: Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: improving large language, Estimating question difficulty, Estimating question, large language models, critical component
Comments: Accepted at ACL 2026
Abstract:Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce the Q-DAPS (Question Difficulty based on Answer Plausibility Scores) method, a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC), demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
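The core quantity in the abstract, the entropy of plausibility scores over candidate answers, can be sketched as below. This is a minimal illustration under assumptions: the scores are taken to be non-negative and are normalized into a probability distribution before computing Shannon entropy; the function name `plausibility_entropy` and the normalization step are not taken from the paper.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in bits) of plausibility scores after normalizing them
    to sum to 1. Higher entropy means plausibility is spread over many
    candidates, read here as a harder question (assumed interpretation)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in probs)


# One clearly plausible candidate: low entropy, likely an easy question.
easy = plausibility_entropy([0.97, 0.01, 0.01, 0.01])
# Four equally plausible candidates: maximal entropy for four options.
hard = plausibility_entropy([1.0, 1.0, 1.0, 1.0])
```

For four equally plausible candidates the entropy is log2(4) = 2 bits, the largest value possible with four options.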
18. 【2605.12395】A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
Link: https://arxiv.org/abs/2605.12395
Authors: Michela Lorandi, Anya Belz
Subjects: Computation and Language (cs.CL)
Keywords: CTG systems, current CTG systems, evaluation
Comments:
Click to view abstract
Abstract:Background: Many different approaches to controlled text generation (CTG) have been proposed over recent years, but it is difficult to get a clear picture of which approach performs best, because different datasets and evaluation methods are used in each case to assess the control achieved. Objectives: Our aim in the work reported in this paper is to develop an approach to evaluation that enables us to comparatively evaluate different CTG systems in a manner that is both informative and fair to the individual systems. Methods: We use a level-playing-field (LPF) approach to comparative evaluation where we (i) generate and process all system outputs in a standardised way, and (ii) apply a shared set of evaluation methods and datasets, selected based on those currently in use, in order to ensure fair evaluation. Results: When re-evaluated in this way, performance results for a representative set of current CTG systems differ substantially from originally reported results, in most cases for the worse. This highlights the importance of a shared standardised way of assessing controlled generation. Conclusions: The discrepancies revealed by LPF evaluation demonstrate the urgent need for standardised, reproducible evaluation practices in CTG. Our results suggest that without such practices, published performance claims may substantially misrepresent true system capabilities.
19. 【2605.12384】Scalable Token-Level Hallucination Detection in Large Language Models
Link: https://arxiv.org/abs/2605.12384
Authors: Rui Min, Tianyu Pang, Chao Du, Minhao Cheng, Yi R. Fung
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: demonstrated remarkable capabilities, Large language models, Large language, frequently produce hallucinations, remarkable capabilities
Comments:
Click to view abstract
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
20. 【2605.12382】Pretraining Exposure Explains Popularity Judgments in Large Language Models
Link: https://arxiv.org/abs/2605.12382
Authors: Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, Large language, exhibit systematic preferences, exhibit systematic, popularity
Comments: Accepted at SIGIR 2026
Click to view abstract
Abstract:Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure strongly correlates with Wikipedia popularity, validating exposure as a meaningful proxy for real-world salience during the training period. More importantly, we find that LLM popularity judgments align more closely with exposure than with Wikipedia, especially when elicited via pairwise comparisons. This alignment is strongest for larger models and persists in the long tail, where Wikipedia popularity becomes unreliable. Overall, our findings demonstrate that popularity priors in LLMs are primarily shaped by pretraining statistics rather than external popularity signals, offering concrete evidence that data exposure plays a central role in driving popularity bias.
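The comparison described above, checking whether LLM popularity judgments track pretraining exposure more closely than Wikipedia pageviews, comes down to correlating ranked signals. The abstract does not name the correlation statistic, so Spearman rank correlation is an assumption here, and every number in the example is made-up toy data rather than a result from the paper.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy values for five entities (not from the paper).
exposure_counts = [120, 15000, 900, 40, 310000]
wiki_pageviews = [800, 20000, 300, 50, 900000]
llm_scores = [2, 7, 4, 1, 9]
print(spearman(llm_scores, exposure_counts), spearman(llm_scores, wiki_pageviews))
```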
21. 【2605.12370】Context Convergence Improves Answering Inferential Questions
Link: https://arxiv.org/abs/2605.12370
Authors: Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, open-domain Question Answering, Language Models, Large Language, Question Answering
Comments: Accepted at SIGIR 2026
Click to view abstract
Abstract:While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions, where answers must be derived rather than directly retrieved, remains underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.
22. 【2605.12361】MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Link: https://arxiv.org/abs/2605.12361
Authors: Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Evaluating large language, model capabilities improve, Evaluating large, large language models, capabilities improve
Comments:
Click to view abstract
Abstract:Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
23. 【2605.12345】Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation
Link: https://arxiv.org/abs/2605.12345
Authors: Michela Lorandi, Anya Belz
Subjects: Computation and Language (cs.CL)
Keywords: offer task-specific fine-tuning, require separate fine-tuning, Parameter-efficient fine-tuning, separately trained PEFT, trained PEFT modules
Comments:
Click to view abstract
Abstract:Parameter-efficient fine-tuning (PEFT) techniques offer task-specific fine-tuning at a fraction of the cost of full fine-tuning, but require separate fine-tuning for every new task (combination). In this paper, we explore three ways of generalising beyond single-task training/inference: (i) training on combinations of multiple, related datasets; (ii) at inference, composing the weight matrices of separately trained PEFT modules; and (iii) at inference, composing the outputs of separately trained PEFT modules. We test these approaches on three different LLMs, QLoRA as the PEFT technique, and three sets of controlled text generation datasets for sentiment control, topic control, and multi-attribute control. We find that summing PEFT module outputs is a particularly strong composition method, which consistently either outperforms or matches the performance of alternative approaches. This is the case even when comparing against single-task specialised modules on the single-task test set, where three-module output composition achieves an average 2 percentage point performance increase across all models for sentiment control.
24. 【2605.12328】A categorical error sensitivity index (ISEC): A preventive ordinal decision-support measure for irrecoverable errors in manual data entry systems
Link: https://arxiv.org/abs/2605.12328
Authors: Ricardo Raúl Palma, Mauro Anibal Benetti, Fabricio Orlando Sanchez Varretti
Subjects: Computation and Language (cs.CL)
Keywords: medium sized enterprises, systems remain structurally, remain structurally vulnerable, entry systems remain, sized enterprises
Comments: 15 pages, 4 figures
Click to view abstract
Abstract:Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium-sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human-machine interaction may produce errors that are irrecoverable ex post. In the absence of automated input controls, manual data entry frequently generates irrecoverable categorical distortions that propagate into Key Performance Indicators (KPIs), thereby misleading managerial decision-making. State-of-the-art normalization tools typically evaluate semantic and morphological dimensions in isolation and rely heavily on standard dictionaries, rendering them ineffective for SME master data rich in custom SKUs, abbreviations, and domain-specific technical jargon. This paper introduces the Categorical Error Sensitivity Index (ISEC), an ordinal composite score designed to rank category pairs according to their structural susceptibility to confusion. ISEC integrates semantic distance (via word embeddings), custom-weighted morphological transformation costs (through an adapted Damerau-Levenshtein algorithm), and empirical frequency into a unified, mathematically robust preventive framework. By leveraging vector database architectures, ISEC reduces computational complexity, achieving approximately a 195x performance improvement over brute-force methods. Validated across three heterogeneous datasets (governmental judicial records, retail inventory, and a synthetic ISO-coded metalworking catalog), ISEC provides a scalable and proactive data governance instrument that enables SMEs to detect latent structural risk embedded within their categorical data assets.
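The morphological component mentioned above can be illustrated with a weighted edit distance. The sketch below implements the optimal-string-alignment variant of Damerau-Levenshtein with configurable costs. The abstract does not give the paper's adaptation, its weights, or how this term is combined with semantic distance and frequency, so the function name `weighted_osa_distance`, the cost parameters, and the sample strings are hypothetical.

```python
def weighted_osa_distance(a, b, sub_cost=1.0, indel_cost=1.0, swap_cost=1.0):
    """Edit distance allowing insert, delete, substitute, and adjacent swap.

    This is the optimal-string-alignment variant of Damerau-Levenshtein.
    The cost parameters are illustrative assumptions, not the paper's weights.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + indel_cost,   # delete from a
                d[i][j - 1] + indel_cost,   # insert into a
                d[i - 1][j - 1] + cost,     # match or substitute
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[rows - 1][cols - 1]

# Hypothetical SKU-like labels that differ by one swapped pair of characters.
print(weighted_osa_distance("BOLT-M8", "BOLT-8M"))
```

Lowering `swap_cost` relative to the other costs is one way to treat transposed characters as especially confusable during manual entry.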
25. 【2605.12313】Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
Link: https://arxiv.org/abs/2605.12313
Authors: Rezarta Islamaj, Joey Chan, Robert Leaman, Jongmyung Jung, Hyeongsoon Hwang, Quoc-An Nguyen, Hoang-Quynh Le, Harikrishnan Gurushankar Saisudha, Ganesh Chandrasekar, Rustam R. Taktashov, Nadezhda Yu. Bizyukova, Sofia I. R. Conceição, Paulo R. C. Lopes, Reem Abdel Salam, Mary Adewunmi, Zhiyong Lu
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Multi-hop question answering, remains a significant, biomedical domain, multiple sources, answer complex questions
Comments:
Click to view abstract
Abstract:Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: this https URL and benchmark this https URL
26. 【2605.12299】GKnow: Measuring the Entanglement of Gender Bias and Factual Gender
Link: https://arxiv.org/abs/2605.12299
Authors: Leonor Veloso, Hinrich Schütze
Subjects: Computation and Language (cs.CL)
Keywords: Recent works, gender, gender bias, gendered predictions, works have analyzed
Comments: Accepted to ACL 2026
Click to view abstract
Abstract:Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very specific gender-related task, such as gendered pronoun prediction, or (ii) fail to distinguish between the production of factually gendered outputs (the correct assumption of gender given a word that carries gender as a semantic property) and gender biased outputs (based on a stereotype). To address these issues, we curate GKnow, a benchmark to assess gender knowledge and gender bias in language models across different types of gender-related predictions. GKnow allows us to identify and analyze circuits and individual neurons responsible for gendered predictions. We test the impact of neuron ablation on benchmarks for disentangling stereotypical and factual gender (DiFair and the test set of GKnow), as well as StereoSet. Results show that gender bias and factual gender are severely entangled on the level of both circuits and neurons, entailing that ablation is an unreliable debiasing method. Furthermore, we show that benchmarks for evaluating gender bias can hide the decrease in factual gender knowledge that accompanies neuron ablation. We curate GKnow as a contribution to the continuous development of robust gender bias benchmarks.
27. 【2605.12288】TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Link: https://arxiv.org/abs/2605.12288
Authors: Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa Doan, Trung Le
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Direct Preference Optimization, aligning language models, Direct Preference, Bregman Preference Optimization, per-token decisions
Comments:
Click to view abstract
Abstract:Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
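For context, the sequence-level logistic/DPO loss that the abstract says TBPO generalizes can be written in a few lines. This is a sketch of the standard DPO loss for one preference pair, not of TBPO itself, whose Bregman-divergence objective is not specified in the abstract. The function name and the default `beta` are assumptions for illustration.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. Returns -log(sigmoid(margin)), where the margin is the
    beta-scaled difference of policy-vs-reference log-ratios.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_pair_loss(-12.0, -15.0, -12.0, -15.0))
```

The loss depends only on whole-sequence log-probabilities, which is the limitation the abstract points to: nothing in it constrains individual next-token decisions.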
28. 【2605.12281】What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty
Link: https://arxiv.org/abs/2605.12281
Authors: Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: difficult to learn, word difficult, word, Abstract, Spanish
Comments: Submitted to BEA 2026 at ACL. 18 pages, 13 figures
Click to view abstract
Abstract:What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
29. 【2605.12264】Reconstruction of Personally Identifiable Information from Supervised Finetuned Models
Link: https://arxiv.org/abs/2605.12264
Authors: Sae Furukawa, Alina Oprea
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Supervised Finetuning, large language model, instruction-following tasks, extensive pre-trained knowledge, adapting a large
Comments:
Click to view abstract
Abstract:Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric QA datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.
30. 【2605.12260】PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
Link: https://arxiv.org/abs/2605.12260
Authors: Jingyi Peng, Zhongwei Wan, Weiting Liu, Qiuzhuang Sun
Subjects: Computation and Language (cs.CL)
Keywords: language agents accumulate, agents accumulate conversation, accumulate conversation history, Long-horizon language agents, making memory management
Comments: Preprint
Click to view abstract
Abstract:Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
31. 【2605.12243】PreScam: A Benchmark for Predicting Scam Progression from Early Conversations
Link: https://arxiv.org/abs/2605.12243
Authors: Weixiang Sun, Shang Ma, Yiyang Li, Tianyi Ma, Zehong Wang, Colby Nelson, Xusheng Xiao, Yanfang Ye
Subjects: Computation and Language (cs.CL)
Keywords: online fraud, romance and investment, major form, form of online, scam
Comments:
Click to view abstract
Abstract:Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer's subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.
32. 【2605.12242】Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
Link: https://arxiv.org/abs/2605.12242
Authors: Deepak Kumar, Baban Gain, Asif Ekbal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Automatic Speech Recognition, Automatic Speech, Speech Recognition, hinder downstream applications, false starts
Comments: Accepted to ACL 2026 (Main)
Click to view abstract
Abstract:Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. We make the codes publicly available at this https URL.
33. 【2605.12227】Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
Link: https://arxiv.org/abs/2605.12227
Authors: Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins
Subjects: Computation and Language (cs.CL)
Keywords: Adapting large language, Group Relative Policy, Relative Policy Optimization, large language models, tasks requires post-training
Comments:
Click to view abstract
Abstract:Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
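The sparse-reward issue the abstract attributes to GRPO can be seen in its group-relative advantage, which standardizes each rollout's reward against the other rollouts sampled for the same prompt. The sketch below covers only that standard GRPO ingredient. How dGRPO adds the teacher's token-level distillation signal is not described in the abstract and is not modeled here. The function name, the use of the population standard deviation, and `eps` are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation with a small eps for stability;
    implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes give a usable learning signal for each rollout.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every rollout fails (a common case with sparse rewards on hard prompts),
# all advantages are zero and the group contributes no gradient.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second case suggests why a dense teacher signal could help: it can still provide guidance when the outcome reward alone gives none.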
34. 【2605.12225】Mechanistic Interpretability of ASR models using Sparse Autoencoders
Link: https://arxiv.org/abs/2605.12225
Authors: Dan Pluth, Zachary Nicholas Houghton, Yu Zhou, Vijay K. Gurbani
Categories: Computation and Language (cs.CL)
Keywords: deep Transformer-based NLP, Transformer-based NLP models, Transformer-based NLP, NLP models, machinations of deep
Comments: 10 pages + references and appendix
Abstract:Understanding the internal machinations of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in various domains that affect the public at large, such as industry, academia, finance, health. While these models have advanced rapidly, their internal mechanisms remain largely a mystery. Techniques such as Sparse Autoencoders (SAE) have emerged to understand these mechanisms by projecting dense representations into a sparse vector. While existing research has demonstrated the viability of the SAE in interpreting text-based Large Language Models (LLMs), there are no equivalent studies that demonstrate the application of a SAE to audio processing models like Automatic Speech Recognizers (ASRs). In this work, a SAE is applied to Whisper, a Transformer-based ASR, training a high-dimensional sparse latent space on frame-level embeddings extracted from the Whisper encoder. Our work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries, and demonstrates cross-lingual feature steering. This work establishes the viability of a SAE model and demonstrates that Whisper encodes a rich amount of linguistic information.
35. 【2605.12207】Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
Link: https://arxiv.org/abs/2605.12207
Authors: Arijit Sehanobish, Charles Lovering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: trainable entries, fixed budget, parameter placement problem, LoRA adapter, placement problem
Comments: Preprint. Comments welcome
Abstract:We study the parameter placement problem: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).
36. 【2605.12185】Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding
Link: https://arxiv.org/abs/2605.12185
Authors: Yigeng Zhou, Wu Li, Yifan Lu, Yequan Wang, Xuebo Liu, Wenya Wang, Jun Yu, Min Zhang, Jing Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, language models accumulate, models accumulate extensive, Large language, accumulate extensive parametric
Comments: Accepted by IEEE TASLP
Abstract:Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing methods address knowledge conflicts through contrastive decoding, but in conflict-free scenarios, static approaches disrupt output distribution. Other dynamic decoding methods attempt to measure the degree of conflict but still struggle with complex real-world situations. In this paper, we propose a two-stage decoding method called Dynamic Cognitive Reconciliation Decoding (DCRD), to predict and mitigate context-memory conflicts. DCRD first analyzes the attention map to assess context fidelity and predict potential conflicts. Based on this prediction, the input is directed to one of two decoding paths: (1) greedy decoding, or (2) context fidelity-based dynamic decoding. This design enables DCRD to handle conflicts efficiently while maintaining high accuracy and decoding efficiency in conflict-free cases. Additionally, to simulate scenarios with frequent knowledge updates, we constructed ConflictKG, a knowledge conflict QA benchmark. Experiments on four LLMs across six QA datasets show that DCRD outperforms all baselines, achieving state-of-the-art performance.
37. 【2605.12178】Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
Link: https://arxiv.org/abs/2605.12178
Authors: Jishnu Sethumadhavan Nair, Patrice Bechard, Rishabh Maheshwary, Surajit Dasgupta, Sravan Ramachandran, Aakash Bhagat, Shruthan Radhakrishna, Pulkit Pattnaik, Johan Obando-Ceron, Shiva Krishna Reddy Malay, Sagar Davasam, Seganrasan Subramanian, Vipul Mittal, Sridhar Krishna Nemala, Christopher Pal, Srinivas Sunkara, Sai Rajeswar
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: World models enable, anticipate the effects, actions by internalizing, models enable agents, World models
Abstract:World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.
38. 【2605.12177】Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach
Link: https://arxiv.org/abs/2605.12177
Authors: Andrea Morandi, Mahesh Viswanathan
Categories: Computation and Language (cs.CL)
Keywords: LLM deployments receive, Production LLM deployments, land 40-50 percentage, 40-50 percentage points, LLM deployments
Abstract:[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment- stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
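The reweighting step $\bar Q = \sum_c \hat\pi_c q_c$ with $\hat\pi_c = n_c/N$ can be worked through on toy numbers. The sketch below uses point estimates for $q_c$ instead of the posteriors the abstract describes, and every cluster name, count, and quality value is invented for illustration.

```python
def prevalence_weighted_quality(cluster_sizes, cluster_quality):
    """Compute Q_bar = sum_c (n_c / N) * q_c over topic clusters."""
    total = sum(cluster_sizes.values())
    return sum(
        cluster_sizes[c] / total * cluster_quality[c] for c in cluster_sizes
    )


# Hypothetical per-topic interaction counts n_c and quality estimates q_c.
sizes = {"billing": 100, "coding": 300, "chitchat": 600}
quality = {"billing": 0.4, "coding": 0.7, "chitchat": 0.8}
q_bar = prevalence_weighted_quality(sizes, quality)

# Hypothetical thumbs (up, down) that over-represent unhappy billing users.
thumbs = {"billing": (2, 18), "coding": (6, 6), "chitchat": (5, 1)}
ups = sum(up for up, _ in thumbs.values())
votes = sum(up + down for up, down in thumbs.values())
naive = ups / votes  # plain average over whoever chose to leave feedback
```

In this toy setup the naive thumbs average lands far below the prevalence-weighted value, because the small but vocal cluster dominates the feedback stream.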
39. 【2605.12156】Latent Causal Void: Explicit Missing-Context Reconstruction for Misinformation Detection
Link: https://arxiv.org/abs/2605.12156
Authors: Hui Li, Zhongquan Jian, Jinsong Su, Junfeng Yao
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Keywords: Automatic misinformation detection, Automatic misinformation, article explicitly states, misinformation detection performs, deception is visible
Abstract:Automatic misinformation detection performs well when deception is visible in what an article explicitly states. However, some misinformation articles remain locally coherent and only become misleading once compared with contemporaneous reports that supply background facts the article omits. We study this omission-relevant setting and observe that current omission-aware approaches typically either attach retrieved context as auxiliary evidence or infer a categorical omission signal, leaving the specific missing fact implicit. We propose Latent Causal Void (LCV), a retrieval-guided detector that explicitly reconstructs the missing fact for each target sentence and uses it as a textual cross-source relation in graph reasoning. Concretely, LCV retrieves temporally aligned context articles, asks a frozen instruction-tuned large language model to generate a short missing-context description for each sentence-article pair, and feeds the resulting relation text into a heterograph over target sentences and context articles. On the bilingual benchmark of Sheng et al., LCV improves over the strongest omission-aware baseline by $2.56$ and $2.84$ macro-F1 points on the English and Chinese splits, respectively. The results indicate that modeling the missing cross-source fact itself, rather than only attaching retrieved evidence or predicting an omission signal, is a useful representation for omission-aware misinformation detection.
40. 【2605.12138】Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Link: https://arxiv.org/abs/2605.12138
Authors: Yexing Xu, Wei Feng, Shen Zhang, Haohan Wang, Yuxin Qin, Yaoyu Li, Ao Ma, Yuhao Luo, Lu Wang, Xudong Ren, Haoran Wang, Run Ling, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Longguang Wang, Yulan Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: challenge in e-commerce, realistic and user-preferred, key challenge, Generating realistic, Unified Advertisement Generative
Comments: 22 pages, 19 figures, CVPR 2026
Abstract:Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at this https URL.
41. 【2605.12128】Metaphor Is Not All Attention Needs
Link: https://arxiv.org/abs/2605.12128
Authors: Olga Sorokoletova, Francesco Giarrusso, Giacomo De Luca, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Marcello Galisai, Vincenzo Suriani, Daniele Nardi
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: resist harmful instructions, safety-critical applications, instructions is essential, increasingly deployed, deployed in safety-critical
Abstract:Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
42. 【2605.12096】Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward
Link: https://arxiv.org/abs/2605.12096
Authors: Nigar Alishzade, Gulchin Abdullayeva
Categories: Computation and Language (cs.CL)
Keywords: Azerbaijan Sign Language, Sign languages, Deaf communities worldwide, distinct sign languages, sign language
Abstract:Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review synthesizes literature on sign language recognition and translation for under-resourced languages, using Azerbaijan Sign Language (AzSL) as a case study. Analysis of global initiatives extracts eight actionable lessons, including community co-design, dialectal diversity capture, and privacy-preserving pose-based representations. Turkic sign languages (Kazakh, Turkish, Azerbaijani) receive special attention, as linguistic proximity enables effective transfer learning. We propose three paradigm shifts: from architecture-centric to data-centric AI, from signer-independent to signer-adaptive systems, and from reference-based to task-specific evaluation metrics. A technical roadmap for AzSL leverages lightweight MediaPipe-based architectures, community-validated annotations, and offline-first deployment. Progress requires sustained interdisciplinary collaboration centered on Deaf communities to ensure cultural authenticity, ethical governance, and practical communication benefit.
43. 【2605.12090】World Action Models: The Next Frontier in Embodied AI
Link: https://arxiv.org/abs/2605.12090
Authors: Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved strong semantic, strong semantic generalization, World Action Models, physical world evolves, embodied policy learning
Abstract:Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
44. 【2605.12055】Do Language Models Encode Knowledge of Linguistic Constraint Violations?
Link: https://arxiv.org/abs/2605.12055
Authors: Hardy, Sebastian Padó
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, predictions remain unclear, achieve strong linguistic, strong linguistic performance, Large Language
Abstract:Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the falsification criteria are not jointly satisfied across linguistic phenomena, and (2) no features are consistently shared across all categories. While some phenomena show partial evidence of selective causal structure, the overall pattern provides limited support for a unified set of grammatical violation detectors in current LMs.
45. 【2605.12047】Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition
Link: https://arxiv.org/abs/2605.12047
Authors: Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza
Categories: Computation and Language (cs.CL)
Keywords: optimized to support, aspects of linguistic, linguistic development, CDL, support language learning
Comments: 8 pages
Abstract:Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed language (ADL). We selectively remove syntactic or lexical co-occurrence information from the model training data, and evaluate the impact of these manipulations on verb meaning acquisition. While disrupting syntax impairs learning across all datasets, models trained on CDL and spoken ADL show significantly higher resilience than those trained on written input. Tracking semantic and syntactic performance over training, we observe a semantic-first trajectory, with verb meanings emerging prior to robust syntactic proficiency, an asynchrony most pronounced in the spoken domain, especially CDL. These results suggest that the advantage for verb learning previously attributed to CDL may instead reflect broader properties of the spoken register, rather than a uniquely CDL-specific optimization.
46. 【2605.12039】SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
Link: https://arxiv.org/abs/2605.12039
Authors: Xiaoyuan Li, Moxin Li, Keqin Bao, Yubo Ma, Wenjie Wang, Dayiheng Liu, Fuli Feng
Categories: Computation and Language (cs.CL)
Keywords: existing libraries store, libraries store skills, Skill libraries enable, language model agents, libraries enable large
Comments: Under Review
Abstract:Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
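The idea of retrieving an ordered skill subgraph instead of isolated skills can be illustrated with a toy graph. This is a minimal sketch, not SKILLGRAPH's retrieval procedure: the skill names, the edge list, and the choice to follow only `prerequisite` edges and order them topologically are all assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical typed edges: (source skill, target skill, relation type).
EDGES = [
    ("open_fridge", "take_item", "prerequisite"),
    ("take_item", "heat_item", "prerequisite"),
    ("find_microwave", "heat_item", "prerequisite"),
    ("heat_item", "cool_item", "co-occurrence"),
    ("take_item", "take_item_fast", "enhancement"),
]


def ordered_prerequisites(target, edges):
    """Return the target skill and its prerequisite ancestors, dependencies first."""
    prereqs = {}
    for src, dst, kind in edges:
        if kind == "prerequisite":
            prereqs.setdefault(dst, set()).add(src)

    # Walk backwards from the target to collect every required skill.
    needed = {}
    stack = [target]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed[node] = prereqs.get(node, set())
        stack.extend(needed[node])

    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(needed).static_order())


plan = ordered_prerequisites("heat_item", EDGES)
```

The resulting plan lists every prerequisite before the skill that depends on it, while skills linked only by co-occurrence or enhancement edges stay out of the ordering.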
47. 【2605.12028】Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
Link: https://arxiv.org/abs/2605.12028
Authors: David-Maximilian Caraman, Gheorghe Cosmin Silaghi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Reciprocal Rank Fusion, English-language domains, Task, Rank Fusion, dense retrieval combined
Comments: Accepted at SemEval2026, task 8: MTRAGEval
Abstract:We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
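Stage (2) of the pipeline combines BM25 and dense rankings with Reciprocal Rank Fusion, which scores each document as the sum of $1/(k + \text{rank})$ over the input rankings. The sketch below is a generic version of that formula; the constant `k=60` is a commonly used default and not a value stated in the abstract, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d3", "d1", "d2"]    # hypothetical sparse retrieval output
dense_ranking = ["d1", "d4", "d3"]   # hypothetical dense retrieval output
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["d1", "d3", "d4", "d2"]
```

Because the fusion uses only ranks, it needs no calibration between BM25 scores and embedding similarities. Documents retrieved by both systems, such as `d1` and `d3` here, rise above those found by only one.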
48. 【2605.12022】SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
Link: https://arxiv.org/abs/2605.12022
Authors: Xiaoyuan Li, Yuzhe Wang, Moxin Li, Keqin Bao, Rui Men, Yichang Zhang, Dayiheng Liu, Wenjie Wang, Fuli Feng
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, achieve strong performance, capabilities remain brittle, knowledge evaluation benchmarks
Comments: Under Review
Abstract:Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.
49. 【2605.12015】SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
Link: https://arxiv.org/abs/2605.12015
Authors: Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Keywords: packaging procedural guidance, extending large language, large language model, Reusable skills, packaging procedural
Abstract:Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.
50. 【2605.12004】Learning Agentic Policy from Action Guidance
Link: https://arxiv.org/abs/2605.12004
Authors: Yuxiang Ji, Zengbin Wang, Yong Wang, Shidong Yang, Ziyu Ma, Guanhua Chen, Zonghua Sun, Liaoni Wu, Xiangxiang Chu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, training signals emerge, Agentic reinforcement learning
Comments: Work in progress
Abstract:Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.
51. 【2605.11993】Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Link: https://arxiv.org/abs/2605.11993
Authors: Tarun Chintada, Kshetrimayum Boynao Singh, Asif Ekbal
Categories: Computation and Language (cs.CL)
Keywords: low-resource Indic languages, English to Hindi, Tamil and Kannada, Movie subtitle translation, Indic languages
Abstract:Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Among the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and contextual subtle cues that text alone often misses.
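The oracle selective grounding step described in this abstract can be sketched as a simple merge: rank baseline segments by a quality score and swap in the visually grounded translation only for the worst fraction. The function below is an illustrative assumption about how such a merge could look; the scores, segment strings, and the `fraction=0.25` setting (inside the abstract's 20-30% range) are invented.

```python
import math


def selective_grounding(baseline, visual, baseline_scores, fraction=0.25):
    """Replace the lowest-scoring `fraction` of baseline segments.

    baseline / visual: parallel lists of translated subtitle segments.
    baseline_scores: per-segment quality of the text-only output (oracle).
    Returns the merged segments and the sorted indices that were replaced.
    """
    n_replace = math.floor(len(baseline) * fraction)
    by_quality = sorted(range(len(baseline)), key=lambda i: baseline_scores[i])
    worst = by_quality[:n_replace]
    merged = list(baseline)
    for i in worst:
        merged[i] = visual[i]
    return merged, sorted(worst)


baseline = [f"text-only {i}" for i in range(8)]
visual = [f"visual {i}" for i in range(8)]
scores = [0.9, 0.3, 0.8, 0.7, 0.2, 0.95, 0.6, 0.85]  # hypothetical quality scores
merged, replaced = selective_grounding(baseline, visual, scores)
# replaced == [1, 4]
```

Only the replaced segments need visually grounded outputs, which matches the abstract's point that selective grounding requires far less visual processing than grounding every segment.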
52. 【2605.11978】On Predicting the Post-training Potential of Pre-trained LLMs
Link: https://arxiv.org/abs/2605.11978
Authors: Xiaoyuan Li, Yubo Ma, Kexin Yang, Moxin Li, Keqin Bao, Wenie Wang, Fuli Feng, Dayiheng Liu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, acquired during pre-training, Rubric-based Discriminative Evaluation
Comments: Under Review
Abstract:The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting post-training potential - forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains by fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.
53. 【2605.11964】Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging
Link: https://arxiv.org/abs/2605.11964
Authors: Maodong Li, Yancui Li, Fang Kong
Categories: Computation and Language (cs.CL)
Keywords: steer conversations proactively, dialogue system aims, pre-defined targets, specific topics, target-guided proactive dialogue
Comments: 21 pages, 9 Figures, 18 Tables
Abstract:A target-guided proactive dialogue system aims to steer conversations proactively toward pre-defined targets, such as designated keywords or specific topics. During guided conversations, dynamically modeling conversational scenarios and intent keywords to guide system utterance generation is beneficial; however, existing work largely overlooks this aspect, resulting in a mismatch with the dynamics of real-world conversations. In this paper, we jointly model user profiles and domain knowledge as conversational scenarios to introduce a scenario bias that dynamically influences system utterances, and employ intent-keyword bridging to predict intent keywords for upcoming dialogue turns, providing higher-level and more flexible guidance. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent-keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target-guided proactive dialogue systems, thereby narrowing the gap with real-world interactions.
54. 【2605.11959】Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
Link: https://arxiv.org/abs/2605.11959
Authors: Maham Nazir, Muhammad Aqeel, Richong Zhang, Francesco Setti
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal video summarization, Multimodal video, video summarization requires, summarization requires visual, language generation
Comments: Accepted to ICPR 2026
Abstract:Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation.
55. 【2605.11922】StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
Link: https://arxiv.org/abs/2605.11922
Authors: Hao Wang, Rui Li, Lei Sha, Jie M. Zhang
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: primarily supervise final, Existing code reasoning, final code outputs, methods primarily supervise, supervise final code
Abstract:Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
56. 【2605.11906】YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
Link: https://arxiv.org/abs/2605.11906
Authors: Yifan Le
Categories: Computation and Language (cs.CL)
Keywords: important post-training paradigm, paradigm for improving, abilities of large, Preference optimization, Preference
Comments: 10 pages, 2 figures. Work in progress
Abstract:Preference optimization has become an important post-training paradigm for improving the reasoning abilities of large language models. Existing methods typically rely on externally constructed preference data, using preferred and dispreferred responses as sample-level supervision. However, such external signals rarely make explicit use of capability-related information contained in the model's internal representations. For mathematical reasoning, certain neuron groups may exhibit activation patterns associated with mathematical knowledge, symbolic manipulation, or logical reasoning. Similar to reflexive behavioral signals, these internal activations may provide a coarse indication of whether the model is engaging math-related reasoning. We introduce YFPO, short for Yoked Feature Preference Optimization, a preliminary neuron-guided preference optimization framework for mathematical reasoning. YFPO first uses AttnLRP to identify math-related neurons, and then constructs an auxiliary reward from their activation margin between preferred and dispreferred responses. This design augments external preference learning with internal neuron-level signals. We conduct preliminary experiments on a small-scale language model using GSM8K as the main benchmark. Results suggest that neuron-level signals can interact with preference optimization and occasionally improve reasoning performance, offering a promising direction for more fine-grained and interpretable reasoning-oriented post-training.
57. 【2605.11887】Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Link: https://arxiv.org/abs/2605.11887
Authors: Boyi Deng, Xu Wang, Yaoning Wang, Yu Wan, Yubo Ma, Baosong Yang, Haoran Wei, Jialong Tang, Huan Lin, Ruize Gao, Tianhao Li, Qian Cao, Xuancheng Ren, Xiaodong Deng, An Yang, Fei Huang, Dayiheng Liu, Jingren Zhou
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: remain largely opaque, achieved remarkable capabilities, decision-making processes remain, processes remain largely, internal decision-making processes
Abstract:Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
58. 【2605.11862】Concordance Comparison as a Means of Assembling Local Grammars
Link: https://arxiv.org/abs/2605.11862
Authors: Juliana Pirovani, Elias de Oliveira, Eric Laporte
Categories: Computation and Language (cs.CL)
Keywords: Named Entity Recognition, Named Entity, Entity Recognition, important but non-trivial, non-trivial task
Abstract:Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LGs) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points over the state of the art for Portuguese.
59. 【2605.11856】UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
Link: https://arxiv.org/abs/2605.11856
Authors: Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal large language, Multimodal large, large language models, visual latent, visual latent reasoning
Abstract:Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
60. 【2605.11854】Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
Link: https://arxiv.org/abs/2605.11854
Authors: Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Yibing Liu, Xinyu Fu, Shi Wu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li
Categories: Computation and Language (cs.CL)
Keywords: autoregressive language models, offering stronger global, highly parallel generation, stronger global awareness, Diffusion Language Models
Comments: Under review
Abstract:Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
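One way to read "a Boltzmann distribution over predictive entropies" is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $H_i$ is the predictive entropy at masked position $i$, $\tau > 0$ is a temperature, and $M$ is the set of still-masked positions. The second identity is a standard property of the Plackett-Luce model and shows why such a distribution naturally leads to a pairwise ranking term; the paper's actual objective may differ.

```latex
% Assumed form: lower-entropy (more certain) positions are unmasked first.
P(\text{unmask } i \mid M) = \frac{\exp(-H_i/\tau)}{\sum_{j \in M} \exp(-H_j/\tau)}

% If positions are drawn one at a time without replacement from this
% distribution (a Plackett-Luce model), the probability that i is
% unmasked before j depends only on the entropy gap:
P(i \prec j) = \frac{e^{-H_i/\tau}}{e^{-H_i/\tau} + e^{-H_j/\tau}}
             = \sigma\!\left(\frac{H_j - H_i}{\tau}\right)
```

Under this reading, a pairwise loss such as $-\log \sigma((H_j - H_i)/\tau)$ over pairs where $i$ was observed to be decoded before $j$ would push the model's certainty ordering toward the observed trajectory.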
61. 【2605.11853】GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Link: https://arxiv.org/abs/2605.11853
Authors: Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: training commonly relies, LLM agents, Reinforcement learning, approach for LLM, coarse supervision
Abstract:Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
62. 【2605.11845】Probabilistic Calibration Is a Trainable Capability in Language Models
Link: https://arxiv.org/abs/2605.11845
Authors: Davide Baldelli, Sruthi Kuriakose, Maryam Hashemzadeh, Amal Zouaq, Sarath Chandar
Categories: Computation and Language (cs.CL)
Keywords: user-specified randomness constraints, satisfy user-specified randomness, randomness constraints, satisfy user-specified, user-specified randomness
Abstract:Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at this https URL.
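The "trie-derived next-token targets" mentioned in this abstract can be illustrated with a small sketch. It assumes a target distribution over complete output strings and, for simplicity, character-level tokens plus an end marker; the names `next_token_targets` and `EOS` are hypothetical, and the paper's tokenization and training setup are not reproduced here.

```python
from collections import defaultdict

EOS = "<eos>"  # hypothetical end-of-sequence marker


def next_token_targets(dist, tokenize=list):
    """Turn a distribution over whole outputs into per-prefix next-token targets.

    dist: mapping from output string to its target probability.
    Returns: mapping from prefix (tuple of tokens) to {next_token: probability}.
    """
    mass = defaultdict(lambda: defaultdict(float))
    for text, prob in dist.items():
        tokens = tokenize(text) + [EOS]
        # Walking the token sequence visits one trie node per prefix and adds
        # this output's probability mass to the outgoing edge it follows.
        for i, tok in enumerate(tokens):
            mass[tuple(tokens[:i])][tok] += prob
    targets = {}
    for prefix, nxt in mass.items():
        total = sum(nxt.values())
        targets[prefix] = {tok: m / total for tok, m in nxt.items()}
    return targets
```

Multiplying the conditional probabilities along any root-to-leaf path recovers the original probability of that output, so a model that matched these soft targets at every prefix would sample whole outputs from the desired distribution.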
63. 【2605.11836】More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing
Link: https://arxiv.org/abs/2605.11836
Authors: Xin Ma, Wei Chen, Qi Liu, Derong Xu, Zhi Zheng, Tong Xu, Enhong Chen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, Model Editing aims, Lifelong Model Editing, Large Language, Model Editing
Abstract:Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at this https URL.
64. 【2605.11800】ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
Link: https://arxiv.org/abs/2605.11800
Authors: Wenyong Zhou, Yuannuo Feng, Yizhe Chen, Taiqiang Wu, Wendong Xu, Wenbo Qi, Zhengwu Liu, Wang Kang, Ngai Wong
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large language models, switching creates memory, creates memory bandwidth, memory bandwidth bottlenecks, Large language
Comments: 11 pages, 5 figures, 4 tables
Abstract:Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and their negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under a noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6%, 58.8%, and 59.8% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
65. 【2605.11779】Choosing features for classifying multiword expressions
Link: https://arxiv.org/abs/2605.11779
Authors: Eric Laporte
Categories: Computation and Language (cs.CL)
Keywords: Multiword expressions, heterogeneous set, Multiword, Abstract, features
Abstract:Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.
66. 【2605.11775】Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Link: https://arxiv.org/abs/2605.11775
Authors: Jiazheng Zhang, Ziche Fu, Junrui Shen, Yunbin Zhao, Yunke Zhang, Zhiheng Xi, Long Ma, Chenxin An, Zhihao Zhang, Shichun Liu, Dingwei Zhu, Shihan Dou, Shaofan Liu, Han Li, Wiggin Zhou, Aiden Adams, Tao Gui, Fei Huang, Qi Zhang, Xuanjing Huang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: entropy, fundamental measure, measure for understanding, understanding and controlling, reinforcement learning
Abstract:Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
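The asymmetry described in this abstract (reinforcing high-probability tokens tends to contract entropy, while expansion tends to need lower-probability tokens) can be seen in a textbook identity for a softmax distribution: the derivative of the entropy H with respect to logit z_k is -p_k (log p_k + H). The sketch below only evaluates that identity. It is not the paper's entropy polarity, which is defined for sampled policy updates, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_gradient(logits):
    """Exact dH/dz_k = -p_k * (log p_k + H) for a softmax distribution."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]
```

The sign depends on whether log p_k exceeds -H: raising the logit of a token that is more probable than exp(-H) lowers entropy, and raising the logit of a rarer token increases it.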
67. 【2605.11774】From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
Link: https://arxiv.org/abs/2605.11774
Authors: Mingcheng Zhu, Zhiyao Luo, Yu Liu, Tingting Zhu
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: electronic health records, processing electronic health, large language models, health records, processing electronic
Comments: 21 pages, 6 figures, 13 tables
Abstract:By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens, merely 0.5-1.0% of the LLM's parameters, are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
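The core idea of merging frequently co-occurring token pairs into composite tokens resembles a byte-pair-encoding merge applied on top of an existing tokenisation. The sketch below shows a single merge step and its exact inverse under that assumption. The function names are hypothetical, and the paper's dependency-aware replacement strategy and embedding fine-tuning are not modelled.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the adjacent token pair that occurs most often, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def expand(seq, pair, new_token):
    """Invert merge_pair; assumes new_token did not occur in the original."""
    out = []
    for tok in seq:
        out.extend(pair if tok == new_token else [tok])
    return out
```

Because `expand` restores the original sequence exactly, the shorter sequence carries the same information, which is the sense in which this kind of merging is lossless.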
68. 【2605.11769】Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
Link: https://arxiv.org/abs/2605.11769
Authors: Yujing Chang, Yash Guleria, Duc-Thinh Pham, Nhut-Huy Pham, Ningli Wang, Vu N. Duong, Sameer Alam
Categories: Computation and Language (cs.CL)
Keywords: Air Traffic Control, Air Traffic, Traffic Control, safety-critical domain, interpretation of instructions
Abstract:Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
69. 【2605.11750】DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
Link: https://arxiv.org/abs/2605.11750
Authors: Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, Hengshuang Zhao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: minor action errors, VLA models, existing VLA models, VLA models rely, enables VLA models
Comments: 19 pages, 7 figures
Abstract:Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at this https URL.
70. 【2605.11744】Training-Inference Consistent Segmented Execution for Long-Context LLMs
Link: https://arxiv.org/abs/2605.11744
Authors: Xianpeng Shang, Jiang Li, Zehua Duo, Qianyi Cai, Xiangdong Su
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Transformer-based large language, Transformer-based large, large language models, language models face, models face severe
Comments: Accepted by ICML 2026. 19 pages, 6 figures, 3 tables
Abstract:Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).
71. 【2605.11739】Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
Link: https://arxiv.org/abs/2605.11739
Authors: Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang
Subjects: Computation and Language (cs.CL)
Keywords: On-policy distillation, OPD, On-policy, OPD efficiency
Comments:
Abstract:On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of 3x while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
72. 【2605.11732】AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
Link: https://arxiv.org/abs/2605.11732
Authors: Jiarui Jin, Zexuan Yan, Shijian Wang, Wenxiang Jiao, Yuan Lu
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Keywords: Collaborative agentic architecture, Disentangled and Collaborative, adversarial optimization problem, Collaborative agentic, exploration and exploitation
Comments:
Abstract:In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users' historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.
73. 【2605.11727】Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Link: https://arxiv.org/abs/2605.11727
Authors: Kepeng Xu, Li Xu, Gang He, Wenxin Yu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: models typically reason, post-ISP RGB images, Vision-language models typically, quantize sensor evidence, models typically
Comments:
Abstract:Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
74. 【2605.11689】Slicing and Dicing: Configuring Optimal Mixtures of Experts
Link: https://arxiv.org/abs/2605.11689
Authors: Margaret Li, Sneha Kudugunta, Danielle Rothermel, Luke Zettlemoyer
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: narrow configuration ranges, large language models, core design choices, load balancing, token dropping
Comments:
Abstract:Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size, and load-balancing mechanisms. First, we find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like this http URL. Second, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts, and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity; other choices have minimal effect on final quality.
75. 【2605.11685】Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
Link: https://arxiv.org/abs/2605.11685
Authors: Zeguan Xiao, Xuanzhe Xu, Yun Chen, Yong Wang, Jian Yang, Yanqing Hu, Guanhua Chen
Subjects: Computation and Language (cs.CL)
Keywords: Large language model, remove specific data, specific data influences, Large language, addressing privacy
Comments:
Abstract:Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods including sharpness-aware minimization.
76. 【2605.11663】Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
Link: https://arxiv.org/abs/2605.11663
Authors: Kyosuke Takami, Yuka Tateisi, Satoshi Sekine, Yusuke Miyao
Subjects: Computation and Language (cs.CL)
Keywords: school examinations provide, high-validity test bed, assessments remain scarce, Japan National Assessment, Authentic school examinations
Comments:
Abstract:Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: this https URL
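The abstract above scores open-ended responses with character-level F1. As a rough illustration of what such a metric typically computes, here is a minimal sketch that treats each string as a bag of characters. The function name `char_f1` and the bag-of-characters overlap are assumptions for illustration; the paper's exact scoring rules (normalization, whitespace handling) are not given in the abstract.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 using multiset (bag-of-characters) overlap.

    Hypothetical helper: the paper may normalize text differently.
    """
    pred_counts = Counter(prediction)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared character.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        # Also covers the case where either string is empty.
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Character-level matching avoids the need for word segmentation, which is one plausible reason to prefer it for Japanese text.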
77. 【2605.11651】Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Link: https://arxiv.org/abs/2605.11651
Authors: Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: limits real-world deployment, high computational cost, computational cost limits, cost limits real-world, boost reasoning performance
Comments: Pre-print
Abstract:Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
78. 【2605.11632】Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
Link: https://arxiv.org/abs/2605.11632
Authors: Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu, Jing Yang, Simon Ostermann
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Self-generated counterfactual explanations, minimally modified inputs, causally grounded approach, Self-generated counterfactual, black-box LLM behavior
Comments: In submission
Abstract:Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
79. 【2605.11629】OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
Link: https://arxiv.org/abs/2605.11629
Authors: Yuanhao Yue, Chengyu Wang, Yuanjie Lyu, Lei Shen, Jun Huang
Subjects: Computation and Language (cs.CL)
Keywords: Recent multimodal large, multimodal large language, Recent multimodal, large language models, shown strong
Comments:
Abstract:Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.
80. 【2605.11612】When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
Link: https://arxiv.org/abs/2605.11612
Authors: Ziyu Liu, Tao Li, Tianjie Ni, Xiaolong Lan, Wengang Ma, Tao Yang, Guohua Wang, Junjiang He
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: vulnerabilities widely exist, Backdoor vulnerabilities widely, large language models, vulnerabilities widely, widely exist
Comments:
Abstract:Backdoor vulnerabilities are widespread in the fine-tuning of large language models (LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of an LLM, emotion can be decoupled from semantics, forming a cluster distinct from the original neutral text. Therefore, we use the emotional factor as the backdoor trigger and propose a parasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two components: the quantification and the rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99% across both task types and four different models, while maintaining the clean utility of the models.
81. 【2605.11609】Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Link: https://arxiv.org/abs/2605.11609
Authors: Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, Xing Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: stronger external teacher, advancing reasoning capability, On-policy self-distillation, offers a promising, promising direction
Comments:
Abstract:On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
82. 【2605.11608】PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
Link: https://arxiv.org/abs/2605.11608
Authors: Chieh-Yen Lin, Shao-Hua Sun
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Comparing post-training LLM, Proxy Risk Inference, Comparing post-training, requires a diagnostic, diagnostic that identifies
Comments:
Abstract:Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
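The abstract above reports how well a risk bound ranks model variants using Spearman correlation. The sketch below shows how such a rank correlation can be computed between a list of bound values and a list of measured risk gaps. It uses plain Python with average ranks for ties; all names and the sample numbers in the usage note are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = tied_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

For example, `spearman([0.1, 0.4, 0.3, 0.9], [1.0, 3.0, 2.0, 4.0])` compares four made-up bound values against four made-up risk gaps; a value near 1 would mean the bound orders the variants the same way the measured risk does.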
83. 【2605.11601】DiffScore: Text Evaluation Beyond Autoregressive Likelihood
Link: https://arxiv.org/abs/2605.11601
Authors: Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui, Jun Zhou, Maosong Sun, Alexander Fraser
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: conflating architectural asymmetry, Diffusion Language Models, conflating architectural, Large Diffusion Language, language models
Comments:
Abstract:Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: this https URL.
84. 【2605.11582】Efficient LLM-based Advertising via Model Compression and Parallel Verification
Link: https://arxiv.org/abs/2605.11582
Authors: Wenxin Dong, Chang Gao, Guanghui Yu, Xuewu Jiao, Mingqing Hu, Qiang Fu, Peng Xu, Penghui Wei, Hui Xu, Yue Xing, Shuanglong Li, Lin Liu
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, shown remarkable potential, Large language, language models, Efficient Generative Targeting
Comments: 10 pages, 7 figures, industry paper
Abstract:Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.
85. 【2605.11581】Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
Link: https://arxiv.org/abs/2605.11581
Authors: Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu
Subjects: Computation and Language (cs.CL)
Keywords: serve real-time inference, large language models, serve real-time, millisecond range, large language
Comments: 10 pages, 8 figures
Abstract:When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.
86. 【2605.11577】BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
Link: https://arxiv.org/abs/2605.11577
Authors: Shaobin Zhuang, Yuang Ai, Jiaming Han, Xiaohui Li, Huaibo Huang, Xiangyu Yue, Xuefeng Hu, Kun Xu, Yali Wang, Hao Chen
Subjects: Computation and Language (cs.CL)
Keywords: carry meaning jointly, models generate text, including phrases, multi-token units, meaning jointly
Comments: 12 pages, 4 figures, 1 table
Abstract:Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
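The abstract above describes representing each token as a fixed-length binary code. The sketch below illustrates only that encoding step, mapping a token ID to a bit vector and back. It assumes a plain big-endian binary expansion of the integer ID; the function names are hypothetical, and BitLM's actual code assignment and diffusion head are not described in enough detail here to reproduce.

```python
from typing import List, Sequence


def bits_needed(vocab_size: int) -> int:
    """Smallest code length that can index every token in the vocabulary."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_bits(token_id: int, num_bits: int) -> List[int]:
    """Encode a token ID as a fixed-length list of 0/1 values, most significant bit first."""
    if not 0 <= token_id < 2 ** num_bits:
        raise ValueError("token_id does not fit in num_bits")
    return [(token_id >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_token(bits: Sequence[int]) -> int:
    """Decode a most-significant-bit-first bit list back to the token ID."""
    token_id = 0
    for bit in bits:
        token_id = (token_id << 1) | bit
    return token_id
```

Under this assumed encoding, a vocabulary of 50,257 tokens needs 16 bits per token, so the output layer would predict 16 binary values instead of a 50,257-way softmax.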
87. 【2605.11574】Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
Link: https://arxiv.org/abs/2605.11574
Authors: Pruthvinath Jeripity Venkata
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: ignoring provided documents, contradicting document presents, persistent empirical contradiction, studies find models, find models stubbornly
Comments: 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models
Abstract:The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p = .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
88. 【2605.11538】Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
Link: https://arxiv.org/abs/2605.11538
Authors: Cheng Wang,Qin Liu,Wenxuan Zhou,Muhao Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Group Relative Policy, Relative Policy Optimization, large language models, Group Relative, Relative Policy
Comments: ACL 2026
Abstract:Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
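The abstract above does not give the exact weighting formula, so the following is only a minimal sketch of what "down-weighting extreme token-level updates via a Gaussian kernel" could look like. It assumes the per-token quantity of interest is the product of the centered log-probability and the centered advantage (that token's contribution to the covariance), standardizes it, and applies a unit-bandwidth Gaussian kernel. The function names `gaussian_kernel_weights` and `reweight_advantages` are hypothetical and are not taken from the paper.

```python
import math


def gaussian_kernel_weights(log_probs, advantages):
    """Return one weight in (0, 1] per token; tokens with extreme terms get smaller weights."""
    n = len(log_probs)
    mean_lp = sum(log_probs) / n
    mean_adv = sum(advantages) / n
    # Per-token contribution to the covariance between log-probability and advantage
    # (an assumed choice of statistic, not confirmed by the abstract).
    cov_terms = [(lp - mean_lp) * (a - mean_adv)
                 for lp, a in zip(log_probs, advantages)]
    mu = sum(cov_terms) / n
    # Fall back to 1.0 when all terms are identical, so the division is safe.
    std = math.sqrt(sum((c - mu) ** 2 for c in cov_terms) / n) or 1.0
    # Gaussian kernel on the standardized term: weight 1 at the mean, decaying with distance.
    return [math.exp(-0.5 * ((c - mu) / std) ** 2) for c in cov_terms]


def reweight_advantages(log_probs, advantages):
    """Scale each token-level advantage by its Gaussian-kernel weight."""
    weights = gaussian_kernel_weights(log_probs, advantages)
    return [w * a for w, a in zip(weights, advantages)]
```

Standardizing before applying the kernel keeps the sketch free of a tunable bandwidth, which is one way to read "hyperparameter-free"; the paper may achieve this differently.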
89. 【2605.11533】Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Link: https://arxiv.org/abs/2605.11533
Authors: Sike Xiang,Shuang Chen,Kevin Qinghong Lin,Jialin Yu,Yijia Sun,Philip Torr,Amir Atapour-Abarghouei
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: combine page layouts, Clinical check-up reports, check-up reports, numerical biomarkers, abnormality flags
Comments:
Abstract:Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging-related evidence, and physician summaries. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
90. 【2605.11519】Controllable User Simulation
Link: https://arxiv.org/abs/2605.11519
Authors: Guy Tennenholtz,Ofer Meshi,Amir Globerson,Uri Shalit,Jihwan Jeong,Craig Boutilier
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: cover rare scenarios, testing new policies, offline datasets, datasets to evaluate, fails to cover
Comments:
Abstract:Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.
91. 【2605.11518】AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
Link: https://arxiv.org/abs/2605.11518
Authors: Taicheng Guo,Nitesh V. Chawla,Olaf Wiest,Xiangliang Zhang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Effectively configuring scalable, spanning architecture design, large language model, waste substantial computational, substantial computational resources
Comments:
Abstract:Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
92. 【2605.11513】A Study on Hidden Layer Distillation for Large Language Model Pre-Training
Link: https://arxiv.org/abs/2605.11513
Authors: Maxime Guigon,Lucas Dixon,Michaël E. Sander
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, training Large Language, Language Models, neglecting semantic information, Large Language
Comments:
Abstract:Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
93. 【2605.11502】Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations
Link: https://arxiv.org/abs/2605.11502
Authors: Shufan Ming,Joe D. Menke,Neil R. Smalheiser,Halil Kilicoglu
Categories: Computation and Language (cs.CL)
Keywords: supporting evidence synthesis, Accurately and consistently, consistently indexing biomedical, indexing biomedical literature, study design indexing
Comments: Accepted by IEEE ICHI 2026
Abstract:Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: this https URL
94. 【2605.11483】StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
Link: https://arxiv.org/abs/2605.11483
Authors: Ishmam Khan,Sindhuja Thogarrati,Shuo Zhang
Categories: Computation and Language (cs.CL)
Keywords: constraints remains underexplored, internalize nuanced philosophical, nuanced philosophical frameworks, severe data constraints, data constraints remains
Comments:
Abstract:While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism's outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.
95. 【2605.11458】Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
Link: https://arxiv.org/abs/2605.11458
Authors: Zihao Han,Tiangang Zhang,Huaibin Wang,Yilun Sun
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Keywords: recipe for LLM, LLM reasoning, privileged teacher supervises, Adaptive Teacher Exposure, rollouts while conditioning
Comments: 11 pages, 4 figures; code not released yet
Abstract:On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
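The three ingredients named in the abstract above (a Beta-distributed reveal ratio, a partially revealed reference solution, and a discounted learning-progress reward) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the reference solution is modeled as a list of steps revealed as a prefix, and "learning progress" is taken to be the step-to-step gain in some student score. The paper's controller additionally conditions on training-state statistics, which this sketch omits.

```python
import random


def sample_reveal_ratio(alpha, beta, rng):
    """Draw a reveal ratio in [0, 1] from a Beta(alpha, beta) policy."""
    return rng.betavariate(alpha, beta)


def reveal_prefix(reference_steps, ratio):
    """Show the teacher only the first `ratio` fraction of the reference reasoning."""
    keep = round(ratio * len(reference_steps))
    return reference_steps[:keep]


def discounted_progress(scores, gamma=0.9):
    """Discounted sum of successive student-score gains following one exposure decision."""
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return sum((gamma ** t) * g for t, g in enumerate(gains))
```

In a training loop, one sampled ratio would be held fixed for a short window of student updates, and the student scores observed afterwards would feed `discounted_progress` to produce the controller's reward.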
96. 【2605.11442】Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Link: https://arxiv.org/abs/2605.11442
Authors: Zi Liang,Ronghua Li,Yanyun Wang,Qingqing Ye,Haibo Hu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Model, Large Language, orchestrating complex interactions, Language Model, key intermediaries
Comments:
Abstract:Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively examined the security of LLMs and agents in isolation, the systemic risk of the agent acting as a disruptive hub within the user-agent-service chain remains largely overlooked. In this work, we expose a novel threat paradigm by introducing Mobius Injection, a sophisticated attack that weaponizes autonomous agents into zombie nodes to launch what we define as Agent-based and -Oriented DDoS (AbO-DDoS) attacks. By exploiting a structural vulnerability in agentic logic named Semantic Closure, an adversary can induce sustained recursive execution of agent components through a single textual injection. We demonstrate that this attack is exceptionally lightweight, stealthy against both traditional DDoS monitors and contemporary AI safety filters, and highly configurable, allowing for surgical targeting of specific environments or model providers. To evaluate the real-world impact, we conduct extensive experiments across three representative claw-style agents and three mainstream coding agents, integrated with 12 frontier proprietary or open-weight LLMs. Our results demonstrate that Mobius Injection achieves substantial attack success across diverse tasks, driving single-node call amplification up to 51.0x and multi-node p95 latency inflation up to 229.1x. The attack performance exhibits a superlinear increase with the number of poisoning nodes. To mitigate Mobius Injection, we propose a proactive defense mechanism using Agent Component Energy (ACE) Analysis, which detects malicious recursive triggers by measuring anomalous energy in the agent's component graph.
97. 【2605.11436】Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Link: https://arxiv.org/abs/2605.11436
Authors: Joykirat Singh,Zaid Khan,Archiki Prasad,Justin Chih-Yao Chen,Akshay Nambi,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, belief state, complex environment state, act while inferring
Comments: Code: [this https URL](https://github.com/joykirat18/Agent-BRACE)
Abstract:Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
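The structured belief described in the abstract above (atomic natural-language claims, each tagged with an ordinal certainty label from certain to unknown) maps naturally onto a small data structure. The sketch below is illustrative only: the abstract names just the two endpoints of the scale, so the intermediate labels `LIKELY` and `UNLIKELY`, the `Claim` type, and the `render_belief` helper with its `max_claims` cap are all assumptions, not the released implementation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Certainty(IntEnum):
    """Ordinal verbalized certainty; only the endpoints come from the abstract."""
    UNKNOWN = 0
    UNLIKELY = 1  # hypothetical intermediate label
    LIKELY = 2    # hypothetical intermediate label
    CERTAIN = 3


@dataclass
class Claim:
    text: str
    certainty: Certainty


def render_belief(claims, max_claims=8):
    """Serialize the most certain claims into a compact block for the policy model."""
    ranked = sorted(claims, key=lambda c: c.certainty, reverse=True)[:max_claims]
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in ranked)
```

Capping the number of rendered claims is one simple way to keep the policy's input roughly constant in size regardless of episode length, which is the property the abstract emphasizes.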
98. 【2605.11416】Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
Link: https://arxiv.org/abs/2605.11416
Authors: Yu-Hang Wu,Qin-Yuan Liu,Qiu-Yang Zhao,Bo Jiang,Jiang-Feng Yang,Qing-Wei Cong
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, empirical black-box problem, black-box problem due, Selective layer-wise updates
Comments:
Abstract:Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.
99. 【2605.11408】MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
Link: https://arxiv.org/abs/2605.11408
Authors: Bo Zheng,Yudong Chen,Zihua Xiong,Shuai Fang,Peidong He,Yang Yang,Sheng Guo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: high-stakes decision systems, Tabular data forms, systems in finance, forms the backbone, backbone of high-stakes
Comments:
Abstract:Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at scale. While foundation models have revolutionized vision and language, tabular learning still leans on handcrafted features and lacks a general self-supervised framework. We present MaskTab, a unified pre-training framework designed specifically for industrial-scale tabular data. MaskTab encodes missing values via dedicated learnable tokens, enabling the model to distinguish structural absence from random dropout. It jointly optimizes a hybrid supervised pre-training scheme--utilizing a twin-path architecture to reconcile masked reconstruction with task-specific supervision--and an MoE-augmented loss that adaptively routes features through specialized subnetworks. On industrial-scale benchmarks, it achieves +5.04% AUC and +8.28% KS over prior art under rigorous scaling. Moreover, its representations distill effectively into lightweight models, yielding +2.55% AUC and +4.85% KS under strict latency and interpretability constraints, while improving robustness to distribution shifts. Our work demonstrates that tabular data admits a foundation-model treatment--when its structural idiosyncrasies are respected.
100. 【2605.11403】FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
Link: https://arxiv.org/abs/2605.11403
Authors: Mingxiong Lin,Zhangquan Gong,Maowen Tang,Qian Li,Chuangchuang Wang,Jian Ma,Sutian Huang,Kai Tang,Haonan Lu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Group Relative Policy, Relative Policy Optimization, Verifiable Rewards, Group Relative, paradigm for LLM
Comments:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
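The two components described in the abstract above can be sketched in a few lines. The abstract fixes only the qualitative behavior: sampling weights follow a Gaussian centered near an accuracy of 0.5, and the KL coefficient is a smooth nonlinear function of batch accuracy that is looser when accuracy is low. The specific choices below (the width `sigma`, the sigmoid gate, `beta_min`, `beta_max`, `sharpness`) and all function names are hypothetical stand-ins, not the paper's values.

```python
import math
import random


def gcs_weight(accuracy, center=0.5, sigma=0.2):
    """Gaussian Curriculum Sampling weight: highest for moderately difficult questions."""
    return math.exp(-((accuracy - center) ** 2) / (2 * sigma ** 2))


def sample_questions(accuracies, k, rng=None):
    """Sample k question indices with probability proportional to their GCS weight."""
    rng = rng or random.Random(0)
    weights = [gcs_weight(a) for a in accuracies]
    return rng.choices(range(len(accuracies)), weights=weights, k=k)


def akl_coefficient(batch_accuracy, beta_min=0.0, beta_max=0.04,
                    sharpness=10.0, midpoint=0.5):
    """Accuracy-conditioned KL coefficient: small when accuracy is low, larger when high."""
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate
```

With these definitions, questions the model always or never solves receive little sampling weight, while the KL penalty tightens only once batch accuracy is already satisfactory.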
101. 【2605.11398】AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment
Link: https://arxiv.org/abs/2605.11398
Authors: Robin Linzmayer(1 and 2),Georgianna Lin(2),Di Coneybeare(3),Jason Chu(3),Trudi Cloyd(3),Manish Garg(3),Miles Gordon(3),Elizabeth Hartofilis(3),Benjamin Hong(3),Ashraf Hussain(3),Eugene Y. Kim(3),Oluchi Iheagwara King(3),Ross McCormack(3),Erica Olsen(3),John K. Riggins Jr(3),Mustafa N. Rasheed(3),Dana L. Sacco(3),Vinay Saggar(3),Osman R. Sayan(3),Amit Shembekar(3),Janice Shin-Kim(3),Wendy W. Sun(3),Bernard P. Chang(3),David Kessler(3),Noémie Elhadad(1 and 2) ((1) Department of Computer Science, Columbia University, (2) Department of Biomedical Informatics, Columbia University, (3) Department of Emergency Medicine, Columbia University Irving Medical Center)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: user medical presentations, language models identify, evaluating whether language, medical presentations, cases
Comments: 41 pages, 5 figures. Preprint under review for the Track on Evaluations and Datasets at NeurIPS 2026
Abstract:We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.
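The over-triage and under-triage directions discussed in the abstract above are straightforward to compute once acuity is treated as an ordinal scale. The sketch below assumes the four levels are encoded as integers 0 (home monitoring) through 3 (immediate emergency care), since the abstract names only the two endpoints. The function name `triage_error_rates` and the integer encoding are illustrative assumptions, not the benchmark's actual scoring code.

```python
def triage_error_rates(gold, predicted):
    """Return accuracy, over-triage rate, and under-triage rate on an ordinal acuity scale.

    Over-triage: the predicted level is more urgent than the gold label.
    Under-triage: the predicted level is less urgent than the gold label.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    over = sum(1 for g, p in zip(gold, predicted) if p > g)
    under = sum(1 for g, p in zip(gold, predicted) if p < g)
    return {
        "accuracy": (n - over - under) / n,
        "over_triage": over / n,
        "under_triage": under / n,
    }
```

Reporting the two error directions separately matters in this setting because under-triage (advising less urgent care than needed) is typically the more dangerous failure.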
102. 【2605.11388】Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
Link: https://arxiv.org/abs/2605.11388
Authors: Dean Light,Michael Theologitis,Kshitish Ghate,Shuyue Stella Li,Benjamin Newman,Chirag Shah,Aylin Caliskan,Pang Wei Koh,Dan Suciu,Yulia Tsvetkov
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Humans intuitively solve, revise intermediate goals, Humans intuitively, apply formal procedures, intuitively solve complex
Comments: Preprint under review
Abstract:Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning -- an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.
103. 【2605.11378】An Empirical Study of Automating Agent Evaluation
Link: https://arxiv.org/abs/2605.11378
Authors: Kang Zhou,Sangmin Woo,Haibo Ding,Kiran Ramnath,Subramanian Chidambaram,Aosong Feng,Vinayak Arannil,Muhyun Kim,Ishan Singh,Darren Wang,Zhichao Xu,Megha Gandhi,Nirmal Prabhu,Soumya Smruti Mishra,Vivek Singh,Gouri Pandeshwar,Lin Lee Cheong
Categories: Computation and Language (cs.CL)
Keywords: multi-step behaviors involving, behaviors involving tool, evaluation requires assessing, requires assessing complex, assessing complex multi-step
Comments:
Abstract:Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
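The Eval@1 metric in the abstract above is described as measuring whether generated evaluation code both executes and yields meaningful results on the first run. A minimal sketch of that computation follows. The `FirstRun` record and its two boolean fields are hypothetical, and how "meaningful" is judged is left outside the sketch because the abstract does not specify it.

```python
from dataclasses import dataclass


@dataclass
class FirstRun:
    """Outcome of the first execution of one generated evaluation."""
    executed: bool    # the generated code ran without crashing
    meaningful: bool  # the results were judged meaningful (criterion not given in the abstract)


def eval_at_1(runs):
    """Fraction of generated evaluations that both execute and yield meaningful results."""
    if not runs:
        return 0.0
    passed = sum(1 for r in runs if r.executed and r.meaningful)
    return passed / len(runs)
```

For example, four first runs of which two both execute and produce meaningful results give an Eval@1 of 0.5; a run that executes but produces meaningless output does not count.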
104. 【2605.11374】Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Link: https://arxiv.org/abs/2605.11374
Authors: Han Xiao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Test-time compute, widely believed, large reasoning models, Test-time, large reasoning
Comments: 37 pages, 5 figures, 16 tables
Abstract:Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Most modern embedding checkpoints are distilled from large LLM backbones and inherit their representation space; a frozen embedding model should therefore benefit from extra inference compute without retraining. Using an agentic program-search loop, we explore 259 candidate inference programs over a frozen embedding API across ninety generations. The entire Pareto frontier collapses onto a single algebra: a softmax-weighted centroid of the local top-K documents interpolated with the query. This parameter-free default lifts nDCG@10 statistically significantly across seven embedding-model families spanning a tenfold parameter range, with held-out full-BEIR validation confirming the lift on every model tested.
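The operation named in this abstract, a softmax-weighted centroid of the local top-K documents interpolated with the query, can be illustrated with a minimal sketch. The function names, the dot-product similarity, and the `k`, `alpha`, and `temperature` arguments are assumptions made for illustration only; the abstract calls the paper's default parameter-free and does not give these details, so this is not the authors' implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_query(query, docs, k=3, alpha=0.5, temperature=1.0):
    """Interpolate a query embedding with the softmax-weighted centroid
    of its k most similar document embeddings.

    k, alpha, and temperature are hypothetical knobs for this sketch.
    """
    # Keep the k documents with the highest similarity to the query.
    top_docs = sorted(docs, key=lambda d: dot(query, d), reverse=True)[:k]
    # Weight each retained document by the softmax of its similarity.
    weights = softmax([dot(query, d) for d in top_docs], temperature)
    centroid = [
        sum(w * d[i] for w, d in zip(weights, top_docs))
        for i in range(len(query))
    ]
    # Blend the original query with the centroid, then renormalize.
    mixed = [alpha * q + (1 - alpha) * c for q, c in zip(query, centroid)]
    return normalize(mixed)


if __name__ == "__main__":
    toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    print(refine_query([1.0, 0.0], toy_docs, k=2))
```

The refined vector would then replace the original query in a second retrieval pass; the embedding model itself stays frozen throughout.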
105. 【2605.11363】PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Link: https://arxiv.org/abs/2605.11363
Authors: Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Presentation, presentation video, presentation video generation, interactive delivery, moving beyond static
Abstract:Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: this https URL. Website: this https URL.
106. 【2605.11348】Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence
Link: https://arxiv.org/abs/2605.11348
Authors: Ujun Jeong, Saketh Vishnubhatla, Bohan Jiang, Andre Harrison, Adrienne Raglin, Huan Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Keywords: strengthen situational awareness, identifying factors linked, physical damage, infrastructure disruption, linked to casualties
Comments: Submitted to EMNLP
Abstract:During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
107. 【2605.11336】Much of Geospatial Web Search Is Beyond Traditional GIS
Link: https://arxiv.org/abs/2605.11336
Authors: Ilya Ilyankou, Stefano Cavazzi, James Haworth
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: labelling schemes suggest, remains poorly characterised, geospatial web search, existing labelling schemes, Web search queries
Abstract:Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope traditional GIS systems and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.
108. 【2605.11334】VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
Link: https://arxiv.org/abs/2605.11334
Authors: Jasmine Qi, Danylo Dantsev, Muyang Sun
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: practitioners lack reliable, lack reliable methods, widely deployed, deployed for automated, practitioners lack
Comments: 16 pages, 6 figures
Abstract:LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
109. 【2605.11317】SOMA: Efficient Multi-turn LLM Serving via Small Language Model
Link: https://arxiv.org/abs/2605.11317
Authors: Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: preserving conversational context, multi-turn dialogue settings, Large Language Models, large proprietary models, increasingly deployed
Abstract:Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: this https URL.
110. 【2605.11303】Predicting Psychological Well-Being from Spontaneous Speech using LLMs
Link: https://arxiv.org/abs/2605.11303
Authors: Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Ryff Psychological Well-Being, Language Models, Large Language, Ryff Psychological
Abstract:We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models' predictions.
111. 【2605.11302】A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse
Link: https://arxiv.org/abs/2605.11302
Authors: Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Kleinberg and Wei, global preference ordering, introduced by Kleinberg, global preference, preference ordering
Abstract:We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As in [arXiv:2504.14370, arXiv:2511.05295], we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors "simpler" or "more plausible" outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators, the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.
112. 【2605.11301】LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
Link: https://arxiv.org/abs/2605.11301
Authors: Xueqi Cheng, Yushun Dong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: strengths across OCR, Multimodal large language, large language models, visual question answering, chart understanding
Abstract:Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: this https URL.
113. 【2605.11299】Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
Link: https://arxiv.org/abs/2605.11299
Authors: Yizhu Jiao, Ruixiang Zhang, Richard Bai, Jiawei Han, Ronan Collobert, Yizhe Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: sparse execution feedback, receives sparse execution, Code generation, dual judgment space, solution and receives
Abstract:Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
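The data-selection step described in this abstract (label sampled candidates by execution, then keep only groups that contain both successes and failures) can be sketched as follows. This is a hedged illustration, not the authors' code: the `groups` layout, the function names, and the precomputed pass/fail flags standing in for sandbox execution are all hypothetical.

```python
def retain_mixed_groups(groups):
    """Keep only problems whose candidates include both a pass and a fail.

    groups: dict mapping a problem id to a list of (program, passed) pairs,
    where `passed` is a bool assumed to come from an execution sandbox.
    Groups that are all-pass or all-fail carry no ranking signal and are dropped.
    """
    kept = {}
    for problem_id, candidates in groups.items():
        outcomes = {passed for _, passed in candidates}
        if outcomes == {True, False}:
            kept[problem_id] = candidates
    return kept


def ranking_target(candidates):
    """Order candidates so passing programs come first (stable within ties)."""
    return sorted(candidates, key=lambda pair: not pair[1])


if __name__ == "__main__":
    sampled = {
        "p1": [("prog_a", True), ("prog_b", False)],
        "p2": [("prog_c", True), ("prog_d", True)],
        "p3": [("prog_e", False)],
    }
    for pid, cands in retain_mixed_groups(sampled).items():
        print(pid, ranking_target(cands))
```

Only the retained groups would feed the ranking objective; how the abstract's GRPO reward is computed from such rankings is not specified there and is left out of this sketch.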
114. 【2605.11290】ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
Link: https://arxiv.org/abs/2605.11290
Authors: Xueqi Cheng, Xugui Zhou, Tyler Derr, Yushun Dong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language model, selected model capabilities, applies knowledge distillation, distillation applies knowledge, aiming to compress
Abstract:Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at this https URL.
115. 【2605.11258】Unlocking LLM Creativity in Science through Analogical Reasoning
Link: https://arxiv.org/abs/2605.11258
Authors: Andrew Shen, Shaul Druckmann, James Zou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Keywords: Autonomous science promises, Autonomous science, augment scientific discovery, scientific discovery, fields like biomedicine
Abstract:Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation (ρ = 0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.
116. 【2605.11255】HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Link: https://arxiv.org/abs/2605.11255
Authors: Noam Kayzer, Dan Revital, Ori Bar Joseph, Smadar Arvatz, Or Levi, Tal Geva, Shaltiel Shmidman, Amir DN Cohen, Noam Ordan, Omer Baruch, Kate Zinkovskaia, Zevi Apini, Sarel Weinberger
Subjects: Computation and Language (cs.CL)
Keywords: Hebrew-specialized open-weight large, open-weight large language, present Hebatron, NVIDIA, large language model
Abstract:We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking (68.9%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3 architecture for any target language, and the first open-weight Hebrew-specialized MoE model with native long-context support. Model weights are released openly to support further research in Hebrew and Semitic-language NLP.
117. 【2605.11242】RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German
Link: https://arxiv.org/abs/2605.11242
Authors: Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Rubric-based Short Answer, present the RETUYT-INCO, RETUYT-INCO participation, Rubric-based Short, Unseen answers two-way
Comments: To be presented at the BEA 2026 workshop, co-located with ACL 2026
Abstract:In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
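The results above are reported in QWK, which the abstract does not spell out; it is read here as quadratic weighted kappa, the agreement metric commonly used for ordinal scoring tasks. A minimal sketch of that metric follows, assuming integer labels in `0..num_labels-1`. The function name and argument layout are illustrative and do not reflect the shared task's official scorer.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Quadratic weighted kappa between two equal-length lists of integer labels.

    Labels are assumed to lie in 0..num_labels-1, with num_labels >= 2.
    Raises ZeroDivisionError if both raters use one identical single label,
    because the expected disagreement is then zero.
    """
    n = len(rater_a)
    # Confusion matrix of observed label pairs.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Marginal label counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            # Penalty grows with the squared distance between the two labels.
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    gold = [0, 1, 2, 2, 1, 0]
    predicted = [0, 1, 1, 2, 1, 0]
    print(quadratic_weighted_kappa(gold, predicted, num_labels=3))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement worse than chance.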
118. 【2605.11212】ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Link: https://arxiv.org/abs/2605.11212
Authors: Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet
Subjects: Computation and Language (cs.CL)
Keywords: graphical user interfaces, Computer-use agents, user interfaces, graphical user, large number
Abstract:Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, using history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.
119. 【2605.11206】Instructions shape Production of Language, not Processing
Link: https://arxiv.org/abs/2605.11206
Authors: Andreas Waldis, Leshem Choshen, Yufang Hou, Yotam Perlit
Subjects: Computation and Language (cs.CL)
Keywords: trigger a production-centered, tokens, output tokens, production-centered mechanism, information
Abstract:Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.
120. 【2605.11195】How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
Link: https://arxiv.org/abs/2605.11195
Authors: Eduardo Tenorio, Karuna Bhaila, Xintao Wu
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, posing significant privacy, significant privacy risks, Large language, memorize sensitive training
Comments: 14 pages, 1 figure
Abstract:Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
121. 【2605.11167】The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models
Link: https://arxiv.org/abs/2605.11167
Authors: Cedric Flamant, Udaya Ghai, Kanna Shimizu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: tool-augmented systems communicate, Existing multi-model, serializing every exchange, output vocabulary, multi-model and tool-augmented
Comments: 9 pages main text, 5 figures, 24 pages appendix
Abstract:Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate (~1% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36% to 96%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves 1.7× the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.
122. 【2605.11153】Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary
Link: https://arxiv.org/abs/2605.11153
Authors: Ramchand Kumaresan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Keywords: parallel sigmoid gate, fed post-stack hidden, bounded temperature anneal, learnable per-adapter floor, post-stack hidden states
Abstract:We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V approx 0.048; the "widened-1536" substrate) into three factors -- a router rewrite (parallel sigmoid gate with learnable per-adapter floor and bounded temperature anneal, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death plus alpha-blend inheritance plus SVD mutation plus slot reallocation -- and report a 5-of-8 partial 2^3 factorial run at n=3 seeds and 25000 adaptation steps per cell. The attribution chain is sharp on this substrate: the router rewrite carries the entire +0.0426 nat balanced log-PPL improvement (Delta = log PPL_ref - log PPL_test, positive = improvement; t=12.86, p=0.006) attributed to "the full evolutionary system vs the static B3 baseline"; the headline full-system-vs-B3 balanced contrast itself is +0.015 nats, t=1.94, p=0.19 at n=3 and does not clear alpha=0.05. The per-domain evaluation scope is null at seed-resolution, and the lifecycle is a net drag of approx -0.028 nats (t=-4.46,p=0.047 in the primary chain). An auxiliary alpha=0 inheritance counterfactual at n=3 seeds is sign-inconsistent at the headline metric and underpowered for either an equivalence or load-bearing conclusion (corrected from an earlier arithmetic-mean aggregator that erroneously cleared inheritance; see Appendix B.11). A base-perturbation probe directionally refutes a "genomic-context" reframe of the lifecycle role. A controllable synthetic sandbox locates a substrate-conditional regime boundary: evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task; in every other regime tested it underperforms, ties, or actively degrades the gradient solution.
123. 【2605.11143】ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Link: https://arxiv.org/abs/2605.11143
Authors: Alex Stinard
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: measure clinical performance, Reasoning benchmarks measure, benchmarks measure clinical, percentage points, clean inputs
Comments: 46 pages including appendices (two-column preprint format). Under review at JAMIA. Code, frozen evaluator, and benchmark released at [this https URL](https://huggingface.co/datasets/alexstinard/epikg-clinicalbench). ClinicalBench v2 is a 400-question MIMIC-IV stress test for assertion-aware retrieval
Abstract:Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
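The paired exact McNemar test mentioned in the abstract above depends only on the discordant pairs: items that one condition answers correctly and the other does not. The following is a minimal sketch of the two-sided exact version, not the paper's frozen evaluator. The function name `mcnemar_exact` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only condition A is correct.
    c: pairs where only condition B is correct.
    Under the null hypothesis, each discordant pair favours A or B
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Probability of a result at least as extreme in the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
p_balanced = mcnemar_exact(5, 5)   # symmetric discordance
p_one_sided = mcnemar_exact(0, 6)  # all six discordant pairs favour one condition
```

With all six discordant pairs on one side, the p-value is 2 / 64 = 0.03125; with a 5 versus 5 split, the doubled tail exceeds 1 and is capped at 1.0.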
124. 【2605.11128】Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
Link: https://arxiv.org/abs/2605.11128
Authors: Amin Banayeeanzade,Qingchuan Yang,Dhruv Tarsadiya,Fatemeh Bahrani,Leonardo Blas,Alfy Samuel,Robin Jia,Meisam Razaviyayn,Sai Praneeth Karimireddy
Categories: Computation and Language (cs.CL)
Keywords: language-model applications ranging, scientific discovery, plausible outputs, essential for language-model, language-model applications
Abstract:Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity-diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated on only a few valid continuations, with a heavy tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
125. 【2605.11051】On Problems of Implicit Context Compression for Software Engineering Agents
Link: https://arxiv.org/abs/2605.11051
Authors: Kirill Gelvan,Igor Slinko,Felix Steinbauer,Egor Bogomolov,Florian Kofler,Yaroslav Zharov
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: LLM-based Software Engineering, Software Engineering agents, Engineering agents face, Software Engineering, LLM-based Software
Abstract:LLM-based Software Engineering agents face a critical bottleneck: context length limitations cause failures on complex, long-horizon tasks. One promising solution is to encode context as continuous embeddings rather than discrete tokens, enabling denser information storage. We apply the recently proposed In-Context Autoencoder for this purpose. While the method performs well on single-shot common-knowledge and code-understanding tasks, our experiments demonstrate that it fails on multi-step agentic coding tasks. In this paper, we explore this phenomenon and discuss possible factors contributing to this failure.
126. 【2605.11026】AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
Link: https://arxiv.org/abs/2605.11026
Authors: Yassin H. Rassul,Tarik A. Rashid
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Defenses against indirect, indirect prompt injection, structural weaknesses, share two structural, LLM agents share
Comments: 20 pages, 5 figures. Code: [this https URL](https://github.com/Yassin-H-Rassul/AgentShield)
Abstract:Defenses against indirect prompt injection (IPI) in tool-using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only been evaluated in English, leaving users of low-resource languages such as Kurdish and Arabic without tested protection. This paper addresses both gaps with AgentShield, a deception-based detection framework that places three layers of traps inside the agent's tool interface: fake tools, fake credentials, and allowlisted parameters. The same trap triggers serve as high-precision labels for a self-supervised classifier. An LLM agent that follows an attacker's hidden instruction almost always touches one of these traps, which gives both a real-time compromise signal and a zero-FP label for training a downstream detector without manual annotation. Across 176 cross-lingual attack prompts and four LLMs from three providers, modern LLMs already refuse most IPI attempts on their own (attack success rate = 10%), so AgentShield's job is to catch the attacks that do slip through. On commercial models, it catches 90.7%-100% of such successful attacks, with zero false alarms on 485 normal-use tests. It survives a systematic adaptive-attack evaluation with zero evasion on commercial models, and the self-supervised classifier transfers across models and languages without retraining.
127. 【2605.10971】Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Link: https://arxiv.org/abs/2605.10971
Authors: Hanhan Zhou,Shamik Roy,Rashmi Gangadharaiah
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Discrete diffusion language, diffusion language models, Discrete diffusion, autoregressive models, language models
Comments: preprint, 47 pages
Abstract:Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising step. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost-control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. Especially on challenging simultaneous three-attribute control, it reaches up to 93% steering strength, beating the strongest baseline by up to 15 percentage points while preserving generation quality.
128. 【2605.10082】FERA: Uncertainty-Aware Federated Reasoning for Large Language Models
Link: https://arxiv.org/abs/2605.10082
Authors: Ruhan Wang,Chengkai Huang,Zhiyong Wang,Junda Wu,Rui Wang,Tong Yu,Julian McAuley,Lina Yao,Dongruo Zhou
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, Large language, exhibit strong reasoning, strong reasoning capabilities, language models
Comments: 44 pages, 8 figures
Abstract:Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
129. 【2605.06940】MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
Link: https://arxiv.org/abs/2605.06940
Authors: Souvik Pramanik,S.M. Riaz Rahman Antu,Shak Mohammad Abyad,Md. Ibrahim Khalil,Md. Shahriar Hussain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Language Models, Large Language, automation via Large, low-resource languages
Comments: 21 pages, 14 figures, 13 tables
Abstract:Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we prove that it represents a "label agreement illusion", statistically validated via almost null Fleiss' Kappa ($\kappa \approx -0.001$) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.
Information Retrieval
1. 【2605.12487】Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Link: https://arxiv.org/abs/2605.12487
Authors: Ariel Gera,Shir Ashury-Tahan,Gal Bloch,Ohad Eytan,Assaf Toledo
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: query refinement paradigm, LLM-guided query refinement, challenging zero-shot search, explore the effectiveness, paradigm for extending
Abstract:We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at this https URL.
2. 【2605.12419】ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
Link: https://arxiv.org/abs/2605.12419
Authors: Neha Verma,Nikhil Mehta,Shao-Chuan Wang,Naijing Zhang,Alicia Tsai,Li Wei,Lukasz Heldt,Lichan Hong,Ed Chi,Xinyang Yi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: language-based reasoning abilities, large language model, language-based reasoning, reasoning abilities, rapid advancements
Abstract:Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
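The mechanism described in the abstract above (track the distance to the original weights, then apply weight averaging once it exceeds a threshold) can be sketched in a few lines. This is an illustrative sketch only, not the ORBIT implementation: the L2 distance, the fixed mixing coefficient `mix`, and the names `l2_distance` and `regulate_drift` are all assumptions, and weights are represented as flat Python lists for simplicity.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flat weight vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def regulate_drift(current, origin, max_distance, mix=0.5):
    """Pull the weights back toward the origin when they drift too far.

    current:      fine-tuned weights at this training step.
    origin:       weights of the original (pre-fine-tuning) model.
    max_distance: drift threshold that triggers averaging.
    mix:          weight given to the origin in the average (hypothetical).
    """
    if l2_distance(current, origin) <= max_distance:
        # Within the allowed radius: leave the fine-tuned weights untouched.
        return list(current)
    # Outside the radius: interpolate between fine-tuned and original weights.
    return [mix * o + (1.0 - mix) * c for c, o in zip(current, origin)]


origin = [0.0, 0.0]
near = regulate_drift([1.0, 0.0], origin, max_distance=2.0)  # unchanged
far = regulate_drift([3.0, 4.0], origin, max_distance=2.0)   # averaged toward origin
```

In a training loop, a check like this would run periodically between optimizer steps; how often to check and how strongly to average are design choices the abstract does not specify.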
3. 【2605.12398】Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
Link: https://arxiv.org/abs/2605.12398
Authors: Jamshid Mozafari,Bhawna Piryani,Adam Jatowt
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: improving large language, Estimating question difficulty, Estimating question, large language models, critical component
Comments: Accepted at ACL 2026
Abstract:Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce the Q-DAPS (Question Difficulty based on Answer Plausibility Scores) method, a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC), demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
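The core quantity in the abstract above, the entropy of plausibility scores over candidate answers, can be illustrated with a short sketch. It is not the Q-DAPS implementation: normalizing the scores by their sum, using the natural logarithm, and reading higher entropy as higher difficulty are assumptions, and the function name `plausibility_entropy` is hypothetical.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in nats) of normalized plausibility scores.

    scores: non-negative plausibility scores, one per candidate answer.
    Scores are normalized to sum to 1 before the entropy is computed
    (an assumed normalization; the abstract does not specify one).
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    probs = [s / total for s in scores]
    # Zero-probability candidates contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# One clearly plausible answer: no uncertainty among candidates.
easy = plausibility_entropy([1.0, 0.0, 0.0, 0.0])
# Four equally plausible answers: maximum uncertainty, log(4) nats.
hard = plausibility_entropy([1.0, 1.0, 1.0, 1.0])
```

Under this reading, a question whose candidates are all about equally plausible scores near the maximum of log(n) for n candidates, while a question with one dominant candidate scores near zero.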
4. 【2605.12370】Context Convergence Improves Answering Inferential Questions
Link: https://arxiv.org/abs/2605.12370
Authors: Jamshid Mozafari,Bhawna Piryani,Adam Jatowt
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, open-domain Question Answering, Language Models, Large Language, Question Answering
Comments: Accepted at SIGIR 2026
Abstract:While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions, where answers must be derived rather than directly retrieved, remains underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.
5. 【2605.12361】MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Link: https://arxiv.org/abs/2605.12361
Authors: Rezarta Islamaj,Robert Leaman,Joey Chan,Nicholas Wan,Qiao Jin,Natalie Xie,John Wilbur,Shubo Tian,Lana Yeganova,Po-Ting Lai,Chih-Hsuan Wei,Yifan Yang,Yao Ge,Qingqing Zhu,Zhizheng Wang,Zhiyong Lu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Evaluating large language, model capabilities improve, Evaluating large, large language models, capabilities improve
Abstract:Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
6. 【2605.12335】EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
Link: https://arxiv.org/abs/2605.12335
Authors: Saeed Shurrab,Mariam Al-Omari,Dana El Samad,Farah E. Shamout
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Electronic Health Records, Electronic Health, Health Records, predictive modeling applications, rich longitudinal patient
Comments: Retrieval Augmented EHR Foundation Model
Abstract:Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.
7. 【2605.12313】Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
Link: https://arxiv.org/abs/2605.12313
Authors: Rezarta Islamaj,Joey Chan,Robert Leaman,Jongmyung Jung,Hyeongsoon Hwang,Quoc-An Nguyen,Hoang-Quynh Le,Harikrishnan Gurushankar Saisudha,Ganesh Chandrasekar,Rustam R. Taktashov,Nadezhda Yu. Bizyukova,Sofia I. R. Conceição,Paulo R. C. Lopes,Reem Abdel Salam,Mary Adewunmi,Zhiyong Lu
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Multi-hop question answering, remains a significant, biomedical domain, multiple sources, answer complex questions
Abstract:Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning for large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: this https URL and benchmark this https URL
8. 【2605.12272】BatchBench: Toward a Workload-Aware Benchmark for Autoscaling Policies in Big Data Batch Processing -- A Proposed Framework
Link: https://arxiv.org/abs/2605.12272
Authors: Venkata Krishna Prasanth Budigi,Siri Chandana Sirigiri
Categories: Information Retrieval (cs.IR); Databases (cs.DB)
Keywords: cloud-native big data, large language model, big data processing, large language, expectation for cloud-native
Comments: 5 pages, 1 table, position paper. Reference implementation in active development. Empirical follow-up to appear
Abstract:Autoscaling has become a baseline expectation for cloud-native big data processing, and the design space has expanded beyond rule-based heuristics to include learned controllers and, most recently, large language model (LLM) agents. Yet despite a growing body of work spanning these paradigms, the community lacks a shared benchmark for comparing them. Existing evaluations rely on synthetic TPC-style queries, vendor blog posts with proprietary baselines, or narrow trace replays. Each new policy reports favorable numbers against a different baseline, on a different workload, with a different cost model, making cross-paper comparison effectively impossible. This is a position paper. We propose BatchBench, an open benchmarking framework designed to place rule-based, learned, and agentic autoscaling policies on equal experimental footing. The contribution is the design of the framework, not empirical results. We contribute: (1) a workload taxonomy of six batch processing classes synthesized from published autoscaling benchmarks and publicly released cluster traces; (2) the design of a parameterized workload generator with a validation methodology based on two-sample Kolmogorov-Smirnov and earth-mover distance; (3) a five-axis evaluation harness specification covering cost, SLA attainment, scaling responsiveness, scaling thrash, and decision interpretability, with first-class accounting for LLM inference cost; and (4) a standardized agent interface that lets LLM-based and reinforcement-learning autoscalers be evaluated alongside rule-based controllers with a single API. We discuss the expected evaluation surface, identify open research questions the framework is designed to answer, and outline a roadmap for the empirical paper that will follow. BatchBench's reference implementation is in active development and will be released as open source.
9. 【2605.12226】Unlocking Crowdsourcing for Ontology Matching Validation
Link: https://arxiv.org/abs/2605.12226
Authors: Zhangcheng Qiang
Categories: Information Retrieval (cs.IR)
Keywords: large language models, Recent advances, language models, pose new challenges, ontology matching
Comments: 4 pages, 1 figure
Abstract:Recent advances in large language models (LLMs) pose new challenges for ontology matching (OM). While OM systems built on LLMs have shown remarkable capabilities in discovering more mappings, traditional OM validation that relies on domain experts has become overwhelming. In this study, we explore the use of crowdsourcing for OM validation and introduce a novel crowdsourcing system. We propose three domain-specific mechanisms, namely differential trustworthiness, coherence pre-filling, and time-dependent beliefs, to ensure the quality of crowdsourcing for OM validation. We demonstrate that our crowdsourcing system can be integrated with state-of-the-art OM systems to enable human-in-the-loop validation. Two real-world use cases illustrate the effectiveness of our crowdsourcing system.
10. 【2605.12138】Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Link: https://arxiv.org/abs/2605.12138
Authors: Yexing Xu,Wei Feng,Shen Zhang,Haohan Wang,Yuxin Qin,Yaoyu Li,Ao Ma,Yuhao Luo,Lu Wang,Xudong Ren,Haoran Wang,Run Ling,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Longguang Wang,Yulan Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: challenge in e-commerce, realistic and user-preferred, key challenge, Generating realistic, Unified Advertisement Generative
Comments: 22 pages, 19 figures, CVPR 2026
Abstract:Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at this https URL.
11. 【2605.12028】Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
Link: https://arxiv.org/abs/2605.12028
Authors: David-Maximilian Caraman,Gheorghe Cosmin Silaghi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Reciprocal Rank Fusion, English-language domains, Task, Rank Fusion, dense retrieval combined
Comments: Accepted at SemEval2026, task 8: MTRAGEval
Abstract:We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
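Stage (2) above fuses a BM25 ranking and a dense ranking with Reciprocal Rank Fusion. A minimal sketch of that fusion step follows. The constant `k=60`, the function name, and the toy document ids are assumptions for illustration; the abstract does not state the authors' settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    k=60 is a commonly used default and an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a sparse and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.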
12. 【2605.11958】From Trajectories to Phenotypes: Disease Progression as Structural Priors for Multi-organ Imaging Representation Learning
Link: https://arxiv.org/abs/2605.11958
Authors: Zian Wang,Lizhen Lan,Guangming Wang,Haosen Zhang,Minxuan Xu,Qing Li,Tianxing He,Mo Yang,Wenyue Mao,Yajing Zhang,Yan Li,Chengyan Wang
Categories: Information Retrieval (cs.IR)
Keywords: summarize multi-organ physiology, Imaging-derived phenotypes, summarize multi-organ, evolve over time, multi-organ physiology
Comments:
Abstract:Imaging-derived phenotypes (IDPs) summarize multi-organ physiology but provide only static snapshots of diseases that evolve over time. In contrast, longitudinal electronic health records encode disease trajectories through temporal dependencies among past diagnosis events and comorbidity structure. We hypothesize that IDPs and disease trajectories contain partially shared disease-relevant structure. We propose a trajectory-aware distillation framework that transfers structural knowledge from a generative disease trajectory Transformer into an organ-wise IDP encoder. A population-scale trajectory model trained on longitudinal diagnosis sequences produces subject-level embeddings that supervise IDP representation learning via geometry-preserving alignment. During downstream prediction, trajectory and imaging representations can also be fused via cross-attention. Across 159 diseases in the UK Biobank cohort, trajectory-aware pretraining consistently improves both discrimination (AUC) and time-to-onset prediction (MAE), with the largest gains for low-prevalence diseases. Similarity relationships in IDP embedding space also align with those in trajectory space, providing supportive evidence for partially aligned representation geometry. These results suggest that population-scale generative disease models can serve as structural priors for data-limited imaging modalities, improving robustness under realistic cohort constraints.
13. 【2605.11921】On the LSH Distortion of Ulam and Cayley Similarities
Link: https://arxiv.org/abs/2605.11921
Authors: Flavio Chierichetti,Mirko Giacchini,Ravi Kumar,Erasmo Tani
Categories: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Keywords: nearest neighbor search, accelerate nearest neighbor, Locality-sensitive hashing, LSH distortion, LSH
Comments:
Abstract:Locality-sensitive hashing (LSH) has found widespread use as a fundamental primitive, particularly to accelerate nearest neighbor search. An LSH scheme for a similarity function $S:\mathcal{X} \times \mathcal{X} \to [0,1]$ is a distribution over hash functions on $\mathcal{X}$ with the property that the probability of collision of any two elements $x,y\in \mathcal{X}$ is exactly equal to $S(x,y)$. However, not all similarity functions admit exact LSH schemes. The notion of LSH distortion measures how multiplicatively close a similarity function is to having an LSH scheme. In this work, we study the LSH distortion of the Ulam and Cayley similarities, which are popular similarity measures on permutations of $n$ elements. We show that the Ulam similarity admits a sublinear LSH distortion of $O(n / \sqrt{\log n})$; we also prove a lower bound of $\Omega(n^{0.12})$ on the best LSH distortion achievable. On the other hand, we show that the LSH distortion of the Cayley similarity is $\Theta(n)$.
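To make the definition of an exact LSH scheme concrete, here is a small sketch using a different, textbook similarity: Jaccard similarity on sets with min-wise hashing. This is not the Ulam or Cayley setting studied in the paper; it only illustrates the property that the collision probability equals $S(x,y)$. The sets and the universe are made-up toy data.

```python
from fractions import Fraction
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| as an exact fraction."""
    return Fraction(len(a & b), len(a | b))


def collision_probability(a, b, universe):
    """Exact collision probability of min-wise hashing.

    The hash of a set is its first element under a uniformly random
    ordering of the universe; we enumerate every ordering.
    """
    hits = total = 0
    for perm in permutations(universe):
        rank = {item: i for i, item in enumerate(perm)}
        h_a = min(a, key=rank.__getitem__)
        h_b = min(b, key=rank.__getitem__)
        if h_a == h_b:
            hits += 1
        total += 1
    return Fraction(hits, total)


universe = ["a", "b", "c", "d", "e"]
x = {"a", "b", "c"}
y = {"b", "c", "d"}
print(jaccard(x, y), collision_probability(x, y, universe))  # 1/2 1/2
```

For similarities without such an exact scheme, the LSH distortion discussed in the abstract quantifies how far they are from having one.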
14. 【2605.11874】RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems
Link: https://arxiv.org/abs/2605.11874
Authors: Wenwen Zeng,Jinhui Zhang,Hao Chen,Zhaoyu Hu,Yongqi Liang,Jiajun Chai,Dengcan Liu,Zhenfeng Liu,Shurui Yan,Minglong Xue,Xiaohan Wang,Wei Lin,Guojun Yin
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Model, Large Language, simple query-item matching, integration of Large, Language Model
Comments:
Abstract:The integration of Large Language Model (LLM) agents is transforming recommender systems from simple query-item matching towards deeply personalized and interactive recommendations. Reinforcement Learning (RL) provides an essential framework for the optimization of these agents in recommendation tasks. However, current methodologies remain limited by a reliance on single dimensional outcome-based rewards that focus exclusively on final user interactions, overlooking critical intermediate capabilities, such as instruction following and complex intent understanding. Despite the necessity for designing multi-dimensional reward, the field lacks a standardized benchmark to facilitate this development. To bridge this gap, we introduce RecRM-Bench, the largest and most comprehensive benchmark to date for agentic recommender systems. It comprises over 1 million structured entries across four core evaluation dimensions: instruction following, factual consistency, query-item relevance, and fine-grained user behavior prediction. By supporting comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling, RecRM-Bench provides a foundational dataset for training sophisticated reward models. Furthermore, we propose a systematic framework for the construction of multi-dimensional reward models and the integration of a hybrid reward function, establishing a robust foundation for developing reliable and highly capable agentic recommender systems. The complete RecRM-Bench dataset is publicly available at this https URL.
15. 【2605.11864】Very Efficient Listwise Multimodal Reranking for Long Documents
Link: https://arxiv.org/abs/2605.11864
Authors: Yiqun Sun,Pengfei Wei,Lawrence B. Hsieh
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: computationally expensive component, multimodal retrieval-augmented generation, retrieval-augmented generation, key yet computationally, computationally expensive
Comments: To appear in ICML 2026
Abstract:Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at this https URL.
16. 【2605.11732】AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
Link: https://arxiv.org/abs/2605.11732
Authors: Jiarui Jin,Zexuan Yan,Shijian Wang,Wenxiang Jiao,Yuan Lu
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Keywords: Collaborative agentic architecture, Disentangled and Collaborative, adversarial optimization problem, Collaborative agentic, exploration and exploitation
Comments:
Abstract:In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users' historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.
17. 【2605.11707】Quality-Aware Collaborative Multi-Positive Contrastive Learning for Sequential Recommendation
Link: https://arxiv.org/abs/2605.11707
Authors: Wei Wang
Categories: Information Retrieval (cs.IR)
Keywords: consistent and diverse, semantically consistent, contrastive learning, sequential recommendation hinges, sequential recommendation
Comments:
Abstract:The effectiveness of contrastive learning in sequential recommendation hinges on the construction of contrastive views, which ideally should be both semantically consistent and diverse. However, most existing CL-based methods rely on heuristic augmentations that are prone to removing crucial items or disrupting transition patterns, leading to semantic drift. While a few studies have explored learnable augmentations to improve view quality, they often suffer from limited diversity and still necessitate heuristic aids. Furthermore, the quality differences across views are rarely modeled explicitly and adaptively, aggravating the false-positive issue. To address these issues, we propose Quality-aware Collaborative Multi-Positive Contrastive Learning (QCMP-CL) for sequential recommendation. First, we introduce a learnable collaborative sequence augmentation module that generates two augmented views under two complementary collaborative contexts, one based on same-target sequences and the other on similar sequences, thereby enhancing view diversity while preserving intent. Second, we design a quality-aware mechanism, tightly integrated into the model representations, which estimates each view's quality from the confidence of its augmentation operations and assigns adaptive weights to ensure that high-confidence views contribute more supervision while low-confidence ones contribute less. Extensive experiments on three real-world datasets demonstrate that QCMP-CL outperforms state-of-the-art CL-based sequential recommendation baselines.
18. 【2605.11662】HSUGA: LLM-Enhanced Recommendation with Hierarchical Semantic Understanding and Group-Aware Alignment
Link: https://arxiv.org/abs/2605.11662
Authors: Guorui Li,Dugang Liu,Lei Li,Xing Tang,Zhong Ming
Categories: Information Retrieval (cs.IR)
Keywords: Large language model, enhanced sequential recommendation, sequential recommendation typically, recommendation typically aims, Large language
Comments: Accepted by ACL 2026 Findings
Abstract:Large language model (LLM)-enhanced sequential recommendation typically aims to improve two core components: user semantic embedding extraction and utilization. Despite promising results, existing methods still have two limitations: 1) In the extraction stage, most methods directly input long interaction sequence fragments into LLM for preference summarization. However, excessively long sequences increase inference difficulty, making it challenging to reliably infer accurate user embeddings. 2) In the utilization stage, most methods employ the same semantic embedding utilization strategy for all users, neglecting the differences caused by user activity levels, leading to suboptimal performance. To address these issues, we propose HSUGA, which introduces a simple yet effective plugin for each of the two core components: Hierarchical Semantic Understanding (HSU) and Group-Aware Alignment (GAA). HSU performs a staged two-phase preference mining and models preference evolution through constrained editing operations, thereby improving the reliability of user semantic extraction. GAA adjusts the intensity of semantic utilization based on user activity levels, providing weaker alignment for active users and stronger guidance for users with sparse historical data. Finally, extensive experiments on three benchmark datasets demonstrate the effectiveness and compatibility of HSUGA.
19. 【2605.11553】wiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning
Link: https://arxiv.org/abs/2605.11553
Authors: Shiteng Cao,Kaian Jiang,Yunlong Gong,Zhiheng Li
Categories: Information Retrieval (cs.IR)
Keywords: Semantic IDs, fast direct generation, fixed inference strategy, Generative recommendation, existing methods apply
Comments: 16 pages, 3 figures
Abstract:Generative recommendation with Semantic IDs (SIDs) has emerged as a promising paradigm, yet existing methods apply a fixed inference strategy, either fast direct generation or slow chain-of-thought reasoning, uniformly across all user histories. This approach creates a trade-off: a fast recommendation model produces suboptimal accuracy on hard samples, while always invoking slow reasoning incurs prohibitive latency and wastes computation on easy cases. To address this, we propose Think Fast, Think Slow, Then Act, a framework that learns to adaptively allocate reasoning effort per user sequence. Our system equips an LLM with three complementary tools: a fast SID-based retriever, a lightweight candidate ranker, and a slow reasoning model that generates explicit rationales before recommending. Crucially, we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural language explanations. A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke. Experiments on three datasets demonstrate that our method outperforms strong baselines, achieving consistent accuracy gains while reducing inference latency compared to uniform slow reasoning.
20. 【2605.11447】Conditional Memory Enhanced Item Representation for Generative Recommendation
Link: https://arxiv.org/abs/2605.11447
Authors: Ziwei Liu,Yejing Wang,Shengyu Zhou,Xinhang Li,Xiangyu Zhao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Generative recommendation, predicts target items, SID, promising paradigm, paradigm that predicts
Comments:
Abstract:Generative recommendation (GR) has emerged as a promising paradigm that predicts target items by autoregressively generating their semantic identifiers (SID). Most GR methods follow a quantization-representation-generation pipeline, first assigning each item a SID, then constructing input representations from SID-token embeddings, and finally predicting the target SID through autoregressive generation. Existing item-level representation constructions mainly take two forms: directly merging SID-token embeddings into a compact vector, or enriching item-level representations with external inputs through additional networks. However, these item-level constructors still expose two practical challenges: direct merging may amplify the information loss caused by quantization and ID collision while obscuring SID code relations, whereas external-input-based methods can strengthen item semantics but cannot reliably preserve the SID-structured evidence required for token-level generation. These limitations make representation construction an underexplored bottleneck, leading to two severe problems, i.e., the Identity-Structure Preservation Conflict and Input-Output Granularity Mismatch. To this end, we propose ComeIR, a Conditional Memory enhanced Item Representation framework that reconstructs SID-token embeddings into item-aware inputs and restores the token granularity during SID decoding. Specifically, MM-guided token scoring adaptively estimates the contribution of each code within the SID, dual-level Engram memory captures intra-item code composition and inter-item transition patterns, and a memory-restoring prediction head reuses the memories during SID decoding. Extensive experiments demonstrate the effectiveness and flexibility of ComeIR, and further reveal scalable gains from enlarging conditional memory.
21. 【2605.11433】FedMM: Federated Collaborative Signal Quantization for Multi-Market CTR Prediction
Link: https://arxiv.org/abs/2605.11433
Authors: Jun Zhang,Dugang Liu,Xing Tang,Xiuqiang He,Zhong Ming
Categories: Information Retrieval (cs.IR)
Keywords: Netflix serve users, Amazon and Netflix, Online platforms, Netflix serve, countries and regions
Comments: Accepted by SIGIR 2026
Abstract:Online platforms such as Amazon and Netflix serve users across multiple countries and regions, underscoring the importance of multi-market recommendation (MMR). Most MMR methods adopt a pre-training and fine-tuning paradigm, in which a unified model is first trained on centralized, global data and subsequently adapted to specific markets. However, this approach ignores the privacy of market data. While traditional federated learning preserves privacy, it typically aims to obtain a global model by aggregating model parameters and does not account for significant market heterogeneity. Additionally, because ID spaces are disjoint across markets, embedding-based aggregation strategies become ineffective. To overcome these challenges, we propose a federated collaborative signal quantization (FedMM) method for multi-market click-through rate (CTR) prediction. Our core idea leverages a discrete codebook mechanism to achieve privacy-preserving transmission and align disjoint ID spaces. We further employ a hierarchical codebook structure to capture cross-market shared patterns and market-specific characteristics. Specifically, we deploy a residual quantized variational autoencoder (RQ-VAE) with a dual-layer codebook mechanism for each market to quantize collaborative embeddings. The first layer utilizes a global federated codebook, updated via aggregation to capture universally shared collaborative patterns, while the second layer maintains a local codebook to learn market-specific semantics. Finally, the learned discrete codes, which integrate both general and specific collaborative signals, are incorporated into downstream CTR models to enhance prediction accuracy across all markets. Extensive experiments on benchmark datasets demonstrate that FedMM significantly improves recommendation performance with privacy guarantees.
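The dual-layer codebook idea can be pictured as two-step residual quantization: pick the nearest global code, then quantize what is left with a local code. The sketch below shows only that lookup step with fixed toy codebooks. It leaves out the encoder, the training losses, and the federated aggregation, and every name and number in it is an assumption rather than the authors' implementation.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vec, global_codebook, local_codebook):
    """Quantize vec with a shared first layer and a market-local second layer."""
    g = nearest(global_codebook, vec)
    # The second layer only has to explain what the first layer missed.
    residual = [v - c for v, c in zip(vec, global_codebook[g])]
    l = nearest(local_codebook, residual)
    recon = [a + b for a, b in zip(global_codebook[g], local_codebook[l])]
    return (g, l), recon


# Hypothetical codebooks: the global one would be shared across markets,
# the local one kept per market.
global_codebook = [[0.0, 0.0], [1.0, 1.0]]
local_codebook = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
codes, recon = residual_quantize([1.2, 0.8], global_codebook, local_codebook)
print(codes, recon)  # (1, 1) [1.25, 0.75]
```

Only the discrete code pair would need to leave a market in such a setup, which is what makes a code-based exchange attractive when item ID spaces are disjoint.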
22. 【2605.11374】Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Link: https://arxiv.org/abs/2605.11374
Authors: Han Xiao
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Test-time compute, widely believed, large reasoning models, Test-time, large reasoning
Comments: 37 pages, 5 figures, 16 tables
Abstract:Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Most modern embedding checkpoints are distilled from large LLM backbones and inherit their representation space; a frozen embedding model should therefore benefit from extra inference compute without retraining. Using an agentic program-search loop, we explore 259 candidate inference programs over a frozen embedding API across ninety generations. The entire Pareto frontier collapses onto a single algebra: a softmax-weighted centroid of the local top-K documents interpolated with the query. This parameter-free default lifts nDCG@10 statistically significantly across seven embedding-model families spanning a tenfold parameter range, with held-out full-BEIR validation confirming the lift on every model tested.
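One way to read "a softmax-weighted centroid of the local top-K documents interpolated with the query" is sketched below. The abstract does not give the exact formula, so the mixing weight `alpha`, the `temperature`, the value of `top_k`, and the final renormalization are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def refine_query(query, docs, top_k=2, alpha=0.5, temperature=1.0):
    """Mix the query with a softmax-weighted centroid of its top_k neighbours."""
    top = sorted(docs, key=lambda d: dot(query, d), reverse=True)[:top_k]
    sims = [dot(query, d) / temperature for d in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(sims)
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    centroid = [sum(w * d[i] for w, d in zip(weights, top))
                for i in range(len(query))]
    mixed = [alpha * q + (1 - alpha) * c for q, c in zip(query, centroid)]
    return normalize(mixed)


# Toy unit-norm embeddings standing in for a frozen embedding model's output.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
refined = refine_query(query, docs)
```

The refined vector can then be used for a second retrieval pass against the same frozen index, so the extra cost is inference compute only.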
23. 【2605.11348】Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence
Link: https://arxiv.org/abs/2605.11348
Authors: Ujun Jeong,Saketh Vishnubhatla,Bohan Jiang,Andre Harrison,Adrienne Raglin,Huan Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Keywords: strengthen situational awareness, identifying factors linked, physical damage, infrastructure disruption, linked to casualties
Comments: Submitted to EMNLP
Abstract:During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
24. 【2605.11336】Much of Geospatial Web Search Is Beyond Traditional GIS
Link: https://arxiv.org/abs/2605.11336
Authors: Ilya Ilyankou,Stefano Cavazzi,James Haworth
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: labelling schemes suggest, remains poorly characterised, geospatial web search, existing labelling schemes, Web search queries
Comments:
Abstract:Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope traditional GIS systems and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.
25. 【2605.11334】VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
Link: https://arxiv.org/abs/2605.11334
Authors: Jasmine Qi,Danylo Dantsev,Muyang Sun
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: practitioners lack reliable, lack reliable methods, widely deployed, deployed for automated, practitioners lack
Comments: 16 pages, 6 figures
Abstract:LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
26. 【2605.11325】Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory
Link: https://arxiv.org/abs/2605.11325
Authors: Jeffrey Flynt
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: vocabulary, Stateless LLM sessions, retrieval, similarity, cs.IR
Comments:
Abstract:Why do we need another AI to help the AI? We argue you don't. Stateless LLM sessions impose re-orientation costs on iterative, session-heavy workflows. Prior work addresses cross-session memory through retrieval-augmented approaches: store history, embed it, retrieve by semantic similarity. Cross-session memory is a state management problem, not a search problem. Similarity search fails for named entity resolution within bounded vocabulary contexts because beliefs about a shared technical domain are semantically proximate by construction. A single user is the simplest bounded vocabulary context; engineering teams converge on the same property through shared codebases and terminology. We present Tenure, a local-first proxy that maintains a typed belief store with epistemic status, versioned supersession, and scope isolation, injecting curated context into every LLM session through precision-first retrieval. Hard scope isolation provides a structural guarantee: the right beliefs surface, and only within the boundaries the user has authorized. Tenure's typed schema converts extracted facts into imperative instructions via a why it matters field, making injected beliefs directly actionable rather than raw material for the model to re-derive. A controlled evaluation on 72 retrieval cases demonstrates the gap. Cosine similarity over dense embeddings achieves mean precision of 0.12. Alias-weighted BM25 maintains mean precision of 1.0, passing 72/72 cases versus 8/72 for cosine similarity on the same corpus. Hybrid retrieval typically solves vocabulary mismatch between disparate authors; Tenure eliminates this structurally: query and belief authors are the same person, and an alias enrichment flywheel continuously indexes their specific vocabulary. Under multi-turn topic drift this worsens: the vector backend produces drift scores of 0.43--0.50 on noise-critical turns where BM25 maintains 0.
27. 【2605.11272】Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank
Link: https://arxiv.org/abs/2605.11272
Authors: Suryaa Veerabathiran Seran,Ashwin Naresh Kumar,Tracy Holloway King,Jing Zheng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Adobe Express, Express is expanding, expanding internationally, interaction volume, disproportionately large content
Comments:
Abstract:Adobe Express is expanding internationally, but the US has a disproportionately large content supply and interaction volume. Learning-to-rank (LTR) models trained primarily on behavioral feedback inherit this imbalance: templates popular in the US are over-served in non-US locales. This cross-locale exposure bias suppresses local content discoverability and degrades ranking quality in growth locales. We show that click-only training suppresses semantically informative localization features. Adding vision-language model (VLM) graded relevance labels as auxiliary supervision alongside clicks improves semantic alignment but does not preserve local content visibility. We propose a multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting. Across five locales, the resulting model improves relevance while restoring stable localization, demonstrating the importance of disentangling exposure from semantic supervision.
28. 【2605.11254】MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
Link: https://arxiv.org/abs/2605.11254
Authors: Mehmet Deniz Türkmen,Suchana Datta,Dwaipayan Roy,Daniel Hienert,Philipp Mayr,Derek Greene
Categories: Information Retrieval (cs.IR)
Keywords: increasingly expect modern, diverse data sources, seamlessly retrieves information, Users increasingly expect, expect modern search
Comments: Accepted to SIGIR 2026. Resource Paper. 8 pages, 2 figures. DOI: [https://doi.org/10.1145/3805712.3808614](https://doi.org/10.1145/3805712.3808614)
Abstract:Users increasingly expect modern search systems to offer a unified interface that seamlessly retrieves information from diverse data sources and formats. However, current information retrieval (IR) evaluation benchmarks have not kept pace with this development, primarily due to the lack of test collections that represent the diversity of contemporary search domains. We address this critical gap with MIRA, a novel benchmark based on a large-scale social science search platform. MIRA is designed for category-aware ranking across heterogeneous categories - Publications, Research Data, Variables, and Instruments & Tools - within a single, unified evaluation framework. The proposed collection is distinctive in several ways: (1) it is built upon real user queries, providing a more realistic basis for evaluation; (2) it covers scholarly items from four distinct categories, enabling multi-faceted evaluation; and (3) it leverages a Large Language Model to generate topic descriptions and narratives, as well as for relevance assessment with respect to these topics, substantially reducing the labor and cost of test collection generation. We release this resource to benefit the community by providing a foundational testbed for the research on multi-faceted, category-aware, integrated, or cross-category information retrieval.
29. 【2605.11145】Debiasing Message Passing to Mitigate Popularity Bias in GNN-based Collaborative Filtering
Link: https://arxiv.org/abs/2605.11145
Authors: Md Aminul Islam,Ahmed Sayeed Faruk,Sourav Medya,Elena Zheleva
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: graph neural networks, achieve strong performance, Collaborative filtering, propagating user-item signals, neural networks
Abstract:Collaborative filtering (CF) models based on graph neural networks (GNNs) achieve strong performance in recommender systems by propagating user-item signals over interaction graphs. However, they are highly susceptible to popularity bias, since skewed interaction distributions and repeated message passing across high-order neighborhoods amplify the influence of popular items while suppressing long-tail ones. Existing debiasing approaches, including re-weighting objectives, regularization, causal methods, and post-processing, are less effective in GNN-based settings because they do not directly counteract bias propagated through the aggregation process, and recent in-aggregation weighting methods often rely on static heuristics or unstable embedding estimates. We propose Debiasing Popularity Amplification in Aggregation (DPAA), a popularity debiasing framework for GNN-based CF that integrates adaptive, embedding-aware interaction weighting and layer-wise weighting directly into message passing. DPAA assigns interaction-level weights from a representation-aware popularity signal, stabilized by a smooth transition from pre-trained to evolving model embeddings during training. It further introduces a layer-wise weighting that amplifies higher-order neighborhoods, surfacing long-range interactions with diverse and underexposed items. Experiments on real-world and semi-synthetic datasets show that DPAA outperforms state-of-the-art popularity-bias correction methods for GNN-based CF.
30. 【2605.11143】ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Link: https://arxiv.org/abs/2605.11143
Authors: Alex Stinard
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: measure clinical performance, Reasoning benchmarks measure, benchmarks measure clinical, percentage points, clean inputs
Comments: 46 pages including appendices (two-column preprint format). Under review at JAMIA. Code, frozen evaluator, and benchmark released at [this https URL](https://huggingface.co/datasets/alexstinard/epikg-clinicalbench) . ClinicalBench v2 is a 400-question MIMIC-IV stress test for assertion-aware retrieval
Abstract:Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
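The paired exact McNemar test named in the abstract above can be illustrated with a minimal sketch. The function name `mcnemar_exact` and the discordant-pair counts are hypothetical and are not taken from the paper; 15 and 4 were chosen only because they are consistent with a +22.0 point gain and p=0.0192 on 50 paired items.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: pairs where only the first system is correct
    c: pairs where only the second system is correct
    Under the null hypothesis each discordant pair is a fair coin flip,
    so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts (an assumption, not reported in the abstract)
p_value = mcnemar_exact(15, 4)
print(f"p = {p_value:.4f}")
```

Only discordant pairs enter the test, which is why a paired design can reach significance with as few as 50 items.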
31. 【2605.11118】A Cascaded Generative Approach for e-Commerce Recommendations
Link: https://arxiv.org/abs/2605.11118
Authors: Moein Hasani,Hamidreza Shahidi,Trace Levinson,Yuan Zhong,Guanghua Shu,Vinesh Gudla,Tejaswi Tenneti
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: large e-commerce marketplaces, fetch eligible products, Personalized storefronts, independent components, large e-commerce
Abstract:Personalized storefronts in large e-commerce marketplaces are often assembled from many independent components: static themes per page section ("placement"), retrieval systems to fetch eligible products per placement, and pointwise rankers to order content. While effective in optimizing for aggregate preferences, this paradigm is rigid and can limit personalization and semantic cohesion across the page. This makes it poorly suited to support dynamic objectives and merchandising requirements over time. To address this, we introduce a cascaded merchandising framework that decomposes storefront construction into two generative tasks: (i) placement-level theme generation and (ii) constrained keyword generation per placement to power product retrieval. Teacher-student fine-tuning is leveraged to improve scalability of this framework under production latency and cost constraints. Fine-tuned model ablations are shown to approach closed-weight LLM performance. We further contribute frameworks for AI-driven content evaluation and quality filtering, enabling safe and automated deployment of dynamic content at scale. Generative output is fused with traditional ranking models to preserve hybrid infrastructure. In online experiments, this framework yields an estimated +2.7% lift in cart adds per page view over a strong baseline.
32. 【2605.11017】Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
Link: https://arxiv.org/abs/2605.11017
Authors: Chao Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: fitting parametric functions, Behavioral curve modeling, fitting parametric, practice in recommendation, clinical dosing
Comments: Submitted to NeurIPS 2026
Abstract:Behavioral curve modeling -- fitting parametric functions to engagement-versus-exposure data -- is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson's paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at n* approximately 11 exposures while the aggregate peaks at n* approximately 34 -- a 3x gap driven by survival bias. Amazon Electronics (18M reviews) shows a 5.3x distortion. MovieLens-25M (D approximately 1) serves as a negative control, confirming that survival bias -- not aggregation per se -- is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.
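The survival-bias mechanism described in the abstract above can be reproduced in a toy setting. This is a minimal sketch, not the paper's method: the curve shape, the two user groups, and every number are hypothetical, chosen so that the effect is easy to verify by hand.

```python
def engagement(base: float, n: int, peak: int = 11, width: float = 30.0) -> float:
    """Inverted-U engagement curve; every user peaks at the same exposure count."""
    return base * (1.0 - ((n - peak) / width) ** 2)


# Two hypothetical user groups: (baseline engagement, last exposure before churn)
GROUPS = [(1.0, 12), (5.0, 40)]
EXPOSURES = range(1, 41)


def individual_peak(base: float, last_n: int) -> int:
    """Exposure count at which a single user's own curve is highest."""
    return max(range(1, last_n + 1), key=lambda n: engagement(base, n))


def aggregate_curve() -> dict[int, float]:
    """Mean engagement at each exposure count over users still active there."""
    curve = {}
    for n in EXPOSURES:
        active = [engagement(base, n) for base, last_n in GROUPS if n <= last_n]
        curve[n] = sum(active) / len(active)
    return curve


curve = aggregate_curve()
aggregate_peak = max(curve, key=curve.get)
individual_peaks = [individual_peak(base, last_n) for base, last_n in GROUPS]
print(individual_peaks, aggregate_peak)
```

Every user peaks at 11 exposures, yet the aggregate peaks at 13: once the low-engagement group leaves the sample after 12 exposures, the mean jumps because only high-engagement users remain.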
33. 【2605.10950】Continuous Flood Nowcasting in South Asia: A Multi-Sensor Ensemble Remote Sensing Framework for Flood Extent
Link: https://arxiv.org/abs/2605.10950
Authors: Usman Nazir,Disha Gomathinayagam,Muhammad Kamran,Sara Khalid
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Keywords: June and December, unusually severe flood, severe flood season, season between June, Google Earth Engine
Comments: Visualising Climate 2026
Abstract:Pakistan experienced an unusually severe flood season between June and December 2025, with cascading impacts on population, infrastructure, and agriculture. Existing operational flood products (e.g., UNOSAT) provide valuable episode-level snapshots but rarely deliver spatially and temporally continuous inundation maps at near-real-time latency within the country. We present a multi-sensor, ensemble-based remote-sensing framework for continuous flood nowcasting in Pakistan that integrates Sentinel-1 SAR, Harmonized Landsat-Sentinel (HLS L30 and S30), MODIS, and VIIRS observations on a harmonized grid in Google Earth Engine. The framework employs a tiered nowcasting ensemble that prioritizes higher-resolution sensors (Sentinel-1 and HLS) and falls back to MODIS and VIIRS when necessary, preserving daily continuity of flood extent at each sensor's native resolution. Applied to the 2025 monsoon period, the system generates near-real-time, spatially consistent inundation maps across Pakistan. As a nowcasting case study, we track the super-flood of 26 August-7 September 2025 day by day, demonstrating the framework's ability to capture the evolving flood footprint in near real time and extend beyond the temporal limits of episodic mapping products. Validation against GloFAS discharge anomalies and precipitation datasets (CHIRPS v3.0, MSWEP) shows strong agreement with observed hydrometeorological conditions. By integrating nowcast outputs with exposure layers (WorldPop, ESA WorldCover, Giga-HOTOSM), the framework enables rapid estimation of affected populations, cropland, and critical infrastructure, supporting timely disaster response and resilience planning in South Asia.
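The tiered fallback described in the abstract above can be sketched as a priority lookup. This is an illustration only: the function names and the availability data are hypothetical, and the ordering within each tier (Sentinel-1 before HLS, MODIS before VIIRS) is an assumption, since the abstract gives only the two tiers.

```python
from typing import Optional

# Higher-resolution tier first, as in the abstract.
# The order within each tier is an assumption.
SENSOR_PRIORITY = ["Sentinel-1", "HLS", "MODIS", "VIIRS"]


def pick_sensor(available: set[str]) -> Optional[str]:
    """Return the highest-priority sensor with a usable observation, if any."""
    for sensor in SENSOR_PRIORITY:
        if sensor in available:
            return sensor
    return None


def daily_sources(observations: dict[str, set[str]]) -> dict[str, Optional[str]]:
    """Map each date to the sensor that would supply that day's flood extent."""
    return {day: pick_sensor(sensors) for day, sensors in observations.items()}


# Hypothetical availability for three days (illustrative values only)
observations = {
    "2025-08-26": {"Sentinel-1", "MODIS"},
    "2025-08-27": {"MODIS", "VIIRS"},
    "2025-08-28": {"VIIRS"},
}
print(daily_sources(observations))
```

A day is left without a map only when no sensor has a usable observation, which is how a fallback chain of this kind preserves daily continuity.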
Computer Vision
1. 【2605.12501】Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Link: https://arxiv.org/abs/2605.12501
Authors: Miaosen Zhang,Xiaohan Zhao,Zhihong Tan,Zhou Huoshen,Yijia Fan,Yifan Yang,Kai Qiu,Bei Liu,Justin Wagle,Chenzhong Yin,Mingxi Cheng,Ji Li,Qi Dai,Chong Luo,Xu Yang,Xin Geng,Baining Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: automate on-screen work, Computer-use agents, automate on-screen, on-screen work, Computer-use
Abstract:Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at this https URL
2. 【2605.12500】SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Link: https://arxiv.org/abs/2605.12500
Authors: Haiwen Diao,Penghao Wu,Hanming Deng,Jiahao Wang,Shihao Bai,Silei Wu,Weichen Fan,Wenjie Ye,Wenwen Tong,Xiangyu Fan,Yan Li,Yubo Wang,Zhijie Cao,Zhiqian Lin,Zhitao Yang,Zhongang Cai,Yuwei Niu,Yue Zhu,Bo Liu,Chengguang Lv,Haojia Yu,Haozhe Xie,Hongli Wang,Jianan Fan,Jiaqi Li,Jiefan Lu,Jingcheng Ni,Junxiang Xu,Kaihuan Liang,Lianqiang Shi,Linjun Dai,Linyan Wang,Oscar Qian,Peng Gao,Pengfei Liu,Qingping Sun,Rui Shen,Ruisi Wang,Shengnan Ma,Shuang Yang,Siyi Xie,Siying Li,Tianbo Zhong,Xiangli Kong,Xuanke Shi,Yang Gao,Yongqiang Yao,Yves Wang,Zhengqi Bai,Zhengyu Lin,Zixin Yin,Wenxiu Sun,Ruihao Gong,Quan Wang,Lewei Lu,Lei Yang,Ziwei Liu,Dahua Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remain fundamentally constrained, misaligned representation spaces, Recent large vision-language, Recent large, cascaded pipelines
Comments: Project page: [this https URL](https://github.com/OpenSenseNova/SenseNova-U1)
Abstract:Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
3. 【2605.12498】EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
Link: https://arxiv.org/abs/2605.12498
Authors: Christen Millerdurai,Shaoxiang Wang,Yaxu Xie,Vladislav Golyanik,Didier Stricker,Alain Pagani
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: hand-centric manipulation tasks, practical egocentric interaction, manipulation tasks, compact and unobtrusive, crucial for practical
Comments: 23 pages, 19 figures and 10 tables; project page: [this https URL](https://dfki-av.github.io/EgoForce) (source code, data and demo available); SIGGRAPH 2026 Conference
Abstract:Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at this https URL.
4. 【2605.12497】From Web to Pixels: Bringing Agentic Search into Visual Perception
Link: https://arxiv.org/abs/2605.12497
Authors: Bokang Yang,Xinyi Sun,Kaituo Feng,Xingping Dong,Dongming Wu,Xiangyu Yue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: frozen model knowledge, connects high-level semantic, high-level semantic understanding, existing settings assume, perception connects high-level
Comments: Project page: [this https URL](https://pixel-searcher.github.io/)
Abstract:Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
5. 【2605.12496】CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
Link: https://arxiv.org/abs/2605.12496
Authors: Yihao Meng,Zichen Liu,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Yue Yu,Hanlin Wang,Haobo Li,Jiapeng Zhu,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen,Huamin Qu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: open-ended synthesis, Autoregressive, Abstract, video generation aims, generation
Comments: Project page: [this https URL](https://yihao-meng.github.io/CausalCine/)
Abstract:Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at this https URL
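The contrast between relevance-based and recency-based memory selection in the abstract above can be sketched as follows. This is a simplified illustration, not the paper's CAMR: the function names are hypothetical, and a plain dot product stands in for the attention-based relevance score.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def route_memory(query: list[float], keys: list[list[float]], budget: int) -> list[int]:
    """Return indices of the `budget` cached keys most relevant to `query`.

    Relevance is a plain dot product (an assumption standing in for an
    attention-based score). Indices come back in temporal order so the
    selected entries keep their original sequence.
    """
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:budget])


def recent_window(keys: list[list[float]], budget: int) -> list[int]:
    """Baseline: keep only the most recent `budget` entries."""
    return list(range(max(0, len(keys) - budget), len(keys)))


# Hypothetical cache of five entries; entry 0 is old but matches the query.
keys = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
print(route_memory(query, keys, budget=2))
print(recent_window(keys, budget=2))
```

With the same budget of two entries, relevance routing keeps the old but matching entry 0, while the recency window drops it. Both keep active memory bounded.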
6. 【2605.12495】AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
Link: https://arxiv.org/abs/2605.12495
Authors: Runhui Huang,Jie Wu,Rui Yang,Zhe Liu,Hengshuang Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Relative Policy Optimization, Group Relative Policy, applies Group Relative, Policy Optimization, AR-Diffusion Unified Multimodal
Comments: ICML2026
Abstract:In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: this https URL
7. 【2605.12494】Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
Link: https://arxiv.org/abs/2605.12494
Authors: Jiahe Li,Jiawei Zhang,Xiao Bai,Jin Zheng,Xiaohan Yu,Lin Gu,Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved impressive performance, bottlenecked existing approaches, Gaussian Splatting, recent years, strictly bottlenecked existing
Comments: Accepted at ICML 2026. Project page: [this https URL](https://fictionarry.github.io/AmbiSuR-Proj/)
Abstract:Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet the pervasive photometric ambiguities have strictly bottlenecked existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution upon Gaussian Splatting for the photometric ambiguity-robust surface 3D reconstruction with high performance. Starting by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Stemming from these, a photometric disambiguation is first introduced, constraining ill-posed geometry solution for definite surface formation. Then, we propose an ambiguity indication module that unleashes the self-indication potential to identify and further guide correcting underconstrained reconstructions. Extensive experiments demonstrate our superior surface reconstructions compared to existing methods across various challenging scenarios, excelling in broad compatibility. Project: this https URL .
8. 【2605.12491】Elastic Attention Cores for Scalable Vision Transformers
Link: https://arxiv.org/abs/2605.12491
Authors: Alan Z. Song,Yinjie Chen,Mu Nan,Rui Zhang,Jiahang Cao,Weijian Mai,Muquan Yu,Hossein Adeli,Deva Ramanan,Michael J. Tarr,Andrew F. Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: strong data-driven scaling, achieve strong data-driven, strong data-driven, Vision Transformers, VECA
Comments: Project repository here: [this https URL](https://github.com/alansong1322/VECA)
Abstract:Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
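The scaling argument in the abstract above can be checked with a small count of query-key score evaluations. This is a rough sketch under stated assumptions: the function names and the core count of 64 are hypothetical, and the factor of 2 assumes one pass in which cores read from patches and one in which patches read from cores.

```python
def full_attention_scores(n_patches: int) -> int:
    """Query-key scores for all-to-all self-attention over N patches."""
    return n_patches * n_patches


def core_attention_scores(n_patches: int, n_cores: int) -> int:
    """Scores when patches interact only through C cores.

    Assumes one pass where cores read from patches and one where patches
    read from cores, hence the factor of 2.
    """
    return 2 * n_patches * n_cores


N_CORES = 64  # hypothetical core count, fixed across resolutions
for side in (14, 28, 56):  # patches per image side
    n = side * side
    print(n, full_attention_scores(n), core_attention_scores(n, N_CORES))
```

Doubling the image side quadruples $N$: the all-to-all count grows 16-fold, while the core-routed count grows only 4-fold because $C$ stays fixed.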
9. 【2605.12480】OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
Link: https://arxiv.org/abs/2605.12480
Authors: Guohui Zhang,XiaoXiao Ma,Jie Huang,Hang Xu,Hu Yu,Siming Fu,Yuming Li,Zeyue Xue,Lin Song,Haoyang Huang,Nan Duan,Feng Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: strong per-modality fidelity, real-world applications demand, applications demand strong, demand strong per-modality, Recent advances
Comments: Project page: [this https URL](https://zghhui.github.io/OmniNFT/)
Abstract:Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
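The difference between a single global advantage and per-modality advantages, as contrasted in the abstract above, can be seen in a toy group of samples. This is a minimal sketch, not the paper's implementation: the helper `group_advantages` and all reward values are hypothetical, and the group normalization used here is the common mean/std form.

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean, std = fmean(rewards), pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]


# Hypothetical rewards for a group of three sampled audio-video clips
audio_rewards = [1.0, 0.0, 0.5]
video_rewards = [0.0, 1.0, 0.5]

# Single global advantage computed from the summed reward
global_adv = group_advantages([a + v for a, v in zip(audio_rewards, video_rewards)])

# Modality-wise routing: each branch receives its own advantage
audio_adv = group_advantages(audio_rewards)
video_adv = group_advantages(video_rewards)
print(global_adv, audio_adv, video_adv)
```

In this group the summed rewards are identical, so the global advantage is zero for every sample even though the first clip has the best audio and the second the best video. Keeping the advantages separate preserves both signals.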
10. 【2605.12451】FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation
Link: https://arxiv.org/abs/2605.12451
Authors: Nicholas Ikechukwu,Keanu Nichols,Deepti Ghadiyaram,Bryan A. Plummer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Continual Panoptic Segmentation, Continual Panoptic, Panoptic Segmentation, quickly adapt, Continual
Abstract:Continual Panoptic Segmentation (CPS) requires methods that can quickly adapt to new categories over time. The nature of this dense prediction task means that training images may contain a mix of labeled and unlabeled objects. As nothing is known about these unlabeled objects a priori, existing methods often simply group any unlabeled pixel into a single "background" class during training. In effect, during training, they repeatedly tell the model that all the different background categories are the same (even when they aren't). This makes learning to identify different background categories as they are added challenging since these new categories may require using information the model was previously told was unimportant and ignored. Thus, we propose a Future-Targeted Contrastive and Repulsive (FuTCR) framework that addresses this limitation by restructuring representations before new classes are introduced. FuTCR first discovers confident future-like regions by grouping model-predicted masks whose pixels are consistently classified as background but exhibit non-background logits. Next, FuTCR applies pixel-to-region contrast to build coherent prototypes from these unlabeled regions, while simultaneously repelling background features away from known-class prototypes to explicitly reserve representational space for future categories. Experiments across six CPS settings and a range of dataset sizes show FuTCR improves relative new-class panoptic quality over the state-of-the-art by up to 28%, while preserving or improving base-class performance with gains up to 4%.
11. 【2605.12449】LychSim: A Controllable and Interactive Simulation Framework for Vision Research
Link: https://arxiv.org/abs/2605.12449
Authors: Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduced vision systems', vision systems' reliance, optimization and rigorous, self-supervised pretraining, pretraining has reduced
Comments: 3D-LLM/VLA Workshop at CVPR 2026. Project page: [this https URL](https://lychsim.github.io/)
Abstract:While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.
12. 【2605.12437】3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark
Link: https://arxiv.org/abs/2605.12437
Authors: Yunxiao Zhang, Suryansh Kumar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: view synthesis, dynamic, fundamental to applications, NVS, Gaussian Splatting
Comments: Accepted for publication at CVPR 2026; 4D World Models Workshop. Draft info: 14 pages, 4 figures, 8 tables
Abstract:Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore explicit temporal coupling or complex multi-body constraints seem unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scenes. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.
13. 【2605.12431】GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
Link: https://arxiv.org/abs/2605.12431
Authors: Huiran Duan, Qian Zhou, Zhongliang Guo, Junhao Dong, Yuqi Li, Guoying Zhao, Yingli Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: introduce spatiotemporal distortions, Conventional gait de-identification, provide insufficient identity, insufficient identity suppression, structure-sensitive downstream applications
Comments: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
Abstract:Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
14. 【2605.12430】AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
Link: https://arxiv.org/abs/2605.12430
Authors: Joaquín Figueira, Rob Van Gastel, Giacomo D'Amicantonio, Zhuoran Liu, Ioan Gabriel Bucur, Faysal Boughorbel, Egor Bondarev
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: automated optical inspection, wire-bonded semiconductors, models in automated, automated optical, typically device-specific
Comments: Accepted to the AI4RWC Workshop at CVPR 2026
Abstract:Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors that combines small-domain self-supervised pre-training of vision transformers with in-context inference that minimizes the need for labeled examples. We pre-train SOTA self-supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with the more complex attention-based aggregation currently used in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval-based segmentation outperforms fine-tuning when targeting single device images, allowing for near-instant adaptation to difficult samples.
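The "simple similarity-based retrieval" mentioned in the AOI-SSL abstract can be pictured as nearest-neighbor label transfer between patch embeddings. The following is a minimal sketch under stated assumptions: cosine similarity with a k-nearest majority vote is an illustrative choice, the 2-D vectors stand in for dense encoder embeddings, and `retrieve_labels` is a hypothetical name rather than the authors' API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_labels(query_patches, support_patches, support_labels, k=1):
    """Label each query patch by majority vote over its k most similar support patches."""
    predictions = []
    for q in query_patches:
        ranked = sorted(range(len(support_patches)),
                        key=lambda i: cosine(q, support_patches[i]),
                        reverse=True)
        votes = {}
        for i in ranked[:k]:
            votes[support_labels[i]] = votes.get(support_labels[i], 0) + 1
        predictions.append(max(votes, key=votes.get))
    return predictions

# Toy embeddings: labeled support patches from one annotated image.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["wire", "wire", "background", "background"]
queries = [[0.8, 0.2], [0.2, 0.7]]
print(retrieve_labels(queries, support, labels, k=3))  # ['wire', 'background']
```

Because no weights are updated, adapting to a new device in this sketch only means swapping the support set, which is consistent with the near-instant adaptation the abstract describes.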
15. 【2605.12413】Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
Link: https://arxiv.org/abs/2605.12413
Authors: Yuangong Chen (1), Wai Keung Wong (1), Jiaxing Li (2), Ioannis Patras (3), Xu Zheng (3 and 4) ((1) The Hong Kong Polytechnic University, (2) Guangzhou University, (3) Queen Mary University of London, (4) HKUST (Guangzhou))
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, show strong visual
Comments:
Abstract:Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
16. 【2605.12399】GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
Link: https://arxiv.org/abs/2605.12399
Authors: Xiao Cao, Yuze Li, Youmin Zhang, Jiayu Song, Cheng Yan, Wen Li, Lixin Duan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, prominent paradigm, Gaussian, Splatting, Cross-view Attention
Comments: Accepted to SIGGRAPH 2026 Conference Track
Abstract:3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.
17. 【2605.12389】SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
Link: https://arxiv.org/abs/2605.12389
Authors: Luke James Miller, Yugyung Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: extreme class imbalance, coupling computational cost, Segmenting small, attenuating boundary evidence, boundary evidence precisely
Comments: 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices
Abstract:Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
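The idea of contracting a grid graph into a smaller minor while keeping an exact lifting map back to pixels can be sketched with a union-find structure. This is only an illustration of edge contraction: the fixed intensity-difference threshold below is an assumption standing in for SEMIR's learned, boundary-aligned parameters, node and edge deletion are not shown, and `contract_edges` and `lift` are hypothetical names.

```python
def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def contract_edges(num_pixels, edges, weights, threshold):
    """Contract every edge lighter than threshold; return the pixel -> minor-node map."""
    parent = list(range(num_pixels))
    for (u, v), w in zip(edges, weights):
        if w < threshold:
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[rv] = ru
    node_ids, lifting = {}, []
    for p in range(num_pixels):
        root = find(parent, p)
        node_ids.setdefault(root, len(node_ids))
        lifting.append(node_ids[root])
    return lifting

def lift(lifting, node_labels):
    """Exact decoding: every pixel inherits the label predicted for its minor node."""
    return [node_labels[n] for n in lifting]

# A 1-D strip of six pixels; edge weight is the absolute intensity difference.
intensity = [0.10, 0.12, 0.11, 0.90, 0.88, 0.91]
edges = [(i, i + 1) for i in range(len(intensity) - 1)]
weights = [abs(intensity[u] - intensity[v]) for u, v in edges]

lifting = contract_edges(len(intensity), edges, weights, threshold=0.1)
print(lifting)                # [0, 0, 0, 1, 1, 1]
print(lift(lifting, [0, 1]))  # [0, 0, 0, 1, 1, 1]
```

Inference then runs on two minor nodes instead of six pixels, and the lifting map recovers a full-resolution labeling without interpolation.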
18. 【2605.12377】Fast Image Super-Resolution via Consistency Rectified Flow
Link: https://arxiv.org/abs/2605.12377
Authors: Jiaqi Xu, Wenbo Li, Haoze Sun, Fan Li, Zhixin Wang, Long Peng, Jingjing Ren, Haoran Yang, Xiaowei Hu, Renjing Pei, Pheng-Ann Heng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable success, time-consuming multi-step sampling, multi-step sampling largely, sampling largely hinders, real-world image super-resolution
Comments: Accepted by ICCV 2025
Abstract:Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.
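To make the "rectified flow from LR to HR" formulation concrete, the sketch below uses the standard rectified-flow construction: a straight-line path between the two endpoints with a constant target velocity. It is an assumption that FlowSR follows this textbook form, and the consistency distillation, HR regularization, and fast-slow scheduling from the abstract are not reproduced. The names `interpolate`, `target_velocity`, `euler_sample`, and `oracle_velocity` are hypothetical, and short lists stand in for images.

```python
def interpolate(lr, hr, t):
    """Point on the straight path x_t = (1 - t) * x_LR + t * x_HR."""
    return [(1 - t) * a + t * b for a, b in zip(lr, hr)]

def target_velocity(lr, hr):
    """Regression target for the velocity field on a straight path: x_HR - x_LR."""
    return [b - a for a, b in zip(lr, hr)]

def euler_sample(lr, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(lr), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

lr = [0.25, 0.5]    # stand-in for an (upsampled) low-resolution image
hr = [0.75, 0.125]  # stand-in for the high-resolution target

def oracle_velocity(x, t):
    """A perfect velocity model, used here in place of a trained network."""
    return target_velocity(lr, hr)

print(interpolate(lr, hr, 0.5))                     # midpoint of the path
print(euler_sample(lr, oracle_velocity, steps=1))   # one step reaches hr
print(euler_sample(lr, oracle_velocity, steps=4))   # four steps reach hr too
```

With a perfectly straight path, one Euler step lands on the same point as many small steps, which is the property that makes single-step sampling plausible once a learned velocity field is close to straight.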
19. 【2605.12374】Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Link: https://arxiv.org/abs/2605.12374
Authors: Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: avoiding external tools, multimodal large language, create intermediate visual, large language model, create intermediate
Comments:
Abstract:Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume [xie2025mhc, li2026siamesenorm, team2026attention]. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
20. 【2605.12325】VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
Link: https://arxiv.org/abs/2605.12325
Authors: Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, Siyue Yu, Feng Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Pursuing training-free open-vocabulary, Pursuing training-free, bias in CLIP, deep-seated spatial bias, training-free open-vocabulary semantic
Comments: Accepted by ICML 2026
Abstract:Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware this http URL framework to facilitate more efficient and high-quality dense prediction. While this http URL exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in this http URL, unleashing its potential for fine-grained object perception. To this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: (1) surpasses the leading methods by 1.4% to 8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) incurs only marginal inference time and memory overhead. Our code is publicly available on GitHub at this https URL.
21. 【2605.12320】Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos
Link: https://arxiv.org/abs/2605.12320
Authors: Luca Parolari, Pietro Gori, Lamberto Ballan, Carlo Biffi, Loic Le Folgoc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: AI-assisted colonoscopy applications, Learning robust representations, enabling multiple AI-assisted, key to enabling, characterization to automated
Comments: Accepted to MICCAI 2026
Abstract:Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at this https URL.
22. 【2605.12309】G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
Link: https://arxiv.org/abs/2605.12309
Authors: Junxian Li, Kai Liu, Zizhong Ding, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: separate-encoder Unified multimodal, Unified multimodal models, Unified multimodal, separate-encoder Unified, rapidly growing inference
Comments: Code is at: [this https URL](https://github.com/lijunxian111/G2TR)
Abstract:The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores or text-image similarity, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with the VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
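The "select, then merge into retained representatives" step in the G$^2$TR abstract can be sketched generically. The code below assumes the importance scores are already computed (the abstract derives them from consistency with the VAE latent, which is not reproduced here), uses plain top-k selection rather than the paper's balanced selection, and merges by averaging with dot-product similarity as an illustrative choice. `reduce_tokens` is a hypothetical name.

```python
def dot(a, b):
    """Dot-product similarity between two token vectors."""
    return sum(x * y for x, y in zip(a, b))

def reduce_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and average each dropped token
    into the retained token it is most similar to."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])            # preserve original token order
    groups = {i: [tokens[i]] for i in kept}
    for j in order[keep:]:
        nearest = max(kept, key=lambda i: dot(tokens[i], tokens[j]))
        groups[nearest].append(tokens[j])
    merged = []
    for i in kept:
        members = groups[i]
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged

# Four toy visual tokens with precomputed (assumed) importance scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 3.0]]
scores = [0.9, 0.8, 0.1, 0.2]
print(reduce_tokens(tokens, scores, keep=2))  # [[2.0, 0.0], [0.0, 2.0]]
```

Merging instead of discarding means the two low-score tokens still influence the retained representatives, which is the information-loss argument the abstract makes.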
23. 【2605.12306】KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
Link: https://arxiv.org/abs/2605.12306
Authors: Minjong Cheon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: apply uniform penalties, existing regularization methods, Catastrophic forgetting remains, continual learning framework, continual learning
Comments:
Abstract:Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
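Importance-weighted anchoring at per-knot granularity can be read as an EWC-style quadratic penalty applied to individual spline coefficients. The sketch below assumes that form, with per-knot importance values supplied from outside; how KAN-CL actually estimates importance is not stated in the abstract. `per_knot_penalty` and `per_knot_penalty_grad` are hypothetical names.

```python
def per_knot_penalty(coeffs, anchors, importance, strength=1.0):
    """0.5 * strength * sum_k importance[k] * (coeffs[k] - anchors[k]) ** 2."""
    return 0.5 * strength * sum(
        w * (c - a) ** 2 for c, a, w in zip(coeffs, anchors, importance)
    )

def per_knot_penalty_grad(coeffs, anchors, importance, strength=1.0):
    """Gradient of the penalty with respect to each spline coefficient."""
    return [strength * w * (c - a) for c, a, w in zip(coeffs, anchors, importance)]

# Spline coefficients of one KAN edge after task 1 (anchors) and during task 2.
anchors = [1.0, 2.0, 3.0]
importance = [4.0, 0.0, 1.0]  # assumed: knot 0 mattered most, knot 1 was unused
coeffs = [1.5, 5.0, 3.0]

print(per_knot_penalty(coeffs, anchors, importance))       # 0.5
print(per_knot_penalty_grad(coeffs, anchors, importance))  # [2.0, 0.0, 0.0]
```

The second coefficient moves far from its anchor at no cost because its importance is zero, which illustrates how spline locality leaves unused input regions free for new tasks.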
24. 【2605.12305】Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
Link: https://arxiv.org/abs/2605.12305
Authors: Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing methods struggle, recent advancements, struggle to maintain, texttt, enabled image generation
Comments:
Abstract:While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
25. 【2605.12303】From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
Link: https://arxiv.org/abs/2605.12303
Authors: Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein, Gerhard Satzger
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: High-quality labeled data, scale remains expensive, training robust machine, robust machine learning, High-quality labeled
Comments:
Abstract:High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at this https URL.
26. 【2605.12297】EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
Link: https://arxiv.org/abs/2605.12297
Authors: Luming Wang, Hao Shi, Jiajun Zhai, Kailun Yang, Kaiwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: virtual reality, human-computer interaction, immersive augmented, essential for immersive, gesture recognition
Comments: Extended version of SMC 2025 paper [arXiv:2503.12419](https://arxiv.org/abs/2503.12419). The established dataset and source code will be publicly released at [this https URL](https://github.com/ZJUWang01/EgoEV-HandPose)
Abstract:Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at this https URL.
27. 【2605.12282】Large-Small Model Collaboration for Farmland Semantic Change Detection
Link: https://arxiv.org/abs/2605.12282
Authors: Xinjia Li, Rui Wang, Qiurong Peng, Lingfei Ye, Dengrong Zhang, Haoyu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: farmland conversion monitoring, cultivated land protection, models remain insufficient, fine-grained farmland conversion, Change Detection
Comments:
Click to view abstract
Abstract:Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at this https URL.
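For a single binary change class, F1 and IoU are tied by the identity F1 = 2·IoU / (1 + IoU), which gives a quick consistency check on the reported numbers. The sketch below (helper names are assumptions) applies it to the LEVIR-CD and WHU-CD IoU values from the abstract. The identity only holds for one class computed from one set of counts, so scores averaged over classes or images need not satisfy it.

```python
def iou_and_f1(tp, fp, fn):
    """IoU and F1 of the change class from pixel counts."""
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

def f1_from_iou(iou):
    """Binary-class identity: F1 = 2 * IoU / (1 + IoU)."""
    return 2.0 * iou / (1.0 + iou)

# IoU values reported in the abstract, expressed as fractions.
levir_f1 = round(100 * f1_from_iou(0.8421), 2)  # 91.43, matching the reported F1
whu_f1 = round(100 * f1_from_iou(0.8841), 2)    # 93.85, matching the reported F1
```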
28. 【2605.12271】Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Link: https://arxiv.org/abs/2605.12271
Authors: Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, Raymond H. Chan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: typography sheets, annotated scenes, visual artifacts, Humans, visual
Comments: Project Page: [this https URL](https://yaofang-liu.github.io/V2V_Web)
Click to view abstract
Abstract:Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.
29. 【2605.12266】CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts
Link: https://arxiv.org/abs/2605.12266
Authors: Matteo Ballegeer, Toon Van Camp, Willem Jaspers, Alp Bayar, Aung Nyein Soe, Martin Roelfs, Dries F. Benoit, Bieke Decraemer, Joost R. Duflou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: represented as Boundary, Boundary Representations, CAD models represented, topological connectivity, geometry and topological
Comments:
Click to view abstract
Abstract:Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results demonstrate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.
30. 【2605.12259】From Image Hashing to Scene Change Detection
Link: https://arxiv.org/abs/2605.12259
Authors: Anh-Kiet Duong, Marie-Claire Iatrides, Petra Gomez-Krämer, Jean-Michel Carozza
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: inherently limited, scene change detection, detection, hashing, compact representations
Comments: 18 pages; accepted to ICPR 2026
Click to view abstract
Abstract:Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
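The idea of comparing patch-wise hash codes directly in Hamming space can be illustrated with a small Python sketch. This is not the authors' learned model: the codes below are hand-written integers rather than network outputs, plain XOR stands in for the paper's "XOR-like" aggregation, and the function names and threshold are assumptions made for illustration.

```python
from functools import reduce
from operator import xor

def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")

def global_code(patch_codes):
    """Fold all patch codes into one code with XOR (assumed aggregation)."""
    return reduce(xor, patch_codes, 0)

def changed_patches(old_codes, new_codes, threshold=2):
    """Indices of aligned patches whose codes differ by at least `threshold` bits."""
    return [i for i, (a, b) in enumerate(zip(old_codes, new_codes))
            if hamming(a, b) >= threshold]

# Stored codes for the earlier image and fresh codes for the current one.
old = [0b10110010, 0b01011100, 0b11110000, 0b00001111]
new = [0b10110010, 0b01011101, 0b00001111, 0b00001111]

scene_changed = global_code(old) != global_code(new)  # cheap global check
where = changed_patches(old, new)                     # localisation per patch
```

Only the stored codes of the earlier image are needed, so the earlier image itself never has to be re-encoded. Note that plain XOR folding can cancel out identical changes in two patches, which is one reason a real system would likely need something more robust than this toy aggregation.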
31. 【2605.12252】H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows
Link: https://arxiv.org/abs/2605.12252
Authors: Mubashara Rehman, Niki Martinel, Michele Avanzo, Riccardo Spizzo, Christian Micheloni
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severely degrade image, compromising diagnostic accuracy, degrade image quality, computed tomography, severely degrade
Comments: Accepted for publication at the 28th International Conference on Pattern Recognition, Lyon, France, August 17-22, 2026
Click to view abstract
Abstract:Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In the second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from the full dataset, indicating effective metal artifact suppression and anatomical preservation and highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.
32. 【2605.12237】UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
Link: https://arxiv.org/abs/2605.12237
Authors: Shuo Ni, Tong Wang, Jing Zhang, He Chen, Haonan Guo, Ning Zhang, Bo Du
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Earth observation imagery, large-scale scene context, severe scale mismatch, Earth observation, increasingly operate
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at this https URL.
33. 【2605.12220】TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
Link: https://arxiv.org/abs/2605.12220
Authors: Mohammad Khoshkdahan, Alexey Vinel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: Safe autonomous agents, vulnerable road users, Safe autonomous, road users, autonomous agents
Comments: Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Click to view abstract
Abstract:Safe autonomous agents and mobile robots need fast real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's-eye-view (BEV) encoding, which maps the full 3D LiDAR point cloud into a lightweight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re-bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On the KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.
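The interquartile range (IQR) filter mentioned in this abstract can be sketched in a few lines of Python. The abstract does not say which point attribute is filtered or which fence multiplier is used, so the choice of point heights and of the conventional Tukey factor `k=1.5` are assumptions, as are the function name and the sample values.

```python
import statistics

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; drop the rest as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical heights (metres) of LiDAR points inside one detected box;
# the last value stands in for a stray return far above a pedestrian.
heights = [0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 9.0]
kept = iqr_filter(heights)  # the 9.0 m point falls outside the fences and is removed
```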
34. 【2605.12218】Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction
Link: https://arxiv.org/abs/2605.12218
Authors: Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: online high-definition, derived from multi-camera, central interface, interface for online, BEV
Comments:
Click to view abstract
Abstract:Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9 mAP in the standard $60\times30\,\mathrm{m}$ region and +9.9 mAP in the extended $100\times50\,\mathrm{m}$ setting, corresponding to a 44% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.
35. 【2605.12198】Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation
Link: https://arxiv.org/abs/2605.12198
Authors: Xinhao Hu, Yiyi Zhang, Liqing Zhang, Jianfu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Pedestrian motion, testing data distributions, domain gaps arising, causal nature, strongly influenced
Comments:
Click to view abstract
Abstract:Pedestrian motion, due to its causal nature, is strongly influenced by domain gaps arising from discrepancies between training and testing data distributions. Focusing on 3D human pose estimation, this work presents a controllable human pose generation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. This generative augmentation enriches training datasets, enhances model generalization, and alleviates the limitations of existing methods in handling domain discrepancies. By leveraging both indoor/real-world and outdoor/virtual datasets, we perform cross-domain data fusion and controllable video generation to construct enriched training data, tailored to realistic deployment settings. Extensive experiments show that the augmented datasets significantly improve model performance on unseen scenarios and datasets, validating the effectiveness of the proposed approach.
36. 【2605.12179】SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
Link: https://arxiv.org/abs/2605.12179
Authors: Xin Cheng, Xihua Wang, Ying Ba, Yuyue Wang, Kaisi Guan, Yinbo Wang, Wenpu Li, Ruihua Song
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable success, Recent advancements, semantic correspondence, video-audio joint generation, advancements in video-audio
Comments: Preprint. Under review
Click to view abstract
Abstract:Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative by introducing explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of video-audio (V-A) joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at this https URL.
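A rough Python sketch of the three ingredients described in this abstract: a rule-based temporal distortion that produces a negative sample, a curriculum that moves from coarse to subtle misalignment, and a preference loss. The abstract does not list the actual distortion rules, so the circular audio shift, the linear schedule, and all names and default values here are hypothetical. The loss is the generic DPO objective on log-likelihoods, whereas the paper may use a variant suited to its generative backbone.

```python
import math

def shift_audio(audio_frames, shift):
    """Circularly shift audio by `shift` frames to create a misaligned negative."""
    shift %= len(audio_frames)
    if shift == 0:
        return list(audio_frames)
    return audio_frames[-shift:] + audio_frames[:-shift]

def curriculum_shift(step, total_steps, max_shift=8, min_shift=1):
    """Large (easy to spot) shifts early in training, small (hard) ones late."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return round(max_shift - frac * (max_shift - min_shift))

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (aligned, misaligned) preference pair."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

audio = ["a0", "a1", "a2", "a3"]       # placeholder audio frames
negative = shift_audio(audio, 1)       # same clip, audio delayed by one frame
early = curriculum_shift(0, 100)       # coarse misalignment at the start
late = curriculum_shift(100, 100)      # subtle misalignment at the end
```

Because the negative is derived from the aligned pair by a fixed rule, no extra sampling or ranking pass is needed to build the preference pair, which is the efficiency argument the abstract makes.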
37. 【2605.12169】UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
Link: https://arxiv.org/abs/2605.12169
Authors: Sihan Chen, Xiang Zhang, Yang Zhang, Tunc Aydin, Christopher Schroers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generative models, diffusion-based approaches, recent surge, surge of generative, view synthesis tasks
Comments:
Click to view abstract
Abstract:With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.
38. 【2605.12167】From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
Link: https://arxiv.org/abs/2605.12167
Authors: Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: predicting long-horizon future, execution remains challenging, generation models offer, promising imagination mechanism, remains challenging
Comments: ICML 2026
Click to view abstract
Abstract:Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
39. 【2605.12163】Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
Link: https://arxiv.org/abs/2605.12163
Authors: Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: thought consistently yield, Information Gain Collapse, chains of thought, thought consistently, consistently yield
Comments: 17 pages, 6 figures
Click to view abstract
Abstract:In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.
40. 【2605.12145】Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
Link: https://arxiv.org/abs/2605.12145
Authors: Souptik Sen, Raneen Younis, Zahra Ahmadi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse sensory sources, current approaches struggle, balance cross-modal generalizability, sensory sources, seeks to integrate
Comments:
Click to view abstract
Abstract:Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
41. 【2605.12144】PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization
Link: https://arxiv.org/abs/2605.12144
Authors: Yanan Zhou, Zhaoyan Qian, Yanli Li, Nan Yang, Zhongliang Guo, Dong Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Absolute Pose Regression, fine-tuning data quality, camera pose inference, enables real-time, Absolute Pose
Comments:
Click to view abstract
Abstract:In visual localization, Absolute Pose Regression (APR) enables real-time 6-DoF camera pose inference from single images, yet critically depends on fine-tuning data quality and coverage. While recent methods leverage 3D Gaussian Splatting (3DGS) for novel view synthesis-based data augmentation, random sampling generates redundant views and noisy samples from poorly reconstructed regions. To mitigate this research gap, we propose PoseCompass, an intelligent pose selection pipeline for 3DGS-based APR. PoseCompass formulates synthetic pose selection and derives a value-based pose ranking mechanism to identify informative poses. The ranking integrates three dimensions: Localization Difficulty, favoring challenging regions; Coverage Novelty, exploring under-sampled areas; and Rendering Observability, filtering artifacts and noise. PoseCompass then generates trajectory-constrained candidates, selects the top-K ranked poses, and synthesizes views using 3DGS with lightweight diffusion-based alignment. Finally, the pose regressor is fine-tuned on mixed real and synthetic data. We evaluate PoseCompass on 7-Scenes, where it reduces adaptation time from 15.2 to 5.1 minutes, a 3x speedup, while cutting median pose errors by 53.8 percent and significantly outperforming random baselines.
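The value-based ranking and top-K selection described in this abstract might look roughly like the Python sketch below. The abstract names the three ranking dimensions but not how they are combined, so the weighted sum, the equal weights, the score ranges, and all identifiers are assumptions for illustration only. The last line simply checks the "3x speedup" arithmetic from the reported adaptation times.

```python
def select_top_k(candidates, k, weights=(1.0, 1.0, 1.0)):
    """Rank candidate poses by a weighted sum of three scores and keep the best k."""
    w_diff, w_nov, w_obs = weights

    def value(pose):
        return (w_diff * pose["difficulty"]
                + w_nov * pose["novelty"]
                + w_obs * pose["observability"])

    return sorted(candidates, key=value, reverse=True)[:k]

# Hypothetical candidates with scores in [0, 1] for each dimension.
candidates = [
    {"id": "p0", "difficulty": 0.9, "novelty": 0.8, "observability": 0.2},
    {"id": "p1", "difficulty": 0.6, "novelty": 0.7, "observability": 0.9},
    {"id": "p2", "difficulty": 0.2, "novelty": 0.1, "observability": 0.9},
    {"id": "p3", "difficulty": 0.8, "novelty": 0.5, "observability": 0.8},
]
chosen = [pose["id"] for pose in select_top_k(candidates, k=2)]  # ["p1", "p3"]

speedup = 15.2 / 5.1  # about 2.98, i.e. the roughly 3x speedup quoted above
```

In this toy data, `p0` is hard and novel but poorly observable, so it loses to poses that balance all three terms. That is the behaviour the Rendering Observability term is described as providing.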
42. 【2605.12140】EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion
Link: https://arxiv.org/abs/2605.12140
Authors: Md Abulkalam Azad, Vegard Holmstrøm, John Nyberg, Lasse Lovstakken, Håvard Dalen, Bjørnar Grenne, Andreas Østvik
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Myocardial point tracking, estimation in echocardiography, driven by advances, myocardial motion fundamentally, recently emerged
Comments: Early accepted (top 9%) to MICCAI 2026
Click to view abstract
Abstract:Myocardial point tracking (MPT) has recently emerged as a promising direction for motion estimation in echocardiography, driven by advances in general-purpose point tracking methods. However, myocardial motion fundamentally differs from motion encountered in natural videos, as it arises from physiologically constrained deformation that is spatially and temporally continuous throughout the cardiac cycle. Consequently, motion trajectories typically remain locally confined despite substantial tissue deformation. Motivated by these properties, we revisit the architectural design for MPT and find that coarse initialization in commonly used two-stage coarse-to-fine architectures may be unnecessary in this domain. In this work, we propose a fine-stage-only architecture, EchoTracker2, which enriches pixel-precise features with local spatiotemporal context and integrates them with long-range joint temporal reasoning for robust tracking. Experimental results across in-distribution, out-of-distribution (OOD), and public synthetic datasets show that our model improves position accuracy by 6.5% and reduces median trajectory error by 12.2% relative to a domain-specific state-of-the-art (SOTA) model. Compared to the best general-purpose point tracking method, the improvements are 2.0% and 5.3%, respectively. Moreover, EchoTracker2 shows better agreement with expert-derived global longitudinal strain (GLS) and enhances test-rest reproducibility. Source code will be available at: this https URL.
43. 【2605.12138】Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Link: https://arxiv.org/abs/2605.12138
Authors: Yexing Xu, Wei Feng, Shen Zhang, Haohan Wang, Yuxin Qin, Yaoyu Li, Ao Ma, Yuhao Luo, Lu Wang, Xudong Ren, Haoran Wang, Run Ling, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Longguang Wang, Yulan Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: challenge in e-commerce, realistic and user-preferred, key challenge, Generating realistic, Unified Advertisement Generative
Comments: 22 pages, 19 figures, CVPR 2026
Click to view abstract
Abstract:Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at this https URL.
44. 【2605.12134】MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
Link: https://arxiv.org/abs/2605.12134
Authors: Sonali Godavarthy, Matthias Neuwirth-Trapp, Tim-Felix Faasch, Maarten Bieshaar, Michael Moeller, Danda Pani Paudel
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: models produce high-quality, text ambiguity hinders, ambiguity hinders precise, hinders precise control, produce high-quality images
Comments: Accepted at ICPR 2026
Abstract:Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. A number of recent works address learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show the limitations of current approaches in this regime. We therefore propose a new method, Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.
45. 【2605.12122】Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
Link: https://arxiv.org/abs/2605.12122
Authors: Hyeonjin Kim, Hangyeol Jung, Heechan Yun, Sungjun Yun, Dong-Jun Han
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: undesirable content generation, preventing undesirable content, content generation, increasingly important, important for preventing
Comments: 40 pages, 23 figures
Abstract:Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
46. 【2605.12119】MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
Link: https://arxiv.org/abs/2605.12119
Authors: Haofeng Liu, Yang Zhou, Ziheng Wang, Zhengbo Xu, Zhan Peng, Jie Ma, Jun Liang, Shengfeng He, Jing Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: offer visual fidelity, lack geometric correspondence, priors provide spatial, priors offer visual, provide spatial alignment
Comments: Project page: [this https URL](https://orange-3dv-team.github.io/MoCam)
Abstract:Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. It first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.
47. 【2605.12112】When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Link: https://arxiv.org/abs/2605.12112
Authors: Xiaofeng Tan, Jun Liu, Bin-Bin Gao, Yuanting Fan, Xi Jiang, Chengjie Wang, Hongsong Wang, Feng Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: RLHF is widely, align flow-matching, human preferences, leads to severe, severe diversity collapse
Comments:
Abstract:RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (this https URL) is publicly available.
48. 【2605.12090】World Action Models: The Next Frontier in Embodied AI
Link: https://arxiv.org/abs/2605.12090
Authors: Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved strong semantic, strong semantic generalization, World Action Models, physical world evolves, embodied policy learning
Comments:
Abstract:Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
49. 【2605.12088】UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
Link: https://arxiv.org/abs/2605.12088
Authors: Yiyan Xu, Qiulin Wang, Wenjie Wang, Yunyao Mao, Xintao Wang, Pengfei Wan, Kun Gai, Fuli Feng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: faithfully preserving subject, preserving subject identities, multiple reference images, aims to synthesize, faithfully preserving
Comments:
Abstract:Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
50. 【2605.12077】The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
Link: https://arxiv.org/abs/2605.12077
Authors: Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: vision research community, computer vision research, increasingly popular task, research community, popular task
Comments:
Abstract:Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well-established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzle datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.
51. 【2605.12074】BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding
Link: https://arxiv.org/abs/2605.12074
Authors: Patrick Knab, Orgest Xhelili, Inis Buzi, Drago Andres Guggiana Nilo, Mohd Saquib Khan, Lorenz Kolb, Manuel Scherzer, Kerem Yildirir, Christian Bartelt, Philipp Johannes Schubert
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: general physical intelligence, central to general, primary modality, modality for capturing, capturing both state
Comments:
Abstract:Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at this https URL.
52. 【2605.12072】PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting
Link: https://arxiv.org/abs/2605.12072
Authors: Hantang Li, Qiang Zhu, Xiandong Meng, Xingtao Wang, Debin Zhao, Xiaopeng Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: randomly suppressing Gaussian, suppressing Gaussian primitives, Gaussian Splatting, methods alleviate overfitting, sparse-view Gaussian splatting
Comments: 11 pages, 8 figures
Abstract:Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representation. In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of the dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely-used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability and significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while remaining simple and plug-and-play for improving dropout-based optimization.
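To make the paired-dropout idea above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the box (average-pool) low-pass filter, the L1 distance, the linear ramp schedule, and every function name (`paired_dropout_masks`, `avg_pool`, `low_freq_consistency`, `consistency_weight`) are assumptions chosen for illustration. The abstract only states that two dropped subsets of a shared Gaussian field are constrained to agree at low frequency, with a weight that grows during training.

```python
import random
from typing import List, Tuple

Image = List[List[float]]

def paired_dropout_masks(n: int, drop_rate: float, rng: random.Random) -> Tuple[List[bool], List[bool]]:
    """Draw two independent keep-masks over the same n primitives (True = kept)."""
    mask_a = [rng.random() >= drop_rate for _ in range(n)]
    mask_b = [rng.random() >= drop_rate for _ in range(n)]
    return mask_a, mask_b

def avg_pool(img: Image, k: int) -> Image:
    """Assumed low-pass filter: mean over non-overlapping k x k blocks."""
    h, w = len(img), len(img[0])
    return [
        [sum(img[y + dy][x + dx] for dy in range(k) for dx in range(k)) / (k * k)
         for x in range(0, w - k + 1, k)]
        for y in range(0, h - k + 1, k)
    ]

def low_freq_consistency(img_a: Image, img_b: Image, k: int = 2) -> float:
    """Mean absolute difference between the low-passed versions of two renders."""
    low_a, low_b = avg_pool(img_a, k), avg_pool(img_b, k)
    diffs = [abs(a - b) for row_a, row_b in zip(low_a, low_b) for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def consistency_weight(step: int, total_steps: int, max_weight: float = 1.0) -> float:
    """Assumed progressive schedule: linear ramp from 0 to max_weight."""
    return max_weight * min(1.0, step / total_steps)
```

In this sketch, two renders that differ only in fine detail but share the same block averages incur zero penalty, which mirrors the stated goal of not over-constraining ambiguous high-frequency content. A real 3DGS pipeline would apply the masks to Gaussian primitives before rasterization and use differentiable tensor operations instead of Python lists.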
53. 【2605.12069】Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection
Link: https://arxiv.org/abs/2605.12069
Authors: Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Zero-shot anomaly detection, anomaly detection aims, Zero-shot anomaly, anomaly detection, detection aims
Comments: Accepted to ICIP 2026
Abstract:Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. this https URL
54. 【2605.12064】TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images
Link: https://arxiv.org/abs/2605.12064
Authors: Zhuoyu Cai, Dou Quan, Ning Huyan, Pei He, Shuang Wang, Licheng Jiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Existing deep learning-based, synthetic aperture radar, Existing deep, capture shared features, deep learning-based methods
Comments:
Abstract:Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.
55. 【2605.12038】OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
Link: https://arxiv.org/abs/2605.12038
Authors: Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: video generation aims, embodied intelligence, generation aims, Cross-embodiment video generation, scalable data generation
Comments:
Abstract:Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
56. 【2605.12034】Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Link: https://arxiv.org/abs/2605.12034
Authors: Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: jointly understand audio, understand audio, answer a query, intended to jointly, jointly understand
Comments:
Abstract:Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.
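The audit step described above (probe with visual input only, drop what is solvable, keep the full subset when filtering is undefined or unstable) can be sketched as follows. This is an illustrative assumption, not the paper's code: the item schema, the `visual_only_correct` field, the `min_kept` threshold, and the function name `clean_subset` are all hypothetical.

```python
from typing import Dict, List, Optional

Item = Dict[str, Optional[bool]]

def clean_subset(items: List[Item], min_kept: int = 1) -> List[Item]:
    """Drop queries that a visual-only probe already answers correctly.

    Each item carries a hypothetical 'visual_only_correct' flag:
    True/False from the probe, or None when the probe is undefined
    for that item (for example, no visual input exists).
    """
    flags = [item.get("visual_only_correct") for item in items]
    # Assumption: if the probe is undefined for any item, keep the full subset.
    if any(flag is None for flag in flags):
        return list(items)
    kept = [item for item in items if not item["visual_only_correct"]]
    # Assumption: too few survivors would make comparisons unstable, so keep all.
    if len(kept) < min_kept:
        return list(items)
    return kept
```

For scale, the reported figures (8,551 retained out of 16,968 audited queries) correspond to a retention rate of roughly 50.4%.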
57. 【2605.12031】Resilient Vision-Tabular Multimodal Learning under Modality Missingness
Link: https://arxiv.org/abs/2605.12031
Authors: Camillo Maria Caruso, Valerio Guarrasi, Paolo Soda
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong potential, integrating heterogeneous data, heterogeneous data sources, structured clinical variables, medical applications
Comments:
Abstract:Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
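The masked self-attention mentioned above can be illustrated with a small single-head sketch in plain Python. It covers only the forward aggregation step (missing tokens receive zero attention weight); the gradient-blocking behaviour, the learnable modality tokens, and the modality-dropout regularizer from the abstract are not modelled. The function name `masked_attention` and the `present` flag list are assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]

def masked_attention(queries: List[Vector], keys: List[Vector],
                     values: List[Vector], present: List[bool]) -> List[Vector]:
    """Scaled dot-product attention that skips tokens flagged as missing.

    present[j] is False when token j (a feature or a whole modality)
    is unavailable; such tokens contribute nothing to the output.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scores are computed only for tokens that are actually present.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if ok else None
            for k, ok in zip(keys, present)
        ]
        valid = [s for s in scores if s is not None]
        if not valid:
            # Nothing to attend to: return a zero vector for this query.
            outputs.append([0.0] * out_dim)
            continue
        top = max(valid)  # subtract the max for numerical stability
        exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) for c in range(out_dim)
        ])
    return outputs
```

Because the softmax is renormalized over present tokens only, the output for a sample with a missing modality is a proper weighted average of what remains, with no imputed placeholder values involved.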
58. 【2605.12027】4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation
Link: https://arxiv.org/abs/2605.12027
Authors: Ying Zang, Xuanyi Liu, Yidong Han, Deyi Ji, Chaotao Ding, Yuanqi Hu, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: challenging task, Reconstructing dynamic, monocular videos, Reconstructing, Topological Subspace Surgery
Comments:
Abstract:Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
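The inverse-variance weighting named in component (3) is a standard rule for combining independent noisy estimates, and a per-pixel version can be sketched in a few lines. How the paper obtains the per-pass variances is not stated in the abstract, so they are simply inputs here; the function name `inverse_variance_fuse` and the `eps` guard are assumptions for this example.

```python
from typing import Sequence, Tuple

def inverse_variance_fuse(depths: Sequence[float], variances: Sequence[float],
                          eps: float = 1e-12) -> Tuple[float, float]:
    """Fuse several depth predictions for one pixel.

    Each prediction is weighted by 1 / variance, so confident passes
    dominate. Returns (fused_depth, fused_variance).
    """
    if len(depths) != len(variances) or len(depths) == 0:
        raise ValueError("depths and variances must be non-empty and equal in length")
    weights = [1.0 / (var + eps) for var in variances]  # eps guards against zero variance
    total = sum(weights)
    fused_depth = sum(w * d for w, d in zip(weights, depths)) / total
    fused_variance = 1.0 / total
    return fused_depth, fused_variance
```

As a quick check, fusing depths 2.0 and 4.0 with variances 1.0 and 3.0 gives weights 1 and 1/3, hence a fused depth of 2.5 and a fused variance of 0.75: the result sits closer to the more confident pass, and the fused variance is lower than either input.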
59. 【2605.12026】Spectral Vision Transformer for Efficient Tokenization with Limited Data
Link: https://arxiv.org/abs/2605.12026
Authors: Alexandra G. Roberts, Maneesh John, Jinwei Zhang, Dominick Romano, Mert Sisman, Ki Sueng Choi, Heejong Kim, Mert R. Sabuncu, Thanh D. Nguyen, Alexey V. Dimov, Pascal Spincemaille, Brian H. Kopell, Yi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Keywords: medical imaging, architecture for efficient, efficient tokenization, tokenization in limited, emphasis on medical
Comments:
Abstract:We propose a novel spectral vision transformer architecture for efficient tokenization in limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show equitable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: this http URL.
60. 【2605.12021】What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
Link: https://arxiv.org/abs/2605.12021
Authors: Ryota Yoshihashi, Masahiro Kada, Satoshi Ikehata, Rei Kawakami, Ikuro Sato
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: understanding tasks involve, tasks involve identifying, image understanding tasks, involve identifying, image understanding
Comments:
Abstract:Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.
61. 【2605.12017】FAME: Feature Activation Map Explanation on Image Classification and Face Recognition
Link: https://arxiv.org/abs/2605.12017
Authors: Xinyi Zhang, Manuel Günther
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: revolutionized machine learning, reaching unprecedented levels, Deep Learning, machine learning, reaching unprecedented
Comments: Accepted for CVPR Workshop 2026
Abstract:Deep Learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique on two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We qualitatively and quantitatively show that FAME produces attribution maps that are competitive with state-of-the-art systems. Our code is available at this https URL.
62. 【2605.12013】L2P: Unlocking Latent Potential for Pixel Generation
Link: https://arxiv.org/abs/2605.12013
Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: recently regained attention, Pixel diffusion models, recently regained, regained attention, attention for visual
Comments: Project page: [this https URL](https://nju-pcalab.github.io/projects/L2P/)
Abstract:Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
63. 【2605.12006】Robust Promptable Video Object Segmentation
Link: https://arxiv.org/abs/2605.12006
Authors: Sohyun Lee, Yeho Gwon, Lukas Hoyer, Konrad Schindler, Christos Sakaridis, Suha Kwak
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: prevents PVOS deployment, models substantially degrades, prevents PVOS, PVOS deployment, safety-critical domains
Comments: Accepted to CVPR 2026
Abstract:The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at this https URL.
64. 【2605.12002】EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization
Link: https://arxiv.org/abs/2605.12002
Authors: Minh-Khoa Le-Phan, Minh-Hoang Le, Minh-Triet Tran, Trong-Le Do
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: SID and IFL, forgery increasingly realistic, challenging both SID, Text-guided inpainting, made image forgery
Comments: Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)
Abstract:Text-guided inpainting has made image forgery increasingly realistic, challenging both SID and IFL. However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the Manipulated Region Localization Task of the MediaEval 2025 SynthIM challenge, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
65. 【2605.11989】A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
Link: https://arxiv.org/abs/2605.11989
Authors: Nermeen Abou Baker, Nico Zengeler, Uwe Handmann
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: reusing learned weights, machine learning technique, Transfer learning, previously acquired knowledge, learned weights
Comments: Published by Machine Learning and Knowledge Extraction Journal
Abstract:Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models both in training sessions in one episode and with ten episodes.
66. 【2605.11977】Optimizing 4D Wires for Sparse 3D Abstraction
Link: https://arxiv.org/abs/2605.11977
Authors: Dong-Yi Wu, Tong-Yee Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: B-spline with spatial, variable width, spatial coordinates, coordinates and variable, single continuous
Comments:
Abstract:We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
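To make the $(x,y,z,w)$ parameterization concrete, here is a minimal sketch of sampling a 4D wire from control points. It assumes a uniform cubic B-spline, which the abstract does not specify, and the names `cubic_bspline_point` and `sample_wire` are hypothetical. It is not the authors' code and leaves out the differentiable rendering step entirely.

```python
def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    Each control point is a tuple (x, y, z, w): position plus wire width.
    """
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_wire(control_points, samples_per_segment=8):
    """Sample every segment of the spline; consecutive segments share 3 control points."""
    wire = []
    for i in range(len(control_points) - 3):
        segment = control_points[i:i + 4]
        for s in range(samples_per_segment):
            wire.append(cubic_bspline_point(*segment, s / samples_per_segment))
    return wire


# Hypothetical control points: (x, y, z, width).
control_points = [
    (0.0, 0.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 1.0),
    (1.0, 1.0, 0.0, 2.0),
    (0.0, 1.0, 1.0, 2.0),
    (0.0, 0.0, 1.0, 1.0),
]
wire = sample_wire(control_points)
```

Because the basis weights are non-negative and sum to one, every sampled width stays between the smallest and largest control-point width, so position and thickness vary smoothly along one continuous curve.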
67. 【2605.11967】H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes
Link: https://arxiv.org/abs/2605.11967
Authors: ByungHa Ko, Youngmin Lee, Dong Hwan Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recover scene groups, fine object parts, fixed vocabulary, aims to recover, recover scene
Comments:
Abstract:Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta's objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
68. 【2605.11963】What Does It Mean for a Medical AI System to Be Right?
Link: https://arxiv.org/abs/2605.11963
Authors: Antony Gitau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: digitized bone marrow, bone marrow smears, specific clinical context, multiple myeloma, grounding the question
Comments: Part of a PhD ethics course
Abstract:This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.
69. 【2605.11960】Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Link: https://arxiv.org/abs/2605.11960
Authors: Gengluo Li, Shangpin Peng, Xingyu Wan, Chengquan Zhang, Hao Feng, Xin Xu, Pian Wu, Bang Li, Zengmao Ding, Yongge Liu, Yipei Ye, Yang Yang, Zhan Shu, Guojun Yan, Zhe Li, Can Ma, Weiping Wang, Yu Zhou, Han Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Large Language, Large Language Models, Vision Large, Language Models, Large Language
Comments:
Abstract:Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at this https URL.
70. 【2605.11959】Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
Link: https://arxiv.org/abs/2605.11959
Authors: Maham Nazir, Muhammad Aqeel, Richong Zhang, Francesco Setti
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal video summarization, Multimodal video, video summarization requires, summarization requires visual, language generation
Comments: Accepted to ICPR 2026
Abstract:Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. this https URL
71. 【2605.11939】Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
Link: https://arxiv.org/abs/2605.11939
Authors: Boyang Guo, Liang Li, Lin Peng, Yuhan Gao, Xichun Sheng, Chenggang Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: pre-trained vision-language models, fine-tuning pre-trained vision-language, vision-language models, learning has emerged, efficient alternative
Comments:
Abstract:Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
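For readers unfamiliar with the Equiangular Tight Frame mentioned above, the sketch below builds the standard simplex ETF from the neural collapse literature: $K$ unit vectors whose pairwise cosine is $-1/(K-1)$, the most evenly separated arrangement possible. It only illustrates the target geometry. It is not the CPT separation loss, and the function names are hypothetical.

```python
import math


def simplex_etf(num_classes):
    """Rows of sqrt(K / (K - 1)) * (I - 11^T / K): one unit vector per class."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


etf = simplex_etf(4)
# Every pair of distinct class vectors has the same cosine, -1 / (K - 1).
pairwise = [cosine(etf[i], etf[j]) for i in range(4) for j in range(i + 1, 4)]
```

A separation loss in this spirit would push class text embeddings toward such a configuration. How CPT restricts that to cluster-level neighborhoods is described only at a high level in the abstract.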
72. 【2605.11934】Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution
Link: https://arxiv.org/abs/2605.11934
Authors: Chen Wu, Ling Wang, Zhuoran Zheng, Xiangyu Chen, Jingyuan Xia, Weidong Jiang, Jiantao Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Guided depth super-resolution, Guided depth, Guided, RGB guidance
Comments: ISCAS2026
Abstract:Guided depth super-resolution (GDSR) reconstructs HR depth maps from LR inputs with HR RGB guidance. Existing methods either model each modality independently or rely on computationally expensive attention mechanisms with quadratic complexity, hindering the establishment of efficient and semantically interactive joint representations. In this paper, we observe that feature maps from different modalities exhibit semantic-level correlations during feature extraction. This motivates us to develop a more flexible approach enabling dense, semantically-aware deep interactions between modalities. To this end, we propose a novel GDSR framework centered around the Interactive State Space Model. Specifically, we design a cross-modal local scanning mechanism that enables fine-grained semantic interactions between RGB and depth features. Leveraging the Mamba architecture, our framework achieves global modeling with linear complexity. Furthermore, a cross-modal matching transform module is introduced to enhance interactive modeling quality by utilizing representative features from both modalities. Extensive experiments demonstrate competitive performance against state-of-the-art methods.
73. 【2605.11931】Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
Link: https://arxiv.org/abs/2605.11931
Authors: Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Large Language, explicit reasoning traces, reasoning traces
Comments: Accepted by ICML 2026
Abstract:Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
74. 【2605.11927】RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
Link: https://arxiv.org/abs/2605.11927
Authors: Qi Zhao, Jun Chen, Ivor Tsang, Guang Dai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse single images, generating diverse single, sequential generation reveals, modern diffusion models, diffusion models excel
Comments: CVPR2026
Abstract:While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at this https URL.
75. 【2605.11913】Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization
Link: https://arxiv.org/abs/2605.11913
Authors: Jaerin Lee, Kanggeon Lee, Kyoung Mu Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabled powerful gradient-based, Differentiable vector graphics, powerful gradient-based optimization, Differentiable vector, raster images
Comments: 22 pages, 12 figures
Abstract:Differentiable vector graphics have enabled powerful gradient-based optimization of vector primitives directly from raster images. However, existing frameworks formulate this as a flat optimization problem, forcing hundreds to thousands of randomly initialized curves to blindly compete for pixel-level error reduction. This disordered optimization leads to topology collapse, where macroscopic structures are distorted by internal high-frequency noise, resulting in a redundant and uneditable "polygon soup" that limits practical editability. To address this limitation, we propose Vector Scaffolding, a novel hierarchical optimization framework that shifts from flat pixel-matching to structured topological construction tailored for vector graphics. By identifying a key cause of topology collapse as the mathematical imbalance between area and boundary gradients, we introduce Interior Gradient Aggregation to stabilize the learning dynamics of multi-scale curve mixtures. Upon this stabilized landscape, we employ Progressive Stratification and Rapid Inflation Scheduling to progressively densify vector primitives with extremely high learning rates ($\times 50$). Experiments demonstrate that our approach accelerates optimization by $2.5\times$ while simultaneously improving PSNR by up to 1.4 dB over the previous state of the art.
76. 【2605.11904】Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning
Link: https://arxiv.org/abs/2605.11904
Authors: Huiyu Yi, Zhiming Xu, Dunwei Tu, Zhicheng Wang, Baile Xu, Furao Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Fully Connected layers, Nearest Class, Fully Connected, catastrophic forgetting compared, Connected layers
Comments: accepted by ICML2026
Abstract:The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a "local-to-global" representation. Furthermore, we introduce the Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at this https URL.
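As background for the baseline this paper argues against, here is a minimal sketch of a plain Nearest Class Mean classifier. It represents each class by a single point, which is exactly the assumption the abstract calls suboptimal. It is not HC-SOINN or STAR, and the helper names `class_means` and `ncm_predict` are hypothetical.

```python
import math


def class_means(features, labels):
    """Average the feature vectors of each class into one prototype per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def ncm_predict(means, x):
    """Return the label whose class mean is closest to x in Euclidean distance."""
    return min(means, key=lambda y: math.dist(means[y], x))


# Toy 2D features for two classes.
train_x = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
train_y = ["a", "a", "b", "b"]
means = class_means(train_x, train_y)
pred = ncm_predict(means, [0.9, 0.9])
```

Adding a class later only adds one more entry to `means` and leaves the existing prototypes untouched, which is one intuition for why NCM forgets less than a retrained fully connected layer.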
77. 【2605.11900】Mobile Traffic Camera Calibration from Road Geometry for UAV-Based Traffic Surveillance
Link: https://arxiv.org/abs/2605.11900
Authors: Alexey Popov, Natalia Trukhina, Vadim Vashkelis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unmanned aerial vehicles, Unmanned aerial, flexible traffic surveillance, fixed roadside cameras, surveillance where fixed
Comments:
Abstract:Unmanned aerial vehicles (UAVs) can provide flexible traffic surveillance where fixed roadside cameras are unavailable, costly, or impractical. However, raw UAV video is difficult to use for traffic analytics because vehicle motion is observed in perspective image coordinates rather than in a stable metric road coordinate system. This paper presents a lightweight pipeline for converting monocular oblique UAV traffic video into a local metric bird's-eye-view (BEV) representation. Visible road geometry, including lane markings, road borders, and crosswalks, is used to estimate a road-plane homography from image coordinates to metric ground-plane coordinates. Vehicle observations from dataset annotations or detectors are then projected to BEV using estimated ground contact points. The resulting trajectories support estimation of vehicle direction, speed, heading, and dynamic 3D cuboids on the road plane. We evaluate the pipeline on UAVDT using ground-truth annotations to isolate calibration and geometric reconstruction from detector and tracker errors. For sequence M1401, 40 sampled frames from img000001-img000196 produce 632 metric cuboid instances across 23 tracks. Results show that road-geometry calibration can transform monocular UAV footage into interpretable traffic-camera-style analytics, including BEV tracks and synchronized 3D cuboid visualizations. They also reveal key limitations: far-field vehicles are sensitive to homography errors, manual validation is currently more reliable than fully automatic calibration, and the single-plane assumption limits performance in non-planar or ambiguous road regions. The proposed pipeline provides a practical foundation for deployable UAV traffic cameras and future real-time traffic digital-twin systems.
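The core geometric step, mapping image coordinates to metric ground-plane coordinates with a road-plane homography, can be sketched as follows. This is a minimal illustration that fits the homography from exactly four point correspondences by a direct linear solve with $h_{33}=1$. The pixel and metre values are invented, the function names are hypothetical, and the paper's own calibration procedure may differ.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_homography(image_pts, ground_pts):
    """Fit a 3x3 homography (h33 fixed to 1) from four image-to-ground pairs."""
    rows, rhs = [], []
    for (x, y), (gx, gy) in zip(image_pts, ground_pts):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -gx * x, -gx * y])
        rhs.append(gx)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -gy * x, -gy * y])
        rhs.append(gy)
    h = solve_linear(rows, rhs) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def to_ground(H, point):
    """Project an image point (pixels) to ground-plane coordinates (metres)."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    gx = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    gy = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return gx, gy


# Invented example: a lane seen as a trapezoid in the image, known to be
# 3.5 m wide and 30 m long on the ground.
image_pts = [(400.0, 700.0), (880.0, 700.0), (560.0, 300.0), (720.0, 300.0)]
ground_pts = [(0.0, 0.0), (3.5, 0.0), (0.0, 30.0), (3.5, 30.0)]
H = fit_homography(image_pts, ground_pts)
contact = to_ground(H, (640.0, 500.0))  # a vehicle's ground contact point
```

Once contact points from consecutive frames are in metres, speed is their ground distance divided by the frame interval. The sensitivity of far-field vehicles noted in the abstract shows up here as the divisor `w` approaching zero near the horizon.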
78. 【2605.11898】Few-Shot Synthetic Data Generation with Diffusion Models for Downstream Vision Tasks
Link: https://arxiv.org/abs/2605.11898
Authors: Daniil Dushenev, Nazariy Karpov, Daniil Zinovjev, Alexander Gorin, Konstantin Kulikov
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Magnetic Tile Defect, inherently underrepresented, persistent challenge, collecting positive, events are inherently
Comments: 5 pages, 3 figures, 1 table. Accepted at SynData4CV Workshop @ CVPR 2026
Abstract:Class imbalance is a persistent challenge in visual recognition, particularly in safety-critical domains where collecting positive examples is expensive and rare events are inherently underrepresented. We propose a lightweight synthetic data augmentation pipeline that fine-tunes a LoRA adapter on as few as 20-50 real images of a rare class and uses a pretrained diffusion model to generate synthetic samples for training. We systematically vary the synthetic-to-real ratio and evaluate the approach across two structurally different domains: chest X-ray pathology classification (NIH ChestX-ray14) and industrial surface crack detection (Magnetic Tile Defect dataset). All evaluations are performed on held-out sets of real images only. Across both domains, synthetic augmentation consistently improves rare-class recall and F1 compared to training with real data alone. Performance improves with moderate synthetic augmentation and shows diminishing returns as the synthetic ratio increases. These results suggest that LoRA-adapted diffusion models provide a simple and scalable mechanism for augmenting rare classes, enabling effective learning in data-scarce scenarios across heterogeneous visual domains.
79. 【2605.11881】Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data
Link: https://arxiv.org/abs/2605.11881
Authors: Jie Chen, Yuanbiao Gou, Chuanbin Liu, Zhu Wang, Xi Peng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large-scale unlabeled data, extracted from large-scale, large-scale unlabeled, pretrained models, models with diverse
Comments: 18 pages
Abstract:The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace structures when exploiting complementary information across multiple views. Therefore, a fundamental challenge involves constructing sparse similarity graphs that preserve these underlying subspace structures for achieving semantic alignment across heterogeneous views. In this paper, we propose a sparse attention graph learning (SAGL) method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data. Specifically, we introduce a bilinear attention factorization scheme to capture asymmetric similarities among the high-dimensional features, which breaks the symmetry bottleneck that is inherent in the traditional representation learning techniques. A dynamic sparsity gating mechanism then predicts a feature-specific compression factor for adaptively controlling the topological contributions of neighbors. Furthermore, we employ a structured sparse projection via $\alpha$-entmax to generate subspace-preserving sparse attention graphs for individual views. SAGL leverages these view-specific graphs to conduct sparse information aggregation, yielding discriminative representations for multiview learning tasks. In addition, we provide a rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints. Extensive experiments conducted on multiple benchmark datasets demonstrate that SAGL consistently outperforms the state-of-the-art unsupervised transfer learning approaches.
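To show how a projection onto the probability simplex can yield exactly sparse attention weights, the sketch below implements sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax ($\alpha = 1$ recovers softmax). The abstract does not say which $\alpha$ SAGL uses, so this is an illustration of the mechanism only and not the authors' projection.

```python
def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex.

    Scores below the threshold tau receive exactly zero weight, unlike softmax.
    """
    ordered = sorted(scores, reverse=True)
    running = 0.0
    tau = 0.0
    for j, z in enumerate(ordered, start=1):
        running += z
        # The support is the largest j for which this condition still holds.
        if 1.0 + j * z > running:
            tau = (running - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


# Similarity scores of one node to three candidate neighbours.
weights = sparsemax([1.0, 0.8, 0.1])
peaked = sparsemax([2.0, 1.0, 0.1])
```

In a similarity graph, the zero entries correspond to edges that are dropped, so the sparsity pattern comes out of the projection itself instead of a separate top-k threshold.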
80. 【2605.11871】$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
Link: https://arxiv.org/abs/2605.11871
Authors: Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: supplies noisy evidence, partial-observation inverse problem, pretrained flow-matching video, flow-matching video generators, video supplies noisy
Comments:
Abstract:Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.
81. 【2605.11869】FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
Link: https://arxiv.org/abs/2605.11869
Authors: Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Video Diffusion Transformers, per-step inference latency, Diffusion Transformers, inference latency remains, inference latency
Comments:
Abstract:While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
82. 【2605.11867】When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI
Link: https://arxiv.org/abs/2605.11867
Authors: Louise E.G. Baron, Ross Callaghan, David M. Cash, Philip S. Weston, Hojjat Azadbakht, Hui Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Alzheimer disease, assessment in Alzheimer, non-invasive alternative, Structural, PET synthesis
Comments: MICCAI 2026 accepted paper (no rebuttal)
Abstract:Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer's disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.
83. 【2605.11864】Very Efficient Listwise Multimodal Reranking for Long Documents
Link: https://arxiv.org/abs/2605.11864
Authors: Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: computationally expensive component, multimodal retrieval-augmented generation, retrieval-augmented generation, key yet computationally, computationally expensive
Comments: To appear in ICML 2026
Abstract:Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available via the link on the arXiv abstract page.
84. 【2605.11863】GATA2Floor: Graph attention for floor counting in street-view facades
Link: https://arxiv.org/abs/2605.11863
Authors: Ngoc Tan Le, Tzoulio Chamiti, Eirini Papagiannopoulou, Nikos Deligiannis
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Automated analysis, energy assessment, urban analytics, emergency planning, street-level imagery
Comments: Accepted at IEEE ICIP 2026; 6 pages, 5 figures, 3 tables
Abstract:Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.
85. 【2605.11856】UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
Link: https://arxiv.org/abs/2605.11856
Authors: Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal large language, Multimodal large, large language models, visual latent, visual latent reasoning
Comments:
Abstract:Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
86. 【2605.11840】Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
Link: https://arxiv.org/abs/2605.11840
Authors: Zhangcheng Hou, Tomoaki Ohtsuki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Radar-camera depth estimation, per-pixel depth map, dense per-pixel depth, Radar-camera depth, metric radar signal
Comments: 16 pages, 3 figures, 9 tables
Abstract:Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods -- concatenation, confidence-aware gating, sparse supervision, graph-based extraction -- combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size $\Delta$ and readout $\mathbf{C}$ while leaving the input projection $\mathbf{B}$ and state dynamics $\mathbf{A}$ image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar's spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0--50, 0--70, and 0--80m, while attaining the lowest single-frame latency (26.8ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.
87. 【2605.11824】REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View
Link: https://arxiv.org/abs/2605.11824
Authors: Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: environmental perception, surroundings is generally, generally offered, crucial for environmental, realistic view
Comments: IEEE Intelligent Transportation Systems Conference (ITSC) 2025
Abstract:A realistic view of the vehicle's surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.
88. 【2605.11818】RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
Link: https://arxiv.org/abs/2605.11818
Authors: Binhao Wang, Shihao Zhao, Bo Cheng, Qiuyu Ji, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Dawei Leng, Yuhui Yin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made substantial progress, Recent diffusion-based approaches, Recent diffusion-based, made substantial, substantial progress
Comments:
Abstract:Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image dataset constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.
89. 【2605.11817】See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
Link: https://arxiv.org/abs/2605.11817
Authors: Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: hinders real-time deployment, shown remarkable promise, high computational cost, computational cost hinders, cost hinders real-time
Comments:
Abstract:Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in the success rate, validating what the authors describe as the lowest feasible visual token count reported to date. The code is available via the link on the arXiv abstract page.
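Note on differentiable interpolation: the abstract does not say which interpolation scheme GridS uses, so the sketch below shows plain bilinear sampling as one common choice. It reads a feature map at a continuous coordinate; because the four blending weights are linear in the coordinate, an autograd framework could pass gradients back to the predicted coordinates. The function name `bilinear_sample` and the toy feature map are assumptions made for illustration, not part of the paper.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D feature map at a continuous location.

    feat is a list of rows; x is the column coordinate and y the row
    coordinate, both in pixel units. Out-of-range coordinates are clamped.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0  # fractional offsets act as blending weights
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom


# Toy 2x2 feature map; sampling at the center averages the four corners.
feature_map = [[0.0, 1.0], [2.0, 3.0]]
center = bilinear_sample(feature_map, 0.5, 0.5)
```

Sampling at a handful of predicted coordinates instead of keeping every grid token is what lets a resampler reduce the token count without snapping to a fixed patch grid.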
90. 【2605.11808】Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
Link: https://arxiv.org/abs/2605.11808
Authors: Zhenxin Qin, Qiang Li, Qingzhuo Wang, Ruiyang Qin, Zhihua Wei, Wen Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, diverse vision-language tasks, Large Vision-Language, Vision-Language Models, vision-language tasks
Comments:
Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
91. 【2605.11804】Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning
Link: https://arxiv.org/abs/2605.11804
Authors: Patryk Krukowski, Jacek Tabor, Przemysław Spurek, Marek Śmieja, Łukasz Struski
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: mitigate catastrophic forgetting, catastrophic forgetting, synthesize pseudo-samples, pseudo-samples and mitigate, mitigate catastrophic
Comments:
Abstract:Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.
92. 【2605.11803】OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
Link: https://arxiv.org/abs/2605.11803
Authors: Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim, Donghyeong Kim, Dayeon Lee, Heeseung Choi, Ig-jae Kim, Sangyoun Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Video Large Language, Language Models, Large Language, grows rapidly due
Comments: 22 pages, 9 figures. Code available at [this https URL](https://github.com/minseokii/OTT-Vid)
Abstract:As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
93. 【2605.11799】SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions
Link: https://arxiv.org/abs/2605.11799
Authors: Markus Essl, Marta Moscati, Mubashir Noman, Muhammad Zaigham Zaheer, Usman Naseem, Shah Nawaz, Markus Schedl
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable performance, autonomous vehicles, demonstrated remarkable, Multimodal sensor fusion, LiDAR data
Comments: Accepted at ICIP 2026
Abstract:Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird's-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.
94. 【2605.11782】Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision
Link: https://arxiv.org/abs/2605.11782
Authors: Antoni Valls, Jordi Sanchez-Riera
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: impairment affects hundreds, Visual impairment affects, navigate urban environments, urban environments safely, severely limiting
Comments: 10 pages, 6 figures, submitted to IEEE T-ITS
Abstract:Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
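Note on the risk scoring step: the abstract describes aggregating model answers into a weighted score that maps each street segment to one of four safety categories, but gives no weights, hazard names, or thresholds. The sketch below only illustrates the general shape of such a mapping. Every hazard name, weight, threshold, and category label in it is hypothetical.

```python
# All names and numbers below are hypothetical placeholders, not values
# taken from the paper.
HAZARD_WEIGHTS = {
    "construction": 0.5,
    "missing_curb_ramp": 0.3,
    "heavy_traffic": 0.2,
}
CATEGORIES = ["safe", "caution", "risky", "dangerous"]


def risk_score(answers):
    """Weighted sum of per-hazard rates for one street segment.

    answers maps a hazard name to the fraction of images in the segment
    for which the VQA model answered "yes" (0.0 to 1.0).
    """
    return sum(w * answers.get(h, 0.0) for h, w in HAZARD_WEIGHTS.items())


def risk_category(score, thresholds=(0.25, 0.5, 0.75)):
    """Map a score in [0, 1] to one of four discrete categories."""
    index = sum(score >= t for t in thresholds)
    return CATEGORIES[index]


segment = {"construction": 1.0}
label = risk_category(risk_score(segment))
```

Keeping the weights in a single table makes it easy to retune how strongly each hazard contributes without touching the category logic.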
95. 【2605.11771】Revisiting Shadow Detection from a Vision-Language Perspective
Link: https://arxiv.org/abs/2605.11771
Authors: Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: models rely primarily, pixel-wise visual supervision, similar dark regions, commonly formulated, models rely
Comments:
Abstract:Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision--language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision--Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than $1\%$ trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.
96. 【2605.11760】M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
Link: https://arxiv.org/abs/2605.11760
Authors: Jiyuan Liu, Jia Lin, Xiaofei Zhou, Runmin Cong, Deyang Liu, Zhi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Segment Anything Model, foundation model, universal segmentation, model for universal, Model
Comments: 10 pages, 3 figures
Abstract:The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves the state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
97. 【2605.11756】Focusable Monocular Depth Estimation
Link: https://arxiv.org/abs/2605.11756
Authors: Yuxin Du, Tao Lin, Zile Zhong, Runting Li, Xiyao Chen, Jiting Liu, Chenglin Liu, Ying-Cong Chen, Yuqian Fu, Bo Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: uniform pixel-wise objectives, foundation models generalize, Monocular depth foundation, Focusable Monocular Depth, monocular relative depth
Comments:
Abstract:Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
98. 【2605.11755】One-Step Generative Modeling via Wasserstein Gradient Flows
Link: https://arxiv.org/abs/2605.11755
Authors: Jiaqi Han,Puheng Li,Qiushan Guo,Renyuan Xu,Stefano Ermon,Emmanuel J. Candès
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Keywords: impressive generative capability, shown impressive generative, shown impressive, requires many iterative, target distribution
Comments: 38 pages, 14 figures
Abstract:Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. Training proceeds in two stages: first, we define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256×256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100× faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
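The energy functional named above is the Sinkhorn divergence, commonly defined as S(a, b) = OT(a, b) - 0.5 OT(a, a) - 0.5 OT(b, b), where OT is the entropy-regularized transport cost. A minimal pure-Python sketch for two small uniform point clouds follows. The squared Euclidean cost, the value of `eps`, and the iteration count are assumptions for illustration, and this is not W-Flow's training code.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def entropic_ot(x, y, eps=0.1, iters=200):
    """Entropic OT value between two uniform point clouds (squared Euclidean cost)."""
    n, m = len(x), len(y)
    cost = [[sum((a - b) ** 2 for a, b in zip(xi, yj)) for yj in y] for xi in x]
    log_a, log_b = -math.log(n), -math.log(m)  # uniform weights
    f, g = [0.0] * n, [0.0] * m                # dual potentials
    for _ in range(iters):
        # Alternating log-domain Sinkhorn updates of the two potentials.
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
             for j in range(m)]
    # Dual objective; approximates the entropic OT value once the iterations converge.
    return sum(f) / n + sum(g) / m

def sinkhorn_divergence(x, y, eps=0.1, iters=200):
    """Debiased divergence: subtracting the self-terms makes S(x, x) vanish."""
    return (entropic_ot(x, y, eps, iters)
            - 0.5 * entropic_ot(x, x, eps, iters)
            - 0.5 * entropic_ot(y, y, eps, iters))

x = [(0.0, 0.0), (1.0, 0.0)]
y = [(0.0, 1.0), (1.0, 1.0)]  # x shifted by one unit
print(sinkhorn_divergence(x, x), sinkhorn_divergence(x, y))
```

The self-terms remove the entropic bias, so the divergence is zero for identical clouds and grows as the clouds move apart.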
99. 【2605.11750】DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
Link: https://arxiv.org/abs/2605.11750
Authors: Xianzhe Fan,Yuxiang Lu,Shenyuan Gao,Xiaoyang Wu,Ruihua Han,Manling Li,Hengshuang Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: minor action errors, VLA models, existing VLA models, VLA models rely, enables VLA models
Comments: 19 pages, 7 figures
Abstract:Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at this https URL.
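The three steps listed in the abstract (trigger, propose, evaluate) can be read as a small control loop. A minimal sketch, assuming the trigger, policy sampler, and evaluator are plain callables; all names here are hypothetical stand-ins rather than DreamAvoid's actual interfaces.

```python
from typing import Any, Callable, List

ActionChunk = List[float]

def select_action_chunk(
    obs: Any,
    policy_sample: Callable[[Any], ActionChunk],       # stand-in for the Action Proposer
    is_critical: Callable[[Any], bool],                # stand-in for the Dream Trigger
    dream_value: Callable[[Any, ActionChunk], float],  # stand-in for the Dream Evaluator
    num_candidates: int = 4,
) -> ActionChunk:
    # Outside critical phases, execute the policy's sample directly.
    if not is_critical(obs):
        return policy_sample(obs)
    # Inside a critical phase, draw several candidates and keep the one
    # whose imagined short-horizon future scores highest.
    candidates = [policy_sample(obs) for _ in range(num_candidates)]
    return max(candidates, key=lambda chunk: dream_value(obs, chunk))

# Toy usage: the "policy" replays three fixed chunks, and the value prefers small motions.
chunks = iter([[0.9], [0.1], [0.5]])
best = select_action_chunk(
    obs=None,
    policy_sample=lambda o: next(chunks),
    is_critical=lambda o: True,
    dream_value=lambda o, c: -abs(c[0]),
    num_candidates=3,
)
print(best)
```

Gating on the trigger keeps the extra sampling and evaluation cost confined to the phases where small errors matter most.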
100. 【2605.11748】BronchoLumen: Analysis of recent YOLO-based architectures for real-time bronchial orifice detection in video bronchoscopy
Link: https://arxiv.org/abs/2605.11748
Authors: Yongchao Li,Marian Himstedt
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: intensive care units, tract remains challenging, respiratory tract remains, care units, remains challenging
Comments: 10 pages, 4 figures, IPCAI 2026
Abstract:Bronchoscopy is routinely conducted in pulmonary clinics and intensive care units, but navigating the complex branching of the respiratory tract remains challenging. This paper introduces BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy, aiming to assist navigation and CAD systems. The paper investigates whether bronchial orifices can be robustly detected across image domains using state-of-the-art object detection and a limited set of public image data. The study describes and compares YOLOv8, a widely adopted architecture, and YOLOv12, a more recent architecture integrating attention-based modules to improve spatial reasoning. Both models are trained and tested solely on publicly available datasets comprising different image domains. The models are compared using the common metrics mAP@0.5 and mAP@0.5:0.9, with the latter emphasizing localization accuracy. For YOLOv8 we obtained an mAP@0.5 of 0.91 on an in-domain and 0.68 on a cross-domain test set. YOLOv12 achieved 0.84 and 0.68, respectively, with slightly better localization accuracy: mAP@0.5:0.9 of 0.48 and 0.26, compared to 0.45 and 0.25 for YOLOv8. Motion blur and low contrast occasionally caused uncertain detections, but the system demonstrated overall robustness in most scenarios. BronchoLumen is an open-weight, YOLO-based solution for bronchial orifice detection offering high accuracy and efficiency across multiple image domains. While the more recent YOLOv12 achieves better localization accuracy, we observed slightly worse precision. The models have been made publicly available to foster further research in bronchoscopy navigation.
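Both mAP variants above rest on the intersection-over-union (IoU) between a predicted and a ground-truth box. A minimal sketch of that building block, assuming `(x1, y1, x2, y2)` corner coordinates; it illustrates the matching criterion only, not the full mAP computation or the authors' evaluation script.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A detection counts as correct when its IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold

gt_box = (0, 0, 10, 10)
pred_box = (2, 0, 12, 10)  # same size, shifted two pixels to the right
print(is_true_positive(pred_box, gt_box, 0.5))   # loose threshold
print(is_true_positive(pred_box, gt_box, 0.75))  # strict threshold
```

The shifted box passes at 0.5 but fails at 0.75, which is why averaging over stricter thresholds rewards tighter localization.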
101. 【2605.11743】WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views
Link: https://arxiv.org/abs/2605.11743
Authors: SeongMin Jin,Doo Seok Jeong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: capture both semantic, information is central, semantic and spatial, spatial information, Learning latent representations
Comments: Accepted as a regular paper at ICML2026
Abstract:Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at this https URL.
102. 【2605.11727】Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Link: https://arxiv.org/abs/2605.11727
Authors: Kepeng Xu,Li Xu,Gang He,Wenxin Yu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: models typically reason, post-ISP RGB images, Vision-language models typically, quantize sensor evidence, models typically
Comments:
Abstract:Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
103. 【2605.11723】CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
Link: https://arxiv.org/abs/2605.11723
Authors: Jiyuan Wang,Huan Ouyang,Jiuzhou Lin,Chunyu Lin,Dewen Fan,Boheng Zhang,Haonan Fan,Fei Zuo,Jia Sun,Huaiqing Wang,Honglie Wang,Yiyang Fan,Zhenlong Yuan,Zijun Li,Yongrui Heng,Guosheng Lin,Fan Yang,Tingting Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Relative Policy Optimization, Group Relative Policy, propose Concentrate, Vision-Language Models, temporal anomaly windows
Comments: 27 pages, 10 figures
Abstract:In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and is then optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks. When used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
104. 【2605.11722】EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
Link: https://arxiv.org/abs/2605.11722
Authors: Sunung Mun,Sunghyun Cho,Jungseul Ok
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: involving multiple objects, synthesize realistic images, prompts involving multiple, synthesize realistic, involving multiple
Comments:
Abstract:Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
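The routing rule described above (accept only when every predicate holds, edit on local failures, resample on global ones) can be sketched in a few lines. The `Predicate` type, the `scope` labels, and the precedence of global over local failures are assumptions made for illustration; the abstract does not specify how EPIC classifies or prioritizes failures.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Predicate:
    name: str
    scope: str  # "local" or "global" (hypothetical labels)

def next_step(program: List[Predicate], results: Dict[str, bool]) -> str:
    """Decide the next action from the verification results of a fixed program."""
    failed = [p for p in program if not results[p.name]]
    if not failed:
        return "accept"    # every predicate is satisfied
    if any(p.scope == "global" for p in failed):
        return "resample"  # assumed: a global failure outranks local ones
    return "edit"          # only local failures remain: targeted editing

# Toy program for a prompt such as "a red dog" (predicate names are illustrative).
program = [Predicate("has_dog", "global"), Predicate("dog_is_red", "local")]
print(next_step(program, {"has_dog": True, "dog_is_red": False}))
```

Because the program is parsed once and never changes, each refinement round only needs fresh verification results, not a new parse of the prompt.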
105. 【2605.11710】Unlocking Compositional Generalization in Continual Few-Shot Learning
Link: https://arxiv.org/abs/2605.11710
Authors: Phu-Quy Nguyen-Lam,Phu-Hoa Pham,Dao Sy Duy Minh,Chi-Nguyen Tran,Huynh Trung Kiet,Long Tran-Thanh
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Object-centric representations promise, individual object-level parts, single unit, Object-centric representations, promise a key
Comments: 10 pages
Abstract:Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features' capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.
106. 【2605.11705】CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection
Link: https://arxiv.org/abs/2605.11705
Authors: Boran Zhao,Hetian Liu,Zhenxian Hu,Yuqing Yuan,Yu Yan,Pengju Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: prohibitive computational overhead, inevitably incur prohibitive, incur prohibitive computational, massive image-text datasets, models fundamentally relies
Comments:
Abstract:The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.
107. 【2605.11704】ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation
Link: https://arxiv.org/abs/2605.11704
Authors: Inwoo Hwang,Hojun Jang,Bing Zhou,Jian Wang,Young Min Kim,Chuan Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scale-wise autoregressive framework, text-driven human motion, human motion generation, framework for text-driven, text-driven human
Comments: Project page: [this https URL](https://inwoohwang.me/ScaleMoGen)
Abstract:We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
108. 【2605.11696】WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting
Link: https://arxiv.org/abs/2605.11696
Authors: Lezhong Wang,Mehmet Onurcan Kaya,Siavash Bigdeli,Jeppe Revall Frisvad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Keywords: achieved impressive photorealism, Recent single-image relighting, advanced generative models, Recent single-image, single-image relighting methods
Comments: Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: [this https URL](https://lez-s.github.io/wildrelight_proj/)
Abstract:Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.
109. 【2605.11695】Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
Link: https://arxiv.org/abs/2605.11695
Authors: Mikako Ochiai,Masatoshi Nagano,Tadahiro Taniguchi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: visual, shared, sequences, agents, token sequences
Comments:
Abstract:Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
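The listener-side accept/reject step can be illustrated with a generic Metropolis-Hastings-style ratio test. This is a sketch under the assumption that the listener scores each caption with a log-likelihood under its own visual features; the function name and scoring interface are hypothetical, and the paper's exact criterion is not given in the abstract.

```python
import math
import random

def mh_accept(logp_proposed: float, logp_current: float, rng: random.Random) -> bool:
    """Accept the speaker's proposal with probability min(1, p_proposed / p_current).

    Both log-probabilities are evaluated by the listener against its own
    private visual features, so no perceptual state is shared between agents.
    """
    log_ratio = min(0.0, logp_proposed - logp_current)
    return rng.random() < math.exp(log_ratio)

rng = random.Random(0)
# A proposal the listener finds more plausible than its current caption is always accepted.
print(mh_accept(-1.0, -2.0, rng))
# A far less plausible proposal is almost never accepted.
print(mh_accept(-50.0, -1.0, rng))
```

Since only token sequences and accept/reject decisions cross between agents, each model is updated from local evidence alone, matching the decentralized setting described above.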
110. 【2605.11683】DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers
Link: https://arxiv.org/abs/2605.11683
Authors: Kaixuan He,Song Chen,Yi Kang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: token sequence length, Vision Transformers, incur significant computational, significant computational overhead, computational overhead due
Comments: Preprint. Under review
Abstract:Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (≤ 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.
111. 【2605.11680】ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
Link: https://arxiv.org/abs/2605.11680
Authors: Shivam Kumar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rendered raster image, executable drawing program, deterministic evaluator re-renders, raster image, rendered raster
Comments: 14 pages, 5 figures, 2 tables. Code, data, and artifacts: [this https URL](https://github.com/shivamk3r/shape-code-bench) ; archival release: [this https URL](https://doi.org/10.5281/zenodo.20132286)
Abstract:We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
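Three of the listed scores (exact match, pixel accuracy, foreground IoU) compare a re-rendered raster with the target. A minimal sketch on tiny binary grids, assuming `1` marks a foreground pixel; the function name and data layout are illustrative and not taken from the released evaluator.

```python
def score(pred, target):
    """Compare two equally sized binary rasters (lists of rows, 1 = foreground)."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("rasters must have the same size")
    same = sum(p == t for p, t in zip(flat_p, flat_t))
    inter = sum(1 for p, t in zip(flat_p, flat_t) if p and t)
    union = sum(1 for p, t in zip(flat_p, flat_t) if p or t)
    return {
        "exact_match": same == len(flat_t),
        "pixel_accuracy": same / len(flat_t),
        # Two empty rasters are treated as a perfect foreground match.
        "foreground_iou": inter / union if union else 1.0,
    }

target = [[1, 1, 0],
          [0, 0, 0]]
pred = [[1, 0, 0],
        [0, 0, 0]]  # one foreground pixel missing
print(score(pred, target))
```

In this toy case a single missing pixel leaves pixel accuracy at 5/6 and foreground IoU at 0.5, yet exact match fails, which mirrors the abstract's point that small parameter errors sink exact match.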
112. 【2605.11659】Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning
Link: https://arxiv.org/abs/2605.11659
Authors: Yaze Zhao,Yicong Liu,Yixiong Zou,Yuhua Li,Ruixuan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Cross-Domain Few-Shot Learning, CLIP remains underexplored, large-scale pretrained models, adapt large-scale pretrained, specialized target domains
Comments:
Abstract:Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find that adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover that LoRA's superiority stems from rectifying the collapsed attention of the visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find that the textual EOS token exhibits much better attention to visual samples, and that CLIP's standard contrastive loss only weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Code will be released.
113. 【2605.11654】Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
Link: https://arxiv.org/abs/2605.11654
Authors: Chi-Nguyen Tran,Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Long Tran-Thanh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: autonomous drone navigation, geo-referenced satellite tile, oblique drone view, Cross-view geo-localization, navigation when GNSS
Comments: 37 pages, 7 figures, 6 tables
Abstract:Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
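Component (iv) replaces hand-tuned loss scalars with learned uncertainty weights. A minimal sketch of one common form of this weighting, `sum(exp(-s_i) * L_i + s_i)` with a learnable log-variance `s_i` per objective; whether SkyPart uses exactly this form is an assumption, and the loss values below are made up.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses using one learnable log-variance per task.

    Each term exp(-s) * L + s down-weights a loss as s grows, while the
    additive s keeps s from growing without bound.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Two objectives on very different scales (hypothetical values).
losses = [50.0, 0.02]

# With s = 0 the combination is a plain sum, dominated by the large loss.
print(uncertainty_weighted_loss(losses, [0.0, 0.0]))

# For a fixed loss L, each term is minimized at s = log(L), where the
# weighted loss exp(-s) * L equals 1 regardless of the original scale.
balanced = [math.log(loss) for loss in losses]
print([math.exp(-s) * loss for loss, s in zip(losses, balanced)])
```

Learning `s_i` jointly with the model lets objectives on incompatible scales contribute comparably without manual coefficient tuning.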
114. 【2605.11651】Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Link: https://arxiv.org/abs/2605.11651
Authors: Seonghoon Yu,Dongjun Nam,Byung-Kwan Lee,Jeany Son
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: limits real-world deployment, high computational cost, computational cost limits, cost limits real-world, boost reasoning performance
Comments: Pre-print
Abstract:Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
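The mask described above blocks future tokens, as a causal mask does, and additionally hides the most influential earlier reasoning tokens for each prediction. A minimal sketch, assuming a precomputed saliency matrix and a fixed per-position budget; how the paper scores saliency and schedules the budget is not specified in the abstract, so both are placeholders here.

```python
def prefix_mask(saliency, reasoning_start, budget):
    """Build a boolean attention mask: allowed[i][j] is True if position i may attend to j.

    saliency[i][j] is a hypothetical influence score of token j on the prediction at i.
    Tokens before `reasoning_start` (e.g. visual and prompt tokens) are never hidden.
    """
    n = len(saliency)
    # Start from the standard causal mask: no attention to future positions.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        prefix = list(range(reasoning_start, i))  # earlier reasoning tokens only
        top = sorted(prefix, key=lambda j: saliency[i][j], reverse=True)[:budget]
        for j in top:
            allowed[i][j] = False  # hide the most salient reasoning cues
    return allowed

# Toy sequence: position 0 is a visual token, positions 1 to 3 are reasoning tokens.
sal = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.9, 0.4, 0.0, 0.0],
    [0.9, 0.2, 0.7, 0.0],
]
for row in prefix_mask(sal, reasoning_start=1, budget=1):
    print(row)
```

In the last row the visual token stays visible despite its high score, while the most salient reasoning token is hidden, so the prediction has to lean on visual evidence.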
115. 【2605.11634】Unlocking UML Class Diagram Understanding in Vision Language Models
链接:https://arxiv.org/abs/2605.11634
作者:Artem Naboichenko,René Peinl
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Vision Language Models, Language Models, Vision Language, regarding diagrams compared, compared to photos
备注:
Abstract:Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made on bar charts, line charts, and similar diagrams, there is still little research on other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
116. 【2605.11628】Single-Shot HDR Recovery via a Video Diffusion Prior
链接:https://arxiv.org/abs/2605.11628
作者:Chinmay Talegaonkar,Jinshi He,Christopher McKenna,Nicholas Antipa
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:HDR, Recent generative, Recent generative methods, HDR image, struggle with preserving
备注:
Abstract:Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
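The fusion step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the bracket frames are already linear in scene radiance and that the per-pixel weights are simply given (the abstract says they come from a lightweight UNet, which is not modeled here). The name `fuse_bracket`, the division by relative exposure time, and the toy values are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence


def fuse_bracket(
    frames: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    exposures: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Fuse an exposure bracket into one radiance estimate per pixel.

    frames[i][p]  : linear intensity of pixel p in bracket frame i (assumed)
    weights[i][p] : non-negative fusion weight of that sample (assumed given)
    exposures[i]  : relative exposure time of frame i (assumed known)
    """
    num_pixels = len(frames[0])
    fused = []
    for p in range(num_pixels):
        numerator = 0.0
        denominator = 0.0
        for frame, weight, exposure in zip(frames, weights, exposures):
            # Dividing by the exposure brings every frame to a common radiance scale.
            numerator += weight[p] * frame[p] / exposure
            denominator += weight[p]
        fused.append(numerator / (denominator + eps))
    return fused


# Toy bracket: two pixels, three exposures. Pixel 0 is clipped to 1.0 in the
# two longer exposures, so those samples get zero weight.
frames = [[0.5, 0.025], [1.0, 0.1], [1.0, 0.4]]
weights = [[1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
exposures = [0.25, 1.0, 4.0]
fused = fuse_bracket(frames, weights, exposures)
```

In this toy case the fused values stay close to the radiances used to build the bracket (2.0 and 0.1), because the clipped samples are ignored by the weights.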
117. 【2605.11622】RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction
链接:https://arxiv.org/abs/2605.11622
作者:Yaxuan Song,Jianan Fan,Tianyi Wang,Qiuyue Hu,Hang Chang,Heng Huang,Weidong Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Histopathology whole-slide images, defining pathological states, rich tissue morphology, lack direct molecular, direct molecular architecture
备注: 15 pages, 13 tables, 3 figures. Accepted by the Forty-Third International Conference on Machine Learning (ICML2026). Code is available at [this https URL](https://github.com/YXSong000/RNA-FM)
Abstract:Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
118. 【2605.11616】Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
链接:https://arxiv.org/abs/2605.11616
作者:Qirui Wang,Jingyi He,Yining Pan,Xulei Yang,Shijie Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Functional affordance grounding, affordance grounding requires, recognizing an object, button to press, localize the specific
备注:
Abstract:Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. The project homepage is available at the project page.
119. 【2605.11605】Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
链接:https://arxiv.org/abs/2605.11605
作者:Chaeyoung Jung,Kyeongha Rho,Joon Son Chung
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Omnimodal Large Language, Large Language Models, Language Models, Omnimodal Large, Large Language
备注:
Abstract:Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
120. 【2605.11596】HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
链接:https://arxiv.org/abs/2605.11596
作者:Conglang Zhang,Yifan Zhan,Qingjie Wang,Zhanpeng Ouyang,Yu Li,Zihao Yang,Xiaoyang Guo,Weiqiang Ren,Qian Zhang,Zhen Dong,Yinqiang Zheng,Wei Yin,Zhengqing Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Closed-loop driving simulation, pushing current driving, current driving world, short offline clips, requires real-time interaction
备注:
Abstract:Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
121. 【2605.11594】PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
链接:https://arxiv.org/abs/2605.11594
作者:Cheng Chi,Xianqi Wang,Hongcheng Luo,Mingfei Tu,Gangwei Xu,Zehan Zhang,Bing Wang,Guang Chen,Hangjun Ye,Sida Peng,Xin Yang,Haiyang Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:High-fidelity reconstruction, Gaussian Splatting, High-fidelity, crucial for autonomous
备注:
Abstract:High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.
122. 【2605.11591】Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration
链接:https://arxiv.org/abs/2605.11591
作者:Mingtao Xian,Yifeng Yang,Qinying Gu,Xinbing Wang,Nanyang Ye
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, multi-image cross-modal retrieval, severe position bias, Multimodal Large
备注:
Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines. Code is available at this https URL.
123. 【2605.11585】A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods
链接:https://arxiv.org/abs/2605.11585
作者:Shota Saito,Yuta Nakahara,Kohei Horinouchi,Naoki Ichijo,Manabu Kobayashi,Toshiyasu Matsushima
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:paper addresses, addresses the problem, lower bound, variational lower bound, grayscale images
备注:
Abstract:This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out some experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.
124. 【2605.11578】The Midas Touch for Metric Depth
链接:https://arxiv.org/abs/2605.11578
作者:Yu Ma,Zizhan Guo,Zuyi Xiong,Haoran Zhang,Yi Feng,Hongbo Zhao,Hanli Wang,Rui Fan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low computational efficiency, practical applicability remains, applicability remains limited, Recent advances, computational efficiency
备注:
Abstract:Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at this https URL.
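As background for the relative-to-metric conversion described here, the simplest baseline is a single global scale and shift fitted by least squares to a handful of sparse metric samples. The sketch below shows only that baseline and is not the segment-wise graph optimization or the geodesic refinement from the abstract. The function name `fit_scale_shift` and the assumption that relative and metric depth are affinely related are both illustrative.

```python
from typing import Sequence, Tuple


def fit_scale_shift(
    relative: Sequence[float], metric: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of metric ~= scale * relative + shift.

    relative : relative depth values at the sparse sample locations
    metric   : metric depth measured at the same locations (e.g. sparse 3D points)
    """
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Three sparse samples that follow metric = 2 * relative + 1 exactly.
scale, shift = fit_scale_shift([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
dense_relative = [0.5, 1.5]
dense_metric = [scale * d + shift for d in dense_relative]
```

A single global fit cannot correct the local scale inconsistencies the abstract mentions, which is presumably why the method fits segment by segment instead.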
125. 【2605.11572】TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
链接:https://arxiv.org/abs/2605.11572
作者:Seongah Kim,Dinh Phu Tran,Hyeontaek Hwang,Saad Wazir,Duc Do Minh,Daeyoung Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Text-Bridged Audio-Visual Adapter, temporally aligned audio, visual signals lack
备注: 12 pages, 6 figures
Abstract:Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic [...]. We propose to use text as a semantic anchor for audio-visual representation [...]. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where the proposed framework achieves state-of-the-art performance, demonstrating text as an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.
126. 【2605.11567】Dynamic Execution Commitment of Vision-Language-Action Models
链接:https://arxiv.org/abs/2605.11567
作者:Feng Chen,Xianghui Wang,Yuxuan Chen,Boying Li,Yefei He,Zeyu Zhang,Yicheng Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:single forward pass, reduce per-step latency, adopt action chunking, predominantly adopt action, consecutive low-level actions
备注: Code will be released in the next version
Abstract:Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
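The prefix-closed acceptance rule in this abstract (execute only the longest run of verified actions starting from the first one) can be sketched in a few lines. The verification flags are taken as given here. How A3 actually produces them through group sampling and re-decoding is not modeled, and the names `longest_verified_prefix`, `chunk`, and `verified_flags` are hypothetical.

```python
from typing import List, Sequence


def longest_verified_prefix(verified: Sequence[bool]) -> int:
    """Return how many leading actions are verified before the first failure."""
    horizon = 0
    for ok in verified:
        if not ok:
            break  # a single unverified action closes the prefix
        horizon += 1
    return horizon


# A predicted chunk of five actions and a hypothetical verification outcome.
chunk: List[str] = ["a0", "a1", "a2", "a3", "a4"]
verified_flags = [True, True, False, True, True]

horizon = longest_verified_prefix(verified_flags)
executed = chunk[:horizon]  # actions after the first failure are discarded
```

Under this rule the execution horizon varies with the model's state instead of being a fixed per-task constant: here only `a0` and `a1` would be executed before the model is queried again.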
127. 【2605.11563】TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
链接:https://arxiv.org/abs/2605.11563
作者:Sara Shoouri,Morteza Tavakoli Taba,Hun-Seok Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:State Space Models, State Space, long-range vision tasks, offering input-dependent recurrence, Space Models
备注:
Abstract:State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computation complexity up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
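To make the pole terminology concrete, the sketch below runs a first-order linear recurrence whose pole can change at every step. It illustrates one way to keep a token-dependent pole strictly inside the unit circle through bounded radius and angle modulation. The specific parameterization (a sigmoid-bounded radius, a tanh-bounded angle shift) and the names `modulate_pole` and `scan` are assumptions made for illustration, not the paper's formulation.

```python
import cmath
import math
from typing import List, Sequence


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def modulate_pole(
    base_radius: float,
    base_angle: float,
    radius_signal: float,
    angle_signal: float,
    max_radius: float = 0.99,
    max_angle_shift: float = 0.5,
) -> complex:
    """Turn a shared base pole into a token-dependent pole with |pole| < max_radius."""
    if not 0.0 < base_radius < max_radius:
        raise ValueError("base_radius must lie in (0, max_radius)")
    # Logit of base_radius / max_radius, so a zero signal reproduces the base radius.
    base_logit = math.log(base_radius / (max_radius - base_radius))
    radius = max_radius * _sigmoid(base_logit + radius_signal)
    angle = base_angle + max_angle_shift * math.tanh(angle_signal)
    return cmath.rect(radius, angle)


def scan(inputs: Sequence[float], poles: Sequence[complex]) -> List[float]:
    """Run h_t = p_t * h_{t-1} + x_t and return the real part of each state."""
    state = 0j
    outputs = []
    for x, pole in zip(inputs, poles):
        state = pole * state + x
        outputs.append(state.real)
    return outputs


impulse = [1.0, 0.0, 0.0, 0.0]
monotone = scan(impulse, [complex(0.5, 0.0)] * 4)       # positive real pole
alternating = scan(impulse, [complex(-0.5, 0.0)] * 4)   # negative real pole
oscillating = scan(impulse, [cmath.rect(0.9, math.pi / 2)] * 4)  # complex pole
```

The three impulse responses show the behaviors named in the abstract: monotone decay, sign-alternating decay, and a damped oscillation. In practice a complex pole is paired with its conjugate so the overall response is real.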
128. 【2605.11559】When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
链接:https://arxiv.org/abs/2605.11559
作者:Fanpu Cao,Xin Zou,Xuming Hu,Hui Xiong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:grounded question answering, mention nonexistent objects, generated responses contradict, large language models, responses contradict image
备注:
Abstract:Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at this https URL.
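One plausible reading of "Laplacian energy" for an attention map is the sum of squared responses of a discrete Laplacian over the image-patch grid, which is large when attention varies sharply between neighboring patches. The sketch below implements that reading with a 4-neighbor stencil and replicated borders. The exact definition used by LaSCD is not stated in the abstract, so the function `laplacian_energy` and its boundary handling are assumptions.

```python
from typing import Sequence


def laplacian_energy(attn: Sequence[Sequence[float]]) -> float:
    """Sum of squared 4-neighbor Laplacian responses over a 2D attention map."""
    rows = len(attn)
    cols = len(attn[0])

    def at(r: int, c: int) -> float:
        # Replicate the border so a constant map has zero energy everywhere.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return attn[r][c]

    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            lap = (
                at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1)
                - 4.0 * attn[r][c]
            )
            energy += lap * lap
    return energy


# A flat map has no high-frequency content; a single spike has a lot.
flat = [[0.25, 0.25], [0.25, 0.25]]
spike = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
flat_energy = laplacian_energy(flat)
spike_energy = laplacian_energy(spike)
```

Computing this value for the visual-attention map of every layer would give a per-layer profile of the kind the abstract describes using to select informative layers.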
129. 【2605.11555】ScribbleDose: Scribble-Guided Dose Prediction in Radiotherapy
链接:https://arxiv.org/abs/2605.11555
作者:Zhenxi Zhang,Yitao Zhuang,Yao Pu,Peixin Yu,Zirong Li,Yan Xia,Hui Li,Bin Li,Fuchen Zheng,Ge Ren
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:facilitate structure-dose coupling, provide explicit geometric, explicit geometric constraints, structure-dose coupling, widely adopted
备注: This is a preprint version of the paper. The final version will appear in the proceedings of MICCAI 2026
Abstract:Anatomical structure masks are widely adopted in radiotherapy dose prediction, as they provide explicit geometric constraints that facilitate structure-dose coupling. However, conventional manual delineation of these masks requires precise annotation of structure boundaries relevant to radiotherapy, which is time-consuming and labor-intensive. To address these limitations, we propose a scribble-guided dose prediction framework that relies solely on anatomical structures annotated with sparse scribbles. Specifically, we design a Scribble Completion Module (SCM) to generate dense anatomical masks by propagating sparse scribble labels to semantically similar voxels. During the propagation process, a supervoxel-based regularization is introduced to preserve geometric boundary consistency to ensure anatomical plausibility. Furthermore, we propose a Structure-Guided Dose Generation Module (SGDGM) to strengthen the correspondence between sparse structural cues and dose distribution. The completed dense masks derived from scribbles serve as structural guidance to condition dose prediction, forming a scribble-mask-dose learning pipeline under sparse annotation. Experiments on the GDP-HMM dataset demonstrate that ScribbleDose achieves competitive dose prediction performance using only sparse structural annotations. The source code and reannotated scribble annotations are publicly available at this https URL.
130. 【2605.11551】VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck
链接:https://arxiv.org/abs/2605.11551
作者:Aryan Gondkar,Hayder Radha,Yiming Deng
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
关键词:safety-critical applications, critical for safe, safe deployment, deployment of neural, neural networks
备注: 6 pages, 3 figures, Fall 2025 version
Abstract:Detecting out-of-distribution (OOD) samples is critical for safe deployment of neural networks in safety-critical applications. While maximum softmax probability (MSP) provides a simple baseline, it lacks theoretical grounding and suffers from miscalibration. We propose VNDUQE (VIB-based Novelty Detection and Uncertainty Quantification for Nondestructive Evaluation), which investigates novelty detection through the Deep Variational Information Bottleneck (VIB), which explicitly constrains information flow through learned representations. We train VIB models on MNIST with held-out digit classes and evaluate OOD detection using information-theoretic metrics: KL divergence and prediction entropy. Our results reveal complementary detection signals: KL divergence achieves perfect detection (100% AUROC on noise) on far-OOD samples (noise, domain shift), while prediction entropy excels at near-OOD detection (94.7% AUROC on novel digit classes). A parallel detection strategy combining both metrics achieves 95.3% average AUROC and 92% true positive rate at 5% false positive rate, which is a 32 percentage point improvement over baseline MSP (85.0% AUROC, 60.1% TPR). Compression via the information bottleneck principle ($\beta=10^{-3}$) reduces Expected Calibration Error by 38%, demonstrating that information-theoretic constraints produce fundamentally more reliable uncertainty estimates. These findings directly support active learning with expensive computational oracles, where well-calibrated novelty detection enables principled threshold selection for oracle queries.
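The two scores named in this abstract can be computed as follows, assuming the usual VIB setup in which the encoder outputs a diagonal Gaussian and the prior is a standard normal. The "parallel" combination is sketched as flagging a sample when either score exceeds its threshold. That OR rule, the thresholds, and all function names are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def gaussian_kl_to_standard_normal(
    mu: Sequence[float], log_var: Sequence[float]
) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_novel(
    mu: Sequence[float],
    log_var: Sequence[float],
    probs: Sequence[float],
    kl_threshold: float,
    entropy_threshold: float,
) -> bool:
    """Flag a sample if either score exceeds its (hypothetical) threshold."""
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    entropy = prediction_entropy(probs)
    return kl > kl_threshold or entropy > entropy_threshold


# A latent code far from the prior trips the KL test even though the
# classifier is confident; an ambiguous prediction trips the entropy test.
far_ood = is_novel([4.0, -4.0], [0.0, 0.0], [0.98, 0.01, 0.01], 5.0, 0.8)
near_ood = is_novel([0.1, 0.0], [0.0, 0.0], [0.4, 0.3, 0.3], 5.0, 0.8)
in_dist = is_novel([0.1, 0.0], [0.0, 0.0], [0.98, 0.01, 0.01], 5.0, 0.8)
```

The example mirrors the complementary behavior reported in the abstract: the KL score reacts to far-OOD inputs, while the entropy score reacts to near-OOD inputs that stay close to the prior.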
131. 【2605.11550】The DAWN of World-Action Interactive Models
链接:https://arxiv.org/abs/2605.11550
作者:Hongbo Lu,Liang Yao,Chenghao He,Haoyu Wang,Xiang Gu,Xianfei Li,Wenlong Liao,Tao He,Pai Peng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:good maneuver depends, maneuver depends, good maneuver, scene evolution depends
备注:
Abstract:A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.
132. 【2605.11541】GeoR-Bench: Evaluating Geoscience Visual Reasoning
链接:https://arxiv.org/abs/2605.11541
作者:Yushuo Zheng,Zicheng Zhang,Huiyu Duan,Chunyi Li,Zijian Chen,Ziheng Jia,Yue Shi,Ke Gu,Xiongkuo Min,Guangtao Zhai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:support human decision-making, expected to understand, disaster response, climate adaptation, environmental protection
备注:
Abstract:Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source models only reach 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.
133. 【2605.11533】Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Link: https://arxiv.org/abs/2605.11533
Authors: Sike Xiang, Shuang Chen, Kevin Qinghong Lin, Jialin Yu, Yijia Sun, Philip Torr, Amir Atapour-Abarghouei
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: combine page layouts, Clinical check-up reports, check-up reports, numerical biomarkers, abnormality flags
Abstract:Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging-related evidence, and physician summaries. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
134. 【2605.11521】XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions
Link: https://arxiv.org/abs/2605.11521
Authors: Chih-Hsin Chen, Yu-Tung Liu, Amar Fadillah, Kuan-Ting Lai, Dong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: intelligent transportation systems, Federal Highway Administration, transportation systems remain, systems remain vulnerable, Autonomous driving
Abstract:Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP$_{50}$ scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.
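Note: the relative improvements quoted in this abstract can be re-derived from the mAP50 scores it reports. The short Python check below uses only the numbers stated in the abstract; the variable and function names are our own and are not taken from the paper or its code release.

```python
# Reported mAP50 (%) from the abstract:
# (detector trained on XWOD, published YOLO-based baseline).
reported = {
    "RTTS": (63.00, 40.37),
    "DAWN": (59.94, 32.75),
    "WEDGE": (61.12, 45.41),
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gains = {name: relative_gain(new, old) for name, (new, old) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.0f}% relative improvement")
```

Rounded to whole percentages, the three values agree with the 56%, 83%, and 35% given in the abstract.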
135. 【2605.11520】PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
Link: https://arxiv.org/abs/2605.11520
Authors: Yixiao Song, Qingyong Li, Wen Wang, Zhicheng Yan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: embodied artificial intelligence, point-level annotations required, fully supervised methods, dense point-level annotations, autonomous driving
Comments: Accepted by Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
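Note: the last step described in this abstract, assigning point semantics by nearest-neighbor search on labeled Gaussians, can be illustrated with a brute-force sketch. This is only a toy illustration under our own assumptions: the function name `transfer_labels`, the coordinates, and the labels are hypothetical, and the paper's registration step and any spatial index it may use are not shown.

```python
import math


def transfer_labels(points, gaussian_centers, gaussian_labels):
    """Give each point the label of its nearest labeled Gaussian center.

    Brute-force search for clarity; a real pipeline would likely use a
    spatial index such as a k-d tree (assumption, not stated in the abstract).
    """
    labels = []
    for p in points:
        nearest = min(
            range(len(gaussian_centers)),
            key=lambda i: math.dist(p, gaussian_centers[i]),
        )
        labels.append(gaussian_labels[nearest])
    return labels


# Hypothetical toy data: three labeled Gaussian centers and three query points.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
center_labels = ["floor", "chair", "wall"]
points = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.8, 0.9)]

point_labels = transfer_labels(points, centers, center_labels)
print(point_labels)
```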
136. 【2605.11508】LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing
Link: https://arxiv.org/abs/2605.11508
Authors: Yongcong Wang, Chengchao Shen, Guangwei Gao, Wei Wang, Pengwen Dai, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: video dehazing due, video dehazing methods, video dehazing, processing continuous UHD, continuous UHD sequences
Comments: 10 pages, 5 figures
Abstract:Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3--5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial--color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at this https URL.
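Note: the abstract mentions fusing coefficients in the gl(3) Lie algebra and mapping the result to invertible GL(3) transforms via a Cayley parameterization. The sketch below illustrates that general idea with the textbook Cayley map (I - A)^{-1}(I + A). It is written under our own assumptions: the matrices `A_spatial` and `A_temporal`, the equal-weight fusion, and all helper names are hypothetical, and the paper's exact parameterization and regularization may differ.

```python
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def combine(a, b, wa, wb):
    """Entrywise wa * a + wb * b for 3x3 matrices."""
    return [[wa * a[i][j] + wb * b[i][j] for j in range(3)] for i in range(3)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate; raises if (near-)singular."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("matrix is singular")
    cof = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            rows = [r for r in range(3) if r != i]
            cols = [c for c in range(3) if c != j]
            minor = (m[rows[0]][cols[0]] * m[rows[1]][cols[1]]
                     - m[rows[0]][cols[1]] * m[rows[1]][cols[0]])
            cof[i][j] = (-1) ** (i + j) * minor
    # The inverse is the transposed cofactor matrix divided by the determinant.
    return [[cof[j][i] / d for j in range(3)] for i in range(3)]


def cayley(a):
    """Textbook Cayley map (I - A)^{-1} (I + A).

    The result is invertible whenever both I - A and I + A are invertible,
    which holds for any A with small enough entries.
    """
    return matmul(inv3(combine(I3, a, 1.0, -1.0)), combine(I3, a, 1.0, 1.0))


# Hypothetical algebra-space coefficients from two sub-grids.
A_spatial = [[0.10, 0.02, 0.00], [0.00, -0.05, 0.01], [0.03, 0.00, 0.08]]
A_temporal = [[-0.04, 0.00, 0.02], [0.01, 0.06, 0.00], [0.00, -0.02, 0.02]]

# Fusion happens in the algebra, which is a vector space, so any weighted
# sum is still a valid element; equal weights are an arbitrary choice here.
A_fused = combine(A_spatial, A_temporal, 0.5, 0.5)
T = cayley(A_fused)
print("det(T) =", det3(T))
```

One reason to fuse in the algebra rather than on the transforms themselves is that a weighted sum of invertible matrices is not guaranteed to be invertible, whereas a weighted sum of algebra elements can always be mapped back through the Cayley map as long as the conditions above hold.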
137. 【2605.11506】Principled Design of Diffusion-based Optimizers for Inverse Problems
Link: https://arxiv.org/abs/2605.11506
Authors: Julio Oscanoa, Irmak Sivgin, Cagan Alkan, Daniel Ennis, John Pauly, Mert Pilanci, Shreyas Vasanawala
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Score-based diffusion models, diffusion models achieve, Score-based diffusion, long inference times, performance for inverse
Comments: 22 pages, 8 figures, 6 tables
Abstract:Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.
138. 【2605.11497】PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
Link: https://arxiv.org/abs/2605.11497
Authors: Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: encode joint-coordinate sequences, Zero-shot skeleton-based action, Zero-shot skeleton-based, skeleton-based action recognition, classify unseen actions
Abstract:Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.
139. 【2605.11494】STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
Link: https://arxiv.org/abs/2605.11494
Authors: Ankit Yadav, Arpit Garg, Ta Duc Huy, Lingqiao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: exhibit reduced sample, real-time image generation, reduced sample diversity, sample diversity compared, enable real-time image
Comments: 11 pages, 3 figures, 4 tables
Abstract:Distilled one-step (T=1) or few-step (T$\leq$4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.
140. 【2605.11492】A Mimetic Detector for Adversarial Image Perturbations
Link: https://arxiv.org/abs/2605.11492
Authors: Johnny Corbino
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Adversarial attacks fool, attacks fool deep, invisible noise patterns, fool deep image, Adversarial attacks
Abstract:Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard $\ell^\infty$-bounded attacks (FGSM, PGD, and the $\ell^\infty$ variant of Carlini-Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in $\ell^2$, but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino-Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in $O(HW)$ time. We validate the detector on the standard peppers test image at the canonical $\ell^\infty$ budget $\varepsilon = 16/255$ and observe a clean-vs-adversarial separation that grows monotonically from $3.55\times$ at order $k=2$ to $4.19\times$ at $k=6$.
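Note: the observation that sign-pattern perturbations carry disproportionate gradient energy can be illustrated on a synthetic image. The sketch below is a toy under stated assumptions: it swaps the paper's high-order mimetic operators (from the MOLE library) for plain first-order forward differences, uses a smooth synthetic image rather than the peppers image, and uses random signs as a stand-in for a real attack. It therefore does not reproduce the separation figures quoted in the abstract, and all names are our own.

```python
import math
import random


def gradient_energy(img):
    """Mean squared forward difference over both axes of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += (img[y][x + 1] - img[y][x]) ** 2
                count += 1
            if y + 1 < h:
                total += (img[y + 1][x] - img[y][x]) ** 2
                count += 1
    return total / count


random.seed(0)
H = W = 32
EPS = 16 / 255  # the l-infinity budget mentioned in the abstract

# Smooth synthetic "clean" image with values in [0.25, 0.75].
clean = [
    [0.5 + 0.25 * math.sin(2 * math.pi * x / W) * math.cos(2 * math.pi * y / H)
     for x in range(W)]
    for y in range(H)
]

# Stand-in perturbation: a random +/- EPS sign pattern, clipped to [0, 1].
adversarial = [
    [min(1.0, max(0.0, v + EPS * random.choice((-1.0, 1.0)))) for v in row]
    for row in clean
]

ratio = gradient_energy(adversarial) / gradient_energy(clean)
print(f"gradient-energy ratio (perturbed / clean): {ratio:.1f}")
```

Each pixel is visited once, so the statistic costs O(HW) time, which matches the complexity stated in the abstract.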
141. 【2605.11489】3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering
Link: https://arxiv.org/abs/2605.11489
Authors: Yibo Zhao, Fan Gao, Youcheng Cai, Ligang Liu
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, enables high-quality real-time, Aware Super Sampling, high-quality real-time, latency-sensitive applications
Abstract:3D Gaussian Splatting (3DGS) enables high-quality real-time 3D rendering but faces challenges in efficiently scaling to ultra-dense scenes and high resolutions due to computational bottlenecks that limit its use in latency-sensitive applications. Instead of optimizing the splatting pipeline itself, we propose 3DGS$^3$, a unified post-rendering framework that jointly performs super sampling and frame interpolation through differentiable processing of low-resolution outputs to achieve both high-resolution and high-frame-rate rendering. Our Gradient-Aware Super Sampling (GASS) module leverages the continuous differentiability of 3DGS to extract image gradients that guide a GRU-based refinement network to enable high-fidelity super sampling. Furthermore, a Lightweight Temporal Frame Interpolation (LTFI) module based on a compact U-Net-like backbone fuses temporal and differentiable spatial cues from consecutive frames to synthesize temporally coherent intermediate frames. Experiments on public datasets demonstrate that 3DGS$^3$ achieves superior rendering efficiency and visual quality when compared with state-of-the-art methods and remains compatible with existing 3DGS acceleration techniques. The code will be publicly released upon acceptance.
142. 【2605.11477】LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs
Link: https://arxiv.org/abs/2605.11477
Authors: Jingfeng Chen, Jiawen Qian, Wendi Deng, Yinuo Guo, Jiaqi Yu, Sicong Leng, Raghuveer Thirukovalluru, Bhuwan Dhingra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, limited visual-token budgets, requires selecting informative, large language models, language models requires
Comments: 21 pages, 4 figures
Abstract:Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
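Note: query-aware DPP frame selection, as mentioned in this abstract, trades off relevance against redundancy. The sketch below shows the textbook greedy MAP approximation for a DPP with a quality-diversity kernel L[i][j] = q_i * <f_i, f_j> * q_j. It is an assumption-laden illustration only: it does not implement the paper's linear-time variant, its Group DPP importance metric, or its dynamic resolution allocation, and the features, relevance scores, and function names are hypothetical.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def greedy_dpp(features, relevance, k):
    """Greedily pick up to k items, each time maximizing det of the selected sub-kernel."""
    n = len(features)
    L = [[relevance[i] * dot(features[i], features[j]) * relevance[j]
          for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item adds volume
            break
        selected.append(best)
    return selected


# Hypothetical frames: frame 1 is a near-duplicate of frame 0 and has higher
# relevance than frame 2, which shows different content.
features = [(1.0, 0.0), (0.99, 0.141), (0.0, 1.0)]
relevance = [1.0, 0.95, 0.6]

selected = greedy_dpp(features, relevance, k=2)
print(selected)
```

In this toy case the selection keeps frame 0 and then prefers the less relevant but non-redundant frame 2 over the near-duplicate frame 1, which is the behaviour point-wise relevance scoring alone would miss.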
143. 【2605.11475】Deep Probabilistic Unfolding for Quantized Compressive Sensing
Link: https://arxiv.org/abs/2605.11475
Authors: Gang Qu, Ping Wang, Siming Zheng, Xin Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: deep probabilistic unfolding, probabilistic unfolding model, classical quantized compressive, compressive sensing problem, quantized compressive sensing
Abstract:We propose a deep probabilistic unfolding model for the classical quantized compressive sensing problem, leveraging an unfolding framework to improve reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply an L2 projection to measurements, we derive a closed-form, numerically stable likelihood gradient projection that lets the model respect the true quantization physics, turning the hard quantization constraint into soft probabilistic guidance. Furthermore, an efficient dual-domain Mamba module is specifically designed to dynamically capture and fuse multi-scale local and global features, ensuring interactions between distant but correlated regions. Extensive experiments demonstrate state-of-the-art performance of the proposed method over previous works, which could promote the application of quantized compressive sensing in practice.
144. 【2605.11463】Encore: Conditioning Trajectory Forecasting via Biased Ego Rehearsals
Link: https://arxiv.org/abs/2605.11463
Authors: Conghao Wong, Ziqian Zou, Xinge You
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Learning and representing, challenging but crucial, crucial problem, trajectory prediction task, Learning
Abstract:Learning and representing the subjectivities of agents has become a challenging but crucial problem in the trajectory prediction task. Such subjectivities not only present specific spatial or temporal structures, but also are anisotropic for all interaction participants. Despite great efforts, it remains difficult to explicitly learn and forecast these subjectivities, let alone further modulate models' predictions through a specific ego's subjectivity. Inspired by prefactual thoughts in psychology and relevant theatrical concepts, we interpret such subjectivities in future trajectories as the continuous process from rehearsal to encore. In the rehearsal phase, the proposed ego predictor focuses on how each ego agent learns to derive and direct a set of explicitly biased rehearsal trajectories for all participants in the scene from the short-term observations. Then, these rehearsal trajectories serve as immediate controls to condition final predictions, providing direct yet distinct ego biases for the prediction network to simulate agents' various subjectivities. Experiments across datasets not only demonstrate a consistent improvement in the performance of the proposed Encore trajectory prediction model but also provide clear interpretability regarding subjectivities as biased ego rehearsals.
145. 【2605.11462】SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
Link: https://arxiv.org/abs/2605.11462
Authors: Zishan Liu, Ruoxi Zang, Yanglin Zhang, Wei Liu, Yin Zhang, Jian Yao, Jiayin Zheng, Zhengzhe Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, models consistently struggle, exceptional semantic understanding, precise coordinate grounding, demonstrated exceptional semantic
Abstract:Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
146. 【2605.11459】Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Link: https://arxiv.org/abs/2605.11459
Authors: Yanyan Zhang, Chaoda Song, Vikash Singh, Xinpeng Li, Kai Ye, Zhe Hu, Zhongzhu Pu, Yu Yin, Vipin Chaudhary
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: achieve remarkable flexibility, classical control paradigms, models achieve remarkable, achieve remarkable, remarkable flexibility
Abstract:Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
147. 【2605.11444】Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
Link: https://arxiv.org/abs/2605.11444
Authors: Eunho Lee, Youngbae Hwang, Rei Kawakami
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recover clean images, seeks to recover, recover clean, image restoration seeks, unknown degradations
Abstract:All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.
148. 【2605.11439】Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment
Link: https://arxiv.org/abs/2605.11439
Authors: Armin Zarbaft, Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: significantly hinder decision-making, accurate situational awareness, Rapid and accurate, hinder decision-making, accurate situational
Comments: Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
Abstract:Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.
149. 【2605.11438】Beyond Masks: The Case for Medical Image Parsing
Link: https://arxiv.org/abs/2605.11438
Authors: Siddharth Gupta, Alan L. Yuille, Zongwei Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical imaging research, producing per-voxel masks, spent a decade, Medical imaging, per-voxel masks
Abstract:Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.
150. 【2605.11435】ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models
Link: https://arxiv.org/abs/2605.11435
Authors: Hai Jiang, Zhen Liu, Yinjie Lei, Songchen Han, Bing Zeng, Shuaicheng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: zero-reference diffusion-based framework, low-quality degraded images, adaptive illumination correction, illumination degradation image, diffusion-based framework
Comments: Accepted by CVPR 2026
Abstract:In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at this https URL.
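Note: spatially varying gamma correction, the operation behind the adaptive gamma correction module described in this abstract, applies a different exponent at each pixel. The sketch below shows only that per-pixel operation under our own assumptions: the gamma map is set by hand here, whereas the paper's module learns it, and the function name and toy values are hypothetical.

```python
def gamma_correct(image, gamma_map):
    """Apply out = in ** gamma per pixel; intensities are assumed to lie in [0, 1].

    gamma < 1 brightens a pixel, gamma > 1 darkens it, gamma == 1 leaves it unchanged.
    """
    return [
        [pixel ** g for pixel, g in zip(row, gamma_row)]
        for row, gamma_row in zip(image, gamma_map)
    ]


# Hypothetical 2x2 image: a dark top row and a brighter bottom row.
image = [[0.04, 0.25], [0.81, 0.50]]
# Hand-set gamma map: brighten the top row, darken one pixel, keep the last.
gamma_map = [[0.5, 0.5], [2.0, 1.0]]

corrected = gamma_correct(image, gamma_map)
print(corrected)
```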
151. 【2605.11430】Diabetic Retinopathy Classification using Downscaling Algorithms and Deep Learning
Link: https://arxiv.org/abs/2605.11430
Authors: Nishi Doshi, Urvi Oza, Pankaj Kumar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Diabetic Retinopathy, Indian Diabetic Retinopathy, Diabetic Retinopathy Image, science of recording, Deep Learning Network
Comments:
Click to view abstract
Abstract:Diabetic Retinopathy (DR) is an art and science of recording and classifying the retinal images of a diabetic patient. DR classification deals with classifying retinal fundus images into five stages on the basis of severity of diabetes. One of the major issues faced when dealing with the DR classification problem is the large and varying size of images. In this paper, we propose and explore the use of several downscaling algorithms before feeding the image data to a Deep Learning Network for classification. To improve training and testing, we amalgamate two datasets: Kaggle and the Indian Diabetic Retinopathy Image Dataset. Our experiments have been performed on a novel Multi Channel Inception V3 architecture with a unique self-crafted preprocessing phase. We report results of the proposed approach using accuracy, specificity, and sensitivity, which outperform the previous state-of-the-art methods. Index Terms: Diabetic Retinopathy, Downscaling Algorithms, Multichannel CNN Architecture, Deep Learning
152. 【2605.11427】PD-4DGS:Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming
Link: https://arxiv.org/abs/2605.11427
Authors: Jiachen Li, Guangzhi Han, Jin Wan, Delong Han, Yuan Gao, Min Li, Mingle Zhou, Gang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enables high-quality dynamic, causing black-screen waits, modern adaptive-bitrate delivery, Gaussian Splatting, current models remain
Comments:
Click to view abstract
Abstract:4D Gaussian Splatting (4DGS) enables high-quality dynamic novel view synthesis, yet current models remain monolithic bitstreams that clients must download in full before any frame can be rendered, causing black-screen waits of tens to hundreds of seconds on mobile bandwidth and leaving 4DGS incompatible with modern adaptive-bitrate delivery. Progressive 3DGS compression alleviates this for static scenes, but it acts only on spatial anchors and cannot partition the temporal deformation networks that dominate dynamic-scene size. We present PD-4DGS, the first framework for progressive compression and on-demand transmission of 4DGS. Hierarchical Deformation Decomposition (HDD) externalises the coarse-to-fine motion hierarchy already latent in 4DGS into three independently transmittable layers -- a static scaffold, a global deformation, and a local refinement -- so that any prefix of the bitstream is already renderable, turning a single training run into a scalable, DASH/HLS-compatible bitstream. A Gaussian-entropy attribute rate-distortion loss together with a temporal mask consistency regulariser shrink the base layer while suppressing low-bitrate flicker; a capacity-weighted rollout schedule, gated online by a learnt activation rate rho, then prevents deformation-network under-training without any per-scene hyperparameter. On the Dycheck iPhone benchmark, PD-4DGS cuts the streamed bitstream by 60% at matched rendering fidelity and reduces first-frame latency from 73--930 s to ~1.7 s on a 2 Mbps link, uniquely enabling true on-demand progressive streaming for 4DGS.
153. 【2605.11424】VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors
Link: https://arxiv.org/abs/2605.11424
Authors: Jimin Tang, Wenyuan Zhang, Junsheng Zhou, Zian Huang, Kanle Shi, Shenkun Xu, Yu-Shen Liu, Zhizhong Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, achieved remarkable progress, exhibits notable degradation, Splatting has achieved, achieved remarkable
Comments: Accepted by SIGGRAPH Conference 2026. Project Page: [this https URL](https://tangjm24.github.io/VidSplat)
Click to view abstract
Abstract:Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D consistent generation, we elaborate a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence weighted refinement. VidSplat performs robustly to sparse input and even a single image. Extensive experiments on widely used benchmarks demonstrate our superior performance in sparse-view scene reconstruction.
154. 【2605.11385】JACoP: Joint Alignment for Compliant Multi-Agent Prediction
Link: https://arxiv.org/abs/2605.11385
Authors: Qingze Liu, Alen Mrdovic, Danrui Li, Mathew Schwartz, Sejong Yoon, Mubbasir Kapadia
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Stochastic Human Trajectory, Stochastic Human, Human Trajectory Prediction, area of research, Human Trajectory
Comments: Accepted by CVPRF 2026
Click to view abstract
Abstract:Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state-of-the-art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real-world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi-Agent Prediction, an innovative multi-stage framework that ensures scene-level plausibility. JACoP incorporates an Anchor-Based Agent-Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter-agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.
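The joint-selection idea in this abstract, choosing one candidate trajectory per agent so that the scene as a whole has low energy, can be sketched as a small MRF with unary and pairwise potentials. The sketch below is not JACoP's aligner: it assumes a fixed collision penalty as the pairwise potential and uses exhaustive search for the minimum-energy assignment, which is only feasible for a handful of agents. The names `collision_cost` and `joint_select`, and the `radius` and `penalty` values, are hypothetical.

```python
import itertools
import math


def collision_cost(traj_a, traj_b, radius=0.5, penalty=10.0):
    """Pairwise potential: add a fixed penalty for every timestep at which
    the two agents are closer than `radius`."""
    return sum(penalty for p, q in zip(traj_a, traj_b) if math.dist(p, q) < radius)


def joint_select(candidates, unary):
    """Pick one candidate per agent by minimizing total MRF energy.

    candidates[i][k] is the k-th candidate trajectory of agent i,
    given as a list of (x, y) points. unary[i][k] is its individual cost.
    Returns (assignment, energy), where assignment[i] is the chosen index.
    """
    best_assignment, best_energy = None, math.inf
    index_ranges = (range(len(agent_cands)) for agent_cands in candidates)
    for assignment in itertools.product(*index_ranges):
        # Unary terms: how unlikely each agent's chosen trajectory is on its own.
        energy = sum(unary[i][k] for i, k in enumerate(assignment))
        # Pairwise terms: penalize collisions between every pair of agents.
        for i, j in itertools.combinations(range(len(assignment)), 2):
            energy += collision_cost(
                candidates[i][assignment[i]], candidates[j][assignment[j]]
            )
        if energy < best_energy:
            best_assignment, best_energy = assignment, energy
    return best_assignment, best_energy
```

In this toy setup, two agents whose individually best trajectories cross at the same point would be steered to an assignment in which one of them takes its second-best candidate, since the collision penalty outweighs the small unary cost. The abstract also mentions sampling from the joint distribution, which this sketch does not cover.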
155. 【2605.11383】HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels
Link: https://arxiv.org/abs/2605.11383
Authors: Ningkang Peng, Jingyang Mao, Qianfeng Yu, Xiaoqian Peng, Peirong Ma, Yanhui Gu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: data mining tasks, deep neural networks, large-scale visual recognition, labels severely undermines, mining tasks
Comments:
Click to view abstract
Abstract:In large-scale visual recognition and data mining tasks, the presence of noisy labels severely undermines the generalization capability of deep neural networks (DNNs). Prevalent sample selection methods rely primarily on training loss or prediction confidence for passive screening. However, within a feature space degraded by noise, decision boundaries undergo systematic boundary collapse. This phenomenon hinders the ability of the model to distinguish between hard clean samples and noisy samples at the decision margins, thereby creating a significant performance bottleneck. This study is the first to emphasize the pivotal importance of active boundary restoration for noise-robust learning. We propose HamBR, a novel paradigm based on Hamiltonian dynamics. The core approach leverages the Spherical Hamiltonian Monte Carlo (Spherical HMC) mechanism to actively probe inter-class ambiguous regions within the representation space and synthesize high-quality virtual outliers. By imposing explicit repulsion constraints via energy-based modeling, these synthesized samples establish robust energy barriers at the decision boundaries. This mechanism forces real samples to move from dispersed overlapping regions toward their respective class centers, thereby restoring the discriminative sharpness of the decision boundaries. HamBR demonstrates exceptional versatility and can be integrated as a plug-and-play defense module into existing semi-supervised noisy label learning frameworks. Empirical evaluations show that the proposed paradigm significantly enhances the discriminative accuracy of hard boundary samples, achieving state-of-the-art (SOTA) performance on CIFAR-10/100 and real-world noise benchmarks. Furthermore, it exhibits superior convergence efficiency and reliable robustness, while significantly improving the model's capability for Out-of-Distribution (OOD) detection.
156. 【2605.11369】Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers
Link: https://arxiv.org/abs/2605.11369
Authors: Sanghyeok Nam, Byoungjun Kim, Daehyung Park, Tae-Kyun Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating physically plausible, Generating physically, HOI, static HOI motions, existing HOI datasets
Comments: CVPR Findings 2026
Click to view abstract
Abstract:Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly because existing HOI datasets are limited to static interactions and pretrained agents are capable of either dynamic full-body motions without objects or static HOI motions, but not both. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, but are limited to either static or short-term contacts, e.g., striking. In this work, we propose a framework that achieves dynamic and long-term interaction motions, such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation. This plans dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Our method consistently improves success rates over relevant prior art while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance in significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.
157. 【2605.11367】3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
Link: https://arxiv.org/abs/2605.11367
Authors: Yifan Yin, Zehao Wen, Jieneng Chen, Zehan Zheng, Nanru Dai, Haojun Shi, Suyu Ye, Aydan Huang, Zheyuan Zhang, Alan Yuille, Jianwen Xie, Ayush Tewari, Tianmin Shu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, highlighted the promise, promise of learning, world, learning generative world
Comments:
Click to view abstract
Abstract:Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.
158. 【2605.11363】PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Link: https://arxiv.org/abs/2605.11363
Authors: Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Presentation, presentation video, presentation video generation, interactive delivery, moving beyond static
Comments:
Click to view abstract
Abstract:Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: this https URL. Website: this https URL.
159. 【2605.11354】Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Link: https://arxiv.org/abs/2605.11354
Authors: Haoyu Zhang, Zeyu Zhang, Zedong Zhou, Yang Zhao, Hao Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: offering strong performance, challenging visual conditions, offering strong, visual conditions, powerful paradigm
Comments:
Click to view abstract
Abstract:Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: this https URL. Website: this https URL.
160. 【2605.11347】Gradient-Free Noise Optimization for Reward Alignment in Generative Models
Link: https://arxiv.org/abs/2605.11347
Authors: Jeongsol Kim, Hongeun Kim, Jian Wang, Jong Chul Ye
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: multi-step stochastic trajectories, flow models rely, reward alignment methods, Existing reward alignment, stochastic trajectories
Comments:
Click to view abstract
Abstract:Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein--Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
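To make the idea of gradient-free noise optimization concrete, here is a generic zeroth-order search that only calls a black-box reward. It is a sketch in the spirit of path-integral control estimators: random perturbations of the current noise are averaged with exponential (softmax) weights on their rewards. This is not ZeNO's actual update, which the abstract does not spell out, and the function name, the parameters `sigma` and `temperature`, and the toy reward are all assumptions made for illustration.

```python
import math
import random


def zeroth_order_noise_search(reward_fn, z0, steps=50, samples=16,
                              sigma=0.3, temperature=0.1, seed=0):
    """Improve a noise vector using reward evaluations only (no gradients).

    At each step, draw `samples` Gaussian perturbations of the current
    noise, score each perturbed noise with `reward_fn`, and move by the
    reward-weighted average perturbation.
    """
    rng = random.Random(seed)
    z = list(z0)
    dim = len(z)
    for _ in range(steps):
        perturbations = [
            [rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(samples)
        ]
        rewards = [
            reward_fn([zi + ei for zi, ei in zip(z, eps)]) for eps in perturbations
        ]
        # Subtract the max reward before exponentiating for numerical stability.
        r_max = max(rewards)
        weights = [math.exp((r - r_max) / temperature) for r in rewards]
        total = sum(weights)
        z = [
            zi + sum(w * eps[d] for w, eps in zip(weights, perturbations)) / total
            for d, zi in enumerate(z)
        ]
    return z


# Toy black-box reward: closeness of the noise to a hypothetical target vector.
target = [1.0, 1.0, 1.0, 1.0]
toy_reward = lambda z: -sum((a - b) ** 2 for a, b in zip(z, target))
optimized = zeroth_order_noise_search(toy_reward, [-1.0] * 4)
print(toy_reward([-1.0] * 4), toy_reward(optimized))
```

In a real pipeline the reward would wrap the generator and a scoring model, so nothing in the loop needs to be differentiable. A lower `temperature` concentrates the update on the best perturbations, while a higher one averages more broadly.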
161. 【2605.11314】Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort
Link: https://arxiv.org/abs/2605.11314
Authors: Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: lifelong physical disability, Rodda and Graham, Instrumented Gait Analysis, disability in childhood, neurological disorder
Comments: 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format
Click to view abstract
Abstract:Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
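The agreement metrics reported in this abstract, $R^2$ and CCC, can be computed as sketched below. CCC is taken here to be Lin's concordance correlation coefficient, $2 s_{xy} / (s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2)$, and $R^2$ to be the coefficient of determination of the predictions against the reference. Both readings are assumptions, since the abstract does not define them, and the z-score values in the example are made up.

```python
def concordance_cc(pred, ref):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson correlation, it penalizes systematic offset and scale
    differences between predictions and reference values.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_r = sum((r - mean_r) ** 2 for r in ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref)) / n
    return 2 * cov / (var_p + var_r + (mean_p - mean_r) ** 2)


def r_squared(pred, ref):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_r = sum(ref) / len(ref)
    ss_res = sum((r - p) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - mean_r) ** 2 for r in ref)
    return 1 - ss_res / ss_tot


# Made-up z-scores: the video-based predictions are biased upward by exactly 1.
reference = [0.0, 1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 3.0, 4.0]
print(concordance_cc(predicted, reference), r_squared(predicted, reference))
```

In this example the predictions are perfectly correlated with the reference, yet the constant offset pulls CCC down to about 0.71 and $R^2$ down to 0.2, which is why agreement metrics are preferred over plain correlation when comparing against a gold-standard measurement.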
162. 【2605.11311】Couple to Control: Joint Initial Noise Design in Diffusion Models
Link: https://arxiv.org/abs/2605.11311
Authors: Jing Jia, Liyue Shen, Guanyang Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
Keywords: models typically generate, Diffusion models typically, typically generate image, generate image batches, typically generate
Comments: 26 pages
Click to view abstract
Abstract:Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.
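The key property described here, noises that are each marginally standard Gaussian but dependent across the batch, can be demonstrated with an equicorrelated coupling. The sketch below is one simple construction and is not claimed to be the paper's repulsive Gaussian coupling: for each coordinate, the $n$ samples share a pairwise correlation $\rho$, which must satisfy $-1/(n-1) \le \rho \le 1$ for the covariance to be valid. The function name and parameters are hypothetical.

```python
import math
import random


def coupled_gaussian_noise(n_samples, dim, rho, seed=0):
    """Draw n_samples noise vectors of length dim.

    Each coordinate of each vector is marginally N(0, 1), and any two
    vectors have correlation rho in every coordinate. A negative rho
    pushes the samples apart; rho = 0 recovers independent noise.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to couple")
    if not (-1.0 / (n_samples - 1) <= rho <= 1.0):
        raise ValueError("rho must lie in [-1/(n_samples-1), 1]")
    rng = random.Random(seed)
    # z_i = a * (eps_i - mean) + c * mean gives Var(z_i) = 1 and Cov(z_i, z_j) = rho.
    a = math.sqrt(1.0 - rho)
    c = math.sqrt(max(0.0, 1.0 + (n_samples - 1) * rho))
    noises = [[0.0] * dim for _ in range(n_samples)]
    for d in range(dim):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        mean = sum(eps) / n_samples
        for i in range(n_samples):
            noises[i][d] = a * (eps[i] - mean) + c * mean
    return noises


# Most negative valid correlation for a gallery of 4: the noises sum to zero.
gallery = coupled_gaussian_noise(n_samples=4, dim=3, rho=-1.0 / 3)
print([sum(z[d] for z in gallery) for d in range(3)])
```

Since every vector still looks like ordinary standard Gaussian noise on its own, a pretrained model can consume each one unchanged. Only the relationship between the samples in a gallery is altered, and the sampling cost is the same as for independent draws.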
163. 【2605.11307】Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
Link: https://arxiv.org/abs/2605.11307
Authors: Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: recover the structure, generation, executable, code, executable reference code
Comments: Project page: [this https URL](https://image2code.github.io/vision2code/)
Click to view abstract
Abstract:Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at this https URL.
164. 【2605.11304】CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
Link: https://arxiv.org/abs/2605.11304
Authors: Eva Prakash, Yunhe Gao, Chong Wang, Justin Xu, Neal Prakash, Arne Michalson, Seena Dehkharghani, Eun Kyoung Hong, Julie Bauml, Roger Boodoo, Jean-Benoit Delbrouck, Sophie Ostmeier, Curtis Langlotz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: radiograph interpretation requires, static image-report pairs, Chest radiograph interpretation, interpretation requires temporal, prior-current chest X-rays
Comments:
Click to view abstract
Abstract:Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.
165. 【2605.11301】LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
Link: https://arxiv.org/abs/2605.11301
Authors: Xueqi Cheng, Yushun Dong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: strengths across OCR, Multimodal large language, large language models, visual question answering, chart understanding
Comments:
Click to view abstract
Abstract:Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: this https URL.
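The routing policy described at the end of this abstract, per-model scoring combined with availability masking and an optional cost trade-off, can be sketched as a masked argmax. The predicted quality scores are taken as given here, since producing them is the paper's actual contribution. The linear utility `quality - cost_weight * cost`, the function name, and the model names are assumptions for illustration.

```python
def route(pred_quality, cost, available, cost_weight=0.0):
    """Pick the available model with the highest utility.

    pred_quality and cost map model name -> float.
    available is the set of models that may currently be selected.
    cost_weight = 0 gives performance-only routing; larger values
    trade predicted quality against cost.
    """
    best_name, best_utility = None, float("-inf")
    for name, quality in pred_quality.items():
        if name not in available:
            continue  # availability masking: skip models outside the pool
        utility = quality - cost_weight * cost[name]
        if utility > best_utility:
            best_name, best_utility = name, utility
    if best_name is None:
        raise ValueError("no candidate model is available")
    return best_name


# Hypothetical per-query predictions for two candidate models.
quality = {"large-mllm": 0.9, "small-mllm": 0.8}
cost = {"large-mllm": 1.0, "small-mllm": 0.1}
print(route(quality, cost, available={"large-mllm", "small-mllm"}))
print(route(quality, cost, available={"large-mllm", "small-mllm"}, cost_weight=0.5))
```

Because each model is scored independently, adding or removing a candidate only changes the `available` set and does not require recomputing scores for the others.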
166. 【2605.11300】Can Graphs Help Vision SSMs See Better?
Link: https://arxiv.org/abs/2605.11300
Authors: Dhruv Parikh, Anvitha Ramachandran, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: ability of Mamba-style, Mamba-style selective scans, long-range modeling ability, Mamba-style selective, Vision state space
Comments: Technical Report
Click to view abstract
Abstract:Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.
167. 【2605.11276】Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences
Link: https://arxiv.org/abs/2605.11276
Authors: Trevor Neece, Mason Smetana, Lev Khazanovich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: construction workers face, Highway construction workers, Severe Injury Report, OSHA Severe Injury, workers face
Comments:
Abstract:Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.
168. 【2605.11267】Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates
Link: https://arxiv.org/abs/2605.11267
Authors: Quanyun Wu, Kyle Gao, Wentao Sun, Hongjie He, Yuhao Chen, David A. Clausi, Jonathan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: coastal zone monitoring, Accurate measurement, oceanographic analysis, length is crucial, crucial for coastal
Comments: Accepted for publication at IEEE OCEANS (Sanya) 2026
Abstract:Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which face serious limitations of high labor costs, time-consuming efforts, and low operational efficiency in vast and inaccessible open sea environments. To overcome these challenges and break away from the reliance on manual field exploration, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. This project significantly reduces the mapping cost through a fully automated process and achieves high-efficiency measurement without prior GIS data. In our system pipeline, only the geographical coordinates or names of the target area need to be input to obtain a low-altitude surrounding image sequence. After obtaining the point clouds, a lightweight trajectory alignment algorithm (Umeyama) is used to restore the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We have fully verified this pipeline on four islands with different terrain features (covering natural landform islands and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10%, demonstrating excellent accuracy and robustness. Moreover, this framework has outstanding inference speed, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline
169. 【2605.11266】PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
Link: https://arxiv.org/abs/2605.11266
Authors: Zachary Lee, Maxwell Jacobson, Yexiang Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: methods remain purely, Gaussian Splatting, Recent advances, remain purely visual, enabled fast
Comments: Submitted to Artificial Intelligence. 52 pages
Abstract:Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS-generated structures than an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.
170. 【2605.11265】DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction
Link: https://arxiv.org/abs/2605.11265
Authors: Guiqiu Liao, Matjaž Jogan, Daniel A. Hashimoto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: provide valuable guidance, surgical computer vision, computer vision, robotic surgery, provide valuable
Comments: Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)
Abstract:Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.
171. 【2605.11224】ABRA: Agent Benchmark for Radiology Applications
Link: https://arxiv.org/abs/2605.11224
Authors: Bulat Maksudov, Vladislav Kurenkov, Kathleen M. Curran, Alessandra Mileo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Existing medical-agent benchmarks, medical-agent benchmarks deliver, benchmarks deliver imaging, Existing medical-agent, Orthanc DICOM server
Comments:
Abstract:Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at this https URL
172. 【2605.11208】Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Link: https://arxiv.org/abs/2605.11208
Authors: Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide objective feedback, reduce documentation burden, remain challenging due, clinician-grade assessment reports, aligning dense spatio-temporal
Comments: 11 pages, 2 figures
Abstract:Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
173. 【2605.11203】FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
Link: https://arxiv.org/abs/2605.11203
Authors: Elias B. Krey, Nils Neukirch, Nils Strodthoff
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: deep neural networks, Intermediate feature representations, Intermediate feature, feature representations represent, neural networks
Comments: 27 pages, 24 figures, 3 tables, Code is available at [this https URL](https://github.com/AI4HealthUOL/FeatMap)
Abstract:Intermediate feature representations represent the backbone for the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this submission, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space, mapping from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non-linear and local to global mappings and assess both the reconstruction quality of the mapping as well as the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve best results, we show that the same can be achieved with a shared linear model operating on a single feature vector typically with very little degradation in reconstruction quality, even for highly non-trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is to a first degree of approximation organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.
174. 【2605.11166】Unpacking the Eye of the Beholder: Social Location, Identity, and the Moving Target of Political Perspectives
Link: https://arxiv.org/abs/2605.11166
Authors: Elena Sirotkina
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: finding decades deep, produce single scores, social identities structure, evaluate political information, people evaluate political
Comments:
Abstract:Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.
175. 【2605.11131】USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation
Link: https://arxiv.org/abs/2605.11131
Authors: Elisha Dayag, Nhat Thanh Tran, Jack Xin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate medical image, image analysis pipeline, medical image segmentation, medical image analysis, Accurate medical
Comments:
Abstract:Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.
176. 【2605.11115】LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
Link: https://arxiv.org/abs/2605.11115
Authors: Pedram Fekri, WenChen Li, William Chen, Peter Altamirano
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: generative models, dynamic range outputs, largely limited, limited to low, low dynamic range
Comments:
Abstract:High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusion-based approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent-to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both text- and image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.
177. 【2605.11107】Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
Link: https://arxiv.org/abs/2605.11107
Authors: Youssef Zaazou, Mark Thomas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: vision encoders remain, encoders remain vulnerable, Vision-language models, CLIP and SigLIP, image classification
Comments: 36 pages, 7 figures
Abstract:Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding 90% on Waterbirds under perfect (100%) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
178. 【2605.11061】HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
Link: https://arxiv.org/abs/2605.11061
Authors: Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Yingwei Pan, Yi Peng, Zhaofan Qiu, Kai Yu, Yiheng Zhang, Hao Ai, Siying Bai, Yang Chen, Zhihui Chen, Fengbin Gao, Ying Guo, Dong Li, Zhen Shen, Leilei Shi, Jing Wang, Siyu Wang, Yimeng Wang, Rui Zheng, Ting Yao, Tao Mei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: pixel-space Diffusion Transformer, fragmented architectures relying, Diffusion Transformer, long been constrained, constrained by fragmented
Comments: Source codes and models are available at Github: [this https URL](https://github.com/HiDream-ai/HiDream-O1-Image) and Huggingface: [this https URL](https://huggingface.co/HiDream-ai/HiDream-O1-Image)
Abstract:The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version, HiDream-O1-Image-Pro (200B+), unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.
179. 【2605.11055】The first global agricultural field boundary map at 10m resolution
Link: https://arxiv.org/abs/2605.11055
Authors: Caleb Robinson, Gedeon Muhawenayo, Subash Khanal, Zhanpei Fang, Isaac Corley, Ana M. Tárano, Lyndon Estes, Jennifer Marcus, Nathan Jacobs, Hannah Kerner, Inbal Becker-Reshef, Juan M. Lavista Ferres
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: pixel level, field, field boundary dataset, global remote-sensing products, data products exist
Comments:
Abstract:The agricultural field is the natural unit at which crops are planted, managed, regulated, and reported, yet most global remote-sensing products for agriculture are only available at the pixel level. While some high-quality field-level data products exist, they come from parcel registries covering only parts of Europe or from ML-derived products for individual countries. No openly available, globally consistent map of agricultural field boundaries exists to date. Here we present the first global field boundary dataset at 10 m resolution for the years 2024 and 2025, comprising 3.17 billion remote-sensing field polygons (1.62 B in 2024 and 1.55 B in 2025) across 241 countries and territories, produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics. Validated against ground-truth field boundaries in 24 countries, the map achieved a mean pixel-level recall of 0.85 with 14 countries exceeding 0.90. Evaluation against full-country ground-truth datasets in Austria, Latvia, and Finland yielded F1 scores of 0.89, 0.88, and 0.74, respectively. Because reference data for global validation is inherently incomplete, we accompanied the map with a 500 m confidence layer that identifies regions where predictions are reliable. We release the dataset openly as three global maps: the confidence-thresholded default field boundary dataset, the full unfiltered dataset, and the continuous-valued confidence raster. These maps provide the first globally consistent field-level unit of analysis for crop monitoring, food security, and downstream agricultural science.
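Editor's note on the metrics quoted above: the abstract reports pixel-level recall and F1 but does not say how they were computed. The sketch below shows the usual pixel-level definitions on two small binary masks. The function name `pixel_recall_f1` and the nested-list mask format are illustrative assumptions, and the paper's F1 may be computed differently (for example, per field object rather than per pixel).

```python
def pixel_recall_f1(pred, truth):
    """Return (recall, f1) for two equally sized binary masks (nested lists).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # field pixel correctly predicted
            elif p and not t:
                fp += 1      # predicted field where the reference has none
            elif t and not p:
                fn += 1      # reference field pixel that was missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1


# Toy 2x3 masks: 1 marks a field pixel, 0 marks background.
predicted = [[1, 1, 0], [0, 1, 0]]
reference = [[1, 0, 0], [0, 1, 1]]
recall, f1 = pixel_recall_f1(predicted, reference)
```

Recall ignores false positives, which is one reason a map validated against incomplete reference data is often reported with recall rather than precision.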
180. 【2605.10984】Principle-Guided Supervision for Interpretable Uncertainty in Medical Image Segmentation
Link: https://arxiv.org/abs/2605.10984
Authors: An Sui, Yuzhu Li, Gunter Schumann, Fuping Wu, Xiahai Zhuang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: quantification complements model, complements model predictions, high-stakes decision making, Uncertainty quantification complements, characterizing their reliability
Comments: 14 pages, 8 figures
Abstract:Uncertainty quantification complements model predictions by characterizing their reliability, which is essential for high-stakes decision making such as medical image segmentation. However, most existing methods reduce uncertainty to a scalar confidence estimate, leaving its spatial distribution semantically underconstrained. In this work, we focus on uncertainty interpretability, namely, whether estimated uncertainty behaves in a human-understandable manner with respect to sources of ambiguity. We identify three perception-aligned principles requiring the spatial distribution of uncertainty to reflect: (1) image contrast between structures, (2) severity of image corruption, and (3) geometric complexity in anatomical structures. Accordingly, we develop a principle-guided uncertainty supervision framework (PriUS) based on evidential learning, in which the corresponding supervision objectives are explicitly enforced during training. We further introduce quantitative metrics to measure the consistency between predicted uncertainty and image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS datasets showed that, compared with state-of-the-art methods, PriUS produced more consistent uncertainty estimates while maintaining competitive segmentation performance.
181. 【2605.10983】TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Link: https://arxiv.org/abs/2605.10983
Authors: Jiaming Li, Chenyu Zhu, Zhiyuan Ma, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown extraordinary potential, inducing visual mode, visual mode collapse, Reinforcement learning, significant reward hacking
Comments:
Abstract:Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
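Editor's note on the distribution-matching idea above: the abstract does not give the exact form of the Softmax-TB objective, so the following is only a generic stand-in, not the authors' loss. It assumes the target is a Boltzmann distribution `softmax(reward / temperature)` over K sampled trajectories, that the policy's trajectory log-probabilities are renormalized over the same K samples, and that the mismatch is measured with forward KL. The function names and the `temperature` parameter are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_matching_loss(log_probs, rewards, temperature=1.0):
    """Forward KL(target || policy) over K sampled trajectories.

    target: Boltzmann distribution induced by the rewards (assumption).
    policy: trajectory log-probabilities renormalized over the K samples.
    """
    target = softmax([r / temperature for r in rewards])
    policy = softmax(log_probs)
    return sum(
        t * (math.log(t) - math.log(q))
        for t, q in zip(target, policy)
        if t > 0.0
    )


# The loss vanishes when the policy already matches the reward-induced target,
# and is positive when the policy spreads mass differently from the target.
matched = trajectory_matching_loss([-1.0, -2.0, -3.0], [3.0, 2.0, 1.0])
mismatched = trajectory_matching_loss([0.0, 0.0, 0.0], [2.0, 0.0, 0.0])
```

Because forward KL penalizes the policy for assigning low probability to any trajectory the target supports, minimizing it tends to cover all rewarded trajectories instead of collapsing onto the single best one, which is the mode-covering behavior the abstract describes.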
182. 【2605.11758】DiffSegLung: Diffusion Radiomic Distillation for Unsupervised Lung Pathology Segmentation
Link: https://arxiv.org/abs/2605.11758
Authors: Rezkellah Noureddine Khiati, Pierre-Yves Brillet, Catalin Fetita
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: quantitative Hounsfield Unit, Hounsfield Unit, open challenge due, existing diffusion-based methods, physically distinguishes tissue
Comments:
Abstract:Unsupervised segmentation of pulmonary pathologies in CT remains an open challenge due to the absence of annotated multi-pathology cohorts and the failure of existing diffusion-based methods to exploit the quantitative Hounsfield Unit (HU) signal that physically distinguishes tissue classes. To address this, we propose DiffSegLung, a framework that introduces Diffusion Radiomic Distillation, in which handcrafted radiomic descriptors serve as a physics-grounded teacher to shape the bottleneck of a 3D diffusion U-Net via a contrastive objective, transferring pathology-discriminative structure into the learned representation without any annotations. At inference, the teacher is discarded and multi-timestep bottleneck features are clustered by a Gaussian Mixture Model with HU-guided label assignment, followed by Sobel Diffusion Fusion for boundary refinement. Evaluated on 190 expert-annotated axial slices drawn from four heterogeneous CT cohorts, DiffSegLung improves segmentation across all four pathology classes over unsupervised baselines and improves generation fidelity over prior CT diffusion models.
183. 【2605.11583】NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI
Link: https://arxiv.org/abs/2605.11583
Authors: Tal Oved, Efrat Shimron
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Keywords: Modern low-field magnetic, standard high-field MRI, low-field magnetic resonance, magnetic resonance imaging, technology offers
Comments:
Abstract:Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increase SNR is through repetitive signal acquisitions, known as NEX, but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of the sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP enables optimizing the sampling density probabilities across the extended k-space-NEX domain, under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also demonstrate that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
184. 【2605.11109】Deploying Self-Supervised Learning for Real Seismic Data Denoising
Link: https://arxiv.org/abs/2605.11109
Authors: Giovanny A. M. Arboleda, Claudio D. T. de Souza, Carlos E. M. dos Anjos, Lessandro de S. S. Valente, Roosevelt de L. Sardinha, Albino Aveleda, Pablo M. Barros, André Bulcão, Alexandre G. Evsukoff
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: require clean reference, NaC SSL method, clean reference data, seismic data, real seismic data
Comments:
Abstract:Self-supervised learning (SSL) has emerged as a promising approach to seismic data denoising as it does not require clean reference data. In this work, the deployment of the Noisy-as-Clean (NaC) method was evaluated for real seismic data denoising under controlled conditions. Two independent seismic acquisitions, each comprising noisy and filtered data, were organized into four real datasets. The NaC SSL method was adapted to add real noise to the noisy input, controlled by a parameter. An experimental protocol with ten experiments was designed to compare different strategies for deploying the NaC SSL method with the supervised learning baseline, using identical network topology and hyperparameters. The models were evaluated in terms of denoising performance, computational cost, and generalization capability. The results show that synthetic additive white Gaussian noise (AWGN) is inadequate for the denoising of seismic data within the NaC method, and that performance strongly depends on the compatibility between the injected and actual noise characteristics. Furthermore, both the characteristics of the seismic data and the noise level influence the performance of the model. Self-supervised fine-tuning on test data improved SSL performance, whereas no such gain was observed for fine-tuning of supervised models. Finally, NaC has been shown to be a simple, effective, and model-independent method that offers a feasible solution for the denoising of real seismic data.
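The pairing idea behind this adaptation can be sketched in a few lines: the observed noisy trace is kept as the training target, and the network input is the same trace with additional noise scaled by a control parameter. The names `make_nac_pair` and `alpha`, and the Gaussian stand-in for a recorded noise sample, are assumptions for illustration; the abstract does not specify how the paper extracts or scales real noise.

```python
import random


def make_nac_pair(noisy_trace, noise_sample, alpha):
    """Return (network_input, training_target) for one Noisy-as-Clean pair.

    The observed noisy trace serves as the target. The input is the same
    trace with extra noise added, scaled by `alpha` (hypothetical name for
    the control parameter mentioned in the abstract).
    """
    if len(noisy_trace) != len(noise_sample):
        raise ValueError("trace and noise sample must have the same length")
    noisier = [x + alpha * n for x, n in zip(noisy_trace, noise_sample)]
    return noisier, list(noisy_trace)


rng = random.Random(0)
# Stand-in data: in practice both lists would come from field recordings.
noisy_trace = [rng.gauss(0.0, 1.0) for _ in range(8)]
noise_sample = [rng.gauss(0.0, 0.5) for _ in range(8)]

net_input, target = make_nac_pair(noisy_trace, noise_sample, alpha=0.5)
```

Setting `alpha` to 0 returns the noisy trace as both input and target, which gives the network nothing to remove; larger values inject more noise relative to what is already present.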
185. 【2605.11060】SplitFed-CL: A Split Federated Co-Learning Framework for Medical Image Segmentation with Inaccurate Labels
Link: https://arxiv.org/abs/2605.11060
Authors: Zahra Hafezi Kafshgari, Hadi Hadizadeh, Parvaneh Saeedi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Split Federated Learning, reducing client-side computation, Split Federated, Federated Learning, split learning
Comments:
Abstract:Split Federated Learning (SplitFed) combines federated and split learning to preserve privacy while reducing client-side computation. However, in medical image segmentation, heterogeneous label quality across clients can significantly degrade performance. We propose SplitFed-CL, a co-learning framework where a global teacher guides local students to detect and refine unreliable annotations. Reliable labels supervise training directly, while unreliable labels are corrected via weighted student-teacher refinement. SplitFed-CL further incorporates consistency regularization for robustness to input perturbations and a trainable weighting module to balance loss terms adaptively. We also introduce a novel difficulty-guided strategy to simulate human-like, boundary-centric annotation errors, where the degree of perturbation is governed by shape complexity and the associated annotation difficulty. Experiments on two multiclass segmentation datasets with controlled synthetic noise, together with a binary segmentation dataset containing real-world annotation errors, demonstrate that SplitFed-CL consistently outperforms seven state-of-the-art baselines, yielding improved segmentation quality and robustness.
186. 【2605.10995】Streaming of rendered content with adaptive frame rate and resolution
Link: https://arxiv.org/abs/2605.10995
Authors: Yaru Liu, Joseph G. March, Rafal K. Mantiuk
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
Keywords: bring high-quality graphics, sufficient rendering power, Streaming rendered content, Streaming rendered, bring high-quality
Comments:
Abstract:Streaming rendered content is an attractive way to bring high-quality graphics to billions of mobile devices that do not have sufficient rendering power. Existing solutions render content on a server at a fixed frame rate, typically 30 or 60 frames per second, and reduce resolution when bandwidth is restricted. However, this strategy leads to suboptimal rendering quality under the bandwidth constraints. In this work, we exploit the spatio-temporal limits of the human visual system to improve perceived quality while reducing rendering costs by adaptively adjusting both frame rate and resolution based on scene content and motion. Our approach is codec-agnostic and requires only minimal modifications to existing rendering infrastructure. We propose a system in which a lightweight neural network predicts the optimal combination of frame rate and resolution for a given transmission bandwidth, content, and motion velocity. This prediction significantly enhances perceptual quality while minimizing computational cost under bandwidth constraints. The network is trained on a large dataset of rendered content labeled with a perceptual video quality metric. The dataset and further information can be found on the project web page.
187. 【2605.10953】Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising
Link: https://arxiv.org/abs/2605.10953
Authors: Jiahua Zhao, Umair bin Waheed, Jing Sun, Yang Cui, Nikos Savva, Eric Verschuur
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: continuous Earth monitoring, distributed acoustic sensing, dense geophone deployments, high-resolution subsurface imaging, driven rapid growth
Comments: 34 pages, 8 figures, 6 tables. Submitted to Geophysics for publication consideration
Abstract:The demand for high-resolution subsurface imaging and continuous Earth monitoring has driven rapid growth in active and passive seismic data from dense geophone deployments, distributed acoustic sensing (DAS) arrays, and large-scale 2D and 3D surveys. This expansion makes complex noise suppression increasingly challenging, especially when signal fidelity must be preserved. Conventional supervised deep learning methods are often task-specific, require large paired datasets, and can suffer from domain shift under new acquisition conditions. Foundation models offer a promising alternative, but pre-training seismic foundation models from scratch requires massive domain-specific data and substantial computation. We propose an efficient framework that repurposes general-purpose Vision Foundation Models (VFMs) for geophysical tasks through Parameter-Efficient Fine-Tuning. The architecture uses a pre-trained VFM, a DINOv3 encoder, adapted with Low-Rank Adaptation (LoRA) to enable effective feature adaptation with few additional parameters. To improve robustness under unseen field conditions without ground truth, we introduce a kurtosis-guided unsupervised test-time adaptation module that updates only LoRA parameters during inference. This module self-calibrates the model to site-specific noise by identifying information-rich regions via kurtosis and performing self-training without labeled data. Experiments on public exploration seismic images and DAS vertical seismic profiling data from the Utah FORGE site show that the framework matches or outperforms domain-specific models. Tests on unseen cross-site data from a land survey in China and the Groß Schönebeck geothermal site in Germany further demonstrate strong generalization and effective signal-noise separation. These results highlight the potential of adapting pre-trained VFMs to data-intensive problems in exploration seismology.
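For readers unfamiliar with LoRA, the sketch below shows the general low-rank update on a single linear layer: the pre-trained weight matrix W stays frozen and only two small matrices A and B are trained, giving an output of Wx + scale * B(Ax). The tiny matrices and the `lora_forward` helper are illustrative assumptions and do not reflect how the paper attaches LoRA to the DINOv3 encoder.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen path W @ x plus the trainable low-rank path scale * B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Frozen pre-trained weight: 2 outputs x 3 inputs (illustrative values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
# Rank-1 adapter: A is 1 x 3, B is 2 x 1.
A = [[0.5, 0.5, 0.0]]
B_init = [[0.0], [0.0]]      # zero-initialised B leaves the base output unchanged
B_trained = [[1.0], [2.0]]   # hypothetical values after some adaptation steps

x = [2.0, 4.0, 1.0]
before = lora_forward(W, A, B_init, x)
after = lora_forward(W, A, B_trained, x)

frozen_params = len(W) * len(W[0])
trainable_params = len(A) * len(A[0]) + len(B_trained) * len(B_trained[0])
```

Here the adapter has 5 trainable values against 6 frozen ones, which is not a saving at this toy size; the benefit appears when the layer dimensions are in the hundreds or thousands and the rank stays small. Updating only A and B at inference time is also what makes a test-time adaptation step cheap.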
188. 【2605.10949】AlphaEarth Satellite Embeddings for Modelling Climate Sensitive Diseases Towards Global Health Resilience
Link: https://arxiv.org/abs/2605.10949
Authors: Usman Nazir, I-Han Cheng, Sara Khalid
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: variability modulates transmission, million deaths annually, climate variability modulates, modulates transmission, child undernutrition
Comments: Visualising Climate 2026
Abstract:Malaria, childhood acute respiratory infection, and child undernutrition together account for over two million deaths annually in children under five, with the burden concentrated in low- and middle-income countries where climate variability modulates transmission, exposure, and nutritional outcomes. Routine health surveillance in these settings remains sparse and reactive. Satellite-derived representations of the Earth's surface offer a scalable, low-cost complement to traditional covariates, yet their utility as predictors of population health outcomes is poorly characterised. We summarise findings from three studies evaluating AlphaEarth Foundations 64-dimensional satellite embeddings as predictors of population health outcomes, focusing on vulnerable populations. The studies span infectious disease (malaria, respiratory infection) and stunting. In each study, embeddings provide predictive value at sufficient spatial granularity: (i) malaria prediction across Nigeria shows consistent per-region R^2 gains; (ii) childhood acute respiratory infection prediction across 11 DHS countries increases pooled R^2 from 0.157 to 0.206 across three tree-based estimators; (iii) stunting prediction across 35 countries is neutral at country level due to collinearity with fixed effects. The stunting case is currently limited by the lack of DHS cluster-level coordinates; an analysis at that level is the next key experiment.

