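This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

681 papers were updated today, including:

Natural language processing: 93
Information retrieval: 15
Computer vision: 147

Natural Language Processing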

1. 【2603.19225】FinTradeBench: A Financial Reasoning Benchmark for LLMs

Link: https://arxiv.org/abs/2603.19225

Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta (University of Central Florida)

Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)

Keywords: Real-world financial decision-making, Large Language Models, Real-world financial, price dynamics, trading signals computed

Comments: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication

Abstract:Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

2. 【2603.19223】F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Link: https://arxiv.org/abs/2603.19223

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: distinct sizes ranging, distinct sizes, multilingual embedding models, sizes ranging, multilingual embedding

Abstract:We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.

3. 【2603.19221】Online Learning and Equilibrium Computation with Ranking Feedback

Link: https://arxiv.org/abs/2603.19221

Authors: Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)

Keywords: Online learning, possibly adversarial, sequential decision-making, existing online learning

Abstract:Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on numeric utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a ranking over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the instantaneous utility at the current timestep, and rankings induced by the time-average utility up to the current timestep, under both full-information and bandit feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, i.e., under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
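
As a rough illustration of the ranking feedback mentioned in the abstract above, the sketch below samples a ranking of actions from a Plackett-Luce model with a temperature parameter. The function name, the example utilities, and the temperature values are assumptions chosen for illustration; none of them come from the paper, and the paper's own algorithms are not reproduced here.

```python
import math
import random


def sample_plackett_luce(utilities, temperature, rng):
    """Sample a ranking (best first) of action indices under a Plackett-Luce model.

    At each step, one of the remaining actions is drawn with probability
    proportional to exp(utility / temperature), then removed from the pool.
    """
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Subtract the current maximum before exponentiating for numerical stability.
        top = max(utilities[i] for i in remaining)
        weights = [math.exp((utilities[i] - top) / temperature) for i in remaining]
        choice = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(choice)
        remaining.remove(choice)
    return ranking


# Hypothetical utilities for three actions (illustrative values only).
example_utilities = [0.1, 0.9, 0.5]
rng = random.Random(0)
low_temperature_ranking = sample_plackett_luce(example_utilities, 1e-3, rng)
high_temperature_ranking = sample_plackett_luce(example_utilities, 10.0, rng)
```

With a very small temperature the sampled ranking almost always follows the utility order, which corresponds to the "relatively deterministic" regime the abstract refers to; with a large temperature the ranking is close to a uniformly random permutation.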

4. 【2603.19220】Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Link: https://arxiv.org/abs/2603.19220

Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: International Mathematical Olympiad, ICPC World Finals, strong agentic capabilities, Gold Medal-level performance, activated parameters

Comments: We release the model and data at [this https URL](https://huggingface.co/collections/nvidia/nemotron-cascade-2)

Abstract:We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoint and training data.

5. 【2603.19182】Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Link: https://arxiv.org/abs/2603.19182

Authors: Zou Qiang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: demonstrate strong generative, strong generative capabilities, demonstrate strong, strong generative, generative capabilities

Comments: 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation

Abstract:Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.

6. 【2603.19167】Evaluating Counterfactual Strategic Reasoning in Large Language Models

Link: https://arxiv.org/abs/2603.19167

Authors: Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, evaluate Large Language, repeated game-theoretic settings, performance reflects genuine, Language Models

Abstract:We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
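
To make the idea of a counterfactual payoff variant concrete, here is a minimal sketch that checks which action, if any, is strictly dominant in a two-player game. The payoff numbers, the `dominant_action` helper, and the particular counterfactual matrix are hypothetical and chosen only for illustration; the variants actually used in the paper may differ.

```python
def dominant_action(payoffs, actions):
    """Return the row player's strictly dominant action, or None if there is none.

    `payoffs[(own, opponent)]` is the row player's payoff when they play `own`
    and the opponent plays `opponent`.
    """
    for candidate in actions:
        alternatives = [a for a in actions if a != candidate]
        if all(
            payoffs[(candidate, opp)] > payoffs[(alt, opp)]
            for alt in alternatives
            for opp in actions
        ):
            return candidate
    return None


actions = ["C", "D"]  # cooperate, defect

# Textbook Prisoner's Dilemma payoffs for the row player (illustrative values).
standard_pd = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Hypothetical counterfactual variant: the labels stay the same, but the payoffs
# are altered so that the familiar dominance relation no longer holds.
counterfactual_pd = {("C", "C"): 5, ("C", "D"): 1, ("D", "C"): 3, ("D", "D"): 0}

standard_dominant = dominant_action(standard_pd, actions)
counterfactual_dominant = dominant_action(counterfactual_pd, actions)
```

In the standard matrix defection is strictly dominant, while in the altered matrix cooperation is. A model that keeps defecting in the second game would be following the memorized pattern rather than the incentives actually in front of it.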

7. 【2603.19166】Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Link: https://arxiv.org/abs/2603.19166

Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: convert natural language, collaborating with humans, humans must convert, convert natural, grounding

Comments: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: [this https URL](https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/)

Abstract:Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

8. 【2603.19152】VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Link: https://arxiv.org/abs/2603.19152

Authors: Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: frequently exhibit suboptimal, inefficient subword segmentation, training data imbalances, systemic training data, models frequently exhibit

Comments: 23 pages. Includes figures and tables. Conference submission

Abstract:Large language models frequently exhibit suboptimal performance on low resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration exploitation manifold. By integrating entropy tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.

9. 【2603.19149】Optimal Splitting of Language Models from Mixtures to Specialized Domains

Link: https://arxiv.org/abs/2603.19149

Authors: Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: models achieve impressive, reasoning tasks due, achieve impressive performance, achieve impressive, tasks due

Comments: 26 pages, 11 tables, 17 figures

Abstract:Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

10. 【2603.19144】UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Link: https://arxiv.org/abs/2603.19144

Authors: Zikang Ding, Junchi Yao, Junhao Li, Yi Zhang, Wenbo Jiang, Hongbo Liu, Lijie Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: exhibit pronounced social, Large language models, pronounced social biases, Large language, language models

Abstract:Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.

11. 【2603.19118】How Uncertainty Estimation Scales with Sampling in Reasoning Models

Link: https://arxiv.org/abs/2603.19118

Authors: Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull, Mark Fishel

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: remains poorly understood, understood under extended, deploying reasoning language, estimation is critical, critical for deploying

Abstract:Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to +12 on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
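
The two black-box signals named in the abstract above can be sketched as follows. Self-consistency is computed here as the fraction of sampled answers that agree with the majority answer, and the "hybrid" score is a plain weighted average of that fraction and the mean verbalized confidence. This combination rule, the function names, and the sample values are assumptions for illustration only; the abstract does not specify how the paper's hybrid estimator is built.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


def hybrid_confidence(answers, verbalized, weight=0.5):
    """Combine agreement and verbalized confidence into one score.

    Assumption: a simple weighted average, used here only as a stand-in
    for whatever combination the paper actually evaluates.
    """
    majority, agreement = self_consistency(answers)
    # Average the verbalized confidence over samples that gave the majority answer.
    supporting = [c for a, c in zip(answers, verbalized) if a == majority]
    mean_verbalized = sum(supporting) / len(supporting)
    return majority, weight * agreement + (1 - weight) * mean_verbalized


# Hypothetical parallel samples: final answers and self-reported confidences.
sampled_answers = ["42", "42", "17", "42"]
sampled_confidences = [0.9, 0.8, 0.6, 0.7]
answer, score = hybrid_confidence(sampled_answers, sampled_confidences)
```

Both inputs need only the model's sampled outputs, which is what makes the approach black-box: no logits or internal states are required.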

12. 【2603.19097】DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Link: https://arxiv.org/abs/2603.19097

Authors: Yilin Wang, Yuchun Fan, Jiaoyang Li, Ziming Zhu, Yongyu Mu, Qiaozhi He, Tong Xiao, Jingbo Zhu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-augmented generation, multi-hop question answering, complex multi-hop question, solving complex multi-hop, made significant progress

Comments: Accepted by ICASSP 2026

Abstract:Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

13. 【2603.19092】SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Link: https://arxiv.org/abs/2603.19092

Authors: Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Vision-language models, increasingly deployed, deployed in real-world, real-world and embodied, embodied settings

Abstract:Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.

14. 【2603.19087】Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Link: https://arxiv.org/abs/2603.19087

Authors: Qiawen Ella Liu, Marina Dubova, Henry Conklin, Takumi Harada, Thomas L. Griffiths

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, language models, large language, cross-domain mapping, cross-domain

Abstract:Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativity: forcing creators to draw an analogy from a random, remote source domain ("cross-domain mapping"). Human participants and LLMs generated novel features for ten daily products (e.g., backpack, TV) under two prompts: (i) cross-domain mapping, which required translating a property from a randomly assigned source (e.g., octopus, cactus, GPS), and (ii) user-need, which required proposing innovations targeting unmet user needs. We show that humans reliably benefit from randomly assigned cross-domain mappings, while LLMs, on average, generate more original ideas than humans and do not show a statistically significant effect of cross-domain mappings. However, in both systems, the impact of cross-domain mapping increases when the inspiration source becomes more semantically distant from the target. Our results highlight both the role of remote association in creative ideation and systematic differences in how humans and LLMs respond to the same intervention for creativity.

15. 【2603.19082】A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Link: https://arxiv.org/abs/2603.19082

Authors: Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz, Davis Bartels, Dukyong Yoon, Brad Quitadamo, Rajiv Menghrajani, Leo Celi, Sarvesh Soni

Categories: Computation and Language (cs.CL)

Keywords: current screening tools, structured electronic health, electronic health records, health records difficult, Health literacy

Abstract:Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).

16. 【2603.19066】Parallelograms Strike Back: LLMs Generate Better Analogies than People

Link: https://arxiv.org/abs/2603.19066

Authors: Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu, Adele E. Goldberg, Thomas L. Griffiths

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: classically modeled geometrically, simple local-similarity heuristics, model poorly captures, Four-term word analogies, Four-term word

Abstract:Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
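
For readers unfamiliar with the parallelogram model mentioned above, the sketch below scores a candidate completion D for A:B::C:D by the cosine similarity between D and the point C + (B - A). The two-dimensional vectors are toy values invented for illustration, not GloVe embeddings, and the function names are hypothetical; the paper's exact alignment measure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def parallelogram_score(a, b, c, d):
    """Score how well d completes a:b::c:d under the parallelogram model.

    The predicted fourth corner is c + (b - a); the score is its cosine
    similarity to the candidate vector d.
    """
    predicted = [ci + bi - ai for ai, bi, ci in zip(a, b, c)]
    return cosine(predicted, d)


# Toy 2-D "embeddings" (assumed values, for illustration only).
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "prince": [2.5, 0.0],
}

queen_score = parallelogram_score(toy["man"], toy["woman"], toy["king"], toy["queen"])
prince_score = parallelogram_score(toy["man"], toy["woman"], toy["king"], toy["prince"])
```

In this toy space "queen" sits exactly on the predicted corner and so scores higher than "prince", which is similar to "king" but does not preserve the man-to-woman offset.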

17. 【2603.19044】MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Link: https://arxiv.org/abs/2603.19044

Authors: Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He

Categories: Computation and Language (cs.CL)

Keywords: Scientific ideation aims, ideation aims, Scientific, Scientific ideation

Abstract:Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub (this https URL).

18. 【2603.19017】What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Link: https://arxiv.org/abs/2603.19017

Authors: Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Chinese Lunar, time zone conversion, multiple calendar conventions, reasoning benchmark spanning, temporal relation extraction

Abstract:We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: this https URL

19. 【2603.19008】Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Link: https://arxiv.org/abs/2603.19008

Authors: Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, improves Large Language, Language Models, Large Language, improves Large

Comments:

Abstract:Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at this https URL.
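
The three-query rewriting step described in this abstract can be sketched roughly as below. This is only an illustration under stated assumptions: the class `MCQuestion`, the function `hypothesis_conditioned_queries`, the `propose_hypothesis` callback (standing in for whatever model call derives the working hypothesis), and the query templates are all hypothetical and are not taken from the paper, which presumably uses an LLM to write the queries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MCQuestion:
    """A multiple-choice question: a stem plus labelled answer options."""
    stem: str
    options: Dict[str, str]


def hypothesis_conditioned_queries(
    question: MCQuestion,
    propose_hypothesis: Callable[[MCQuestion], str],
) -> List[str]:
    """Build three retrieval queries around a working hypothesis.

    propose_hypothesis is a hypothetical callback returning the key of the
    option currently believed to be correct.
    """
    key = propose_hypothesis(question)
    hypothesis = question.options[key]
    rivals = [text for k, text in question.options.items() if k != key]
    return [
        # (1) evidence that supports the working hypothesis
        f"evidence supporting {hypothesis} given: {question.stem}",
        # (2) evidence that separates it from the competing options
        f"features distinguishing {hypothesis} from {', '.join(rivals)}",
        # (3) verification of salient clues in the question itself
        f"verify key findings in: {question.stem}",
    ]
```

Each of the three strings would be sent to the retriever separately, and the generator would then confirm or overturn the hypothesis using the pooled evidence.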

20. 【2603.19002】RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Link: https://arxiv.org/abs/2603.19002

Authors: Weronika Łajewska, Paul Missault, George Davidson, Saab Mansour

Categories: Computation and Language (cs.CL)

Keywords: generating human-like responses, responses at scale, LLMs is emerging, generating human-like, human-like responses

Comments:

Abstract:Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.

21. 【2603.18945】A conceptual framework for ideology beyond the left and right

Link: https://arxiv.org/abs/2603.18945

Authors: Kenneth Joseph, Kim Williams, David Lazer

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Keywords: partisan axis, CSS work, CSS, operationalized ideology, existing NLP tasks

Comments:

Abstract:NLP+CSS work has operationalized ideology almost exclusively on a left/right partisan axis. This approach obscures the fact that people hold interpretations of many different complex and more specific ideologies on issues like race, climate, and gender. We introduce a framework that understands ideology as an attributed, multi-level socio-cognitive concept network, and explains how ideology manifests in discourse in relation to other relevant social processes like framing. We demonstrate how this framework can clarify overlaps between existing NLP tasks (e.g. stance detection and natural language inference) and also how it reveals new research directions. Our work provides a unique and important bridge between computational methods and ideology theory, enabling richer analysis of social discourse in a way that benefits both fields.

22. 【2603.18940】Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Link: https://arxiv.org/abs/2603.18940

Authors: Xinghao Zhao

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: reasoning improves LLM, cheaply remains elusive, detecting failures cheaply, failures cheaply remains, improves LLM accuracy

Comments:

Abstract:Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186-0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
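
The monotonicity criterion described in this abstract can be illustrated with a short sketch. It is not the author's implementation: it assumes that the answers sampled at each reasoning step are available as plain strings, that entropy is the Shannon entropy of their empirical distribution, and that "decreases" means a strict decrease. The function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution of sampled answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_trajectory(samples_per_step):
    """One entropy value per reasoning step."""
    return [answer_entropy(step) for step in samples_per_step]


def violation_count(entropies):
    """Number of steps at which entropy fails to strictly decrease."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)


def is_monotone(entropies):
    """A chain is monotone when no step violates the decrease."""
    return violation_count(entropies) == 0


# Four sampled answers after each of three reasoning steps (made-up values).
steps = [
    ["12", "15", "9", "12"],
    ["12", "12", "9", "12"],
    ["12", "12", "12", "12"],
]
print(is_monotone(entropy_trajectory(steps)))  # prints True
```

The violation count corresponds to the 0/1/2 buckets reported in the abstract; only the shape of the trajectory is used, not the total amount of entropy removed.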

23. 【2603.18911】Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Link: https://arxiv.org/abs/2603.18911

Authors: Vedant Pandya

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: contextually relevant responses, external knowledge sources, Knowledge-grounded dialogue systems, dialogue systems aim, generate informative

Comments: 30 pages, 15 figures, 11 tables. Comprehensive study across 6 LLMs (250M-7B parameters) with explainability analysis. Code and data available upon request

Abstract:Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).

24. 【2603.18886】Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Link: https://arxiv.org/abs/2603.18886

Authors: Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: downstream STEM applications, formally structured expressions, precisely derive mathematical, STEM applications, downstream STEM

Comments:

Abstract:The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, whereas our training recipes can bring significant improvements across different LLM backbones and simultaneously improve results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.

25. 【2603.18879】A Human-in/on-the-Loop Framework for Accessible Text Generation

Link: https://arxiv.org/abs/2603.18879

Authors: Lourdes Moreno, Paloma Martínez

Categories: Computation and Language (cs.CL)

Keywords: Plain Language, essential for cognitive, current automatic simplification, Key Performance Indicators, accessibility Key Performance

Comments: Accepted at LREC 2026. To appear in the Proceedings of the 14th International Conference on Language Resources and Evaluation (LREC 2026)

Abstract:Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated and metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.

26. 【2603.18873】Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo

Link: https://arxiv.org/abs/2603.18873

Authors: Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Popular language learning, large language models, Popular language, Popular, language models

Comments: 5 pages, 3 figures, presented at the 3rd HEAL Workshop at CHI 2026

Abstract:Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to comfortably communicate various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.

27. 【2603.18863】Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

Link: https://arxiv.org/abs/2603.18863

Authors: Yana Veitsman, Yihong Liu, Hinrich Schütze

Categories: Computation and Language (cs.CL)

Keywords: assumed to yield, alignment, cross-lingual transfer, task, cross-lingual

Comments:

Abstract:Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques -- despite increasing embedding similarity -- frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why "better" alignment often fails to translate into "better" cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
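
The gradient-orthogonality finding in this abstract rests on comparing the direction of two gradients. A minimal sketch of that comparison follows, assuming the task-loss and alignment-loss gradients have already been flattened into equal-length lists of floats; the function name is illustrative and the paper's exact similarity measure is not stated in the abstract.

```python
import math
from typing import Sequence


def cosine_similarity(g_task: Sequence[float], g_align: Sequence[float]) -> float:
    """Cosine of the angle between two flattened gradient vectors.

    Values near 0 mean the gradients are close to orthogonal, so a step that
    lowers one loss does little for the other; values near 1 mean the two
    objectives pull the parameters in the same direction.
    """
    dot = sum(a * b for a, b in zip(g_task, g_align))
    norm_task = math.sqrt(sum(a * a for a in g_task))
    norm_align = math.sqrt(sum(b * b for b in g_align))
    if norm_task == 0.0 or norm_align == 0.0:
        return 0.0  # convention chosen here for an all-zero gradient
    return dot / (norm_task * norm_align)
```

In practice the two gradients would come from separate backward passes of the task loss and the alignment loss on the same parameters.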

28. 【2603.18859】RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Link: https://arxiv.org/abs/2603.18859

Authors: Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: holds significant promise, Reinforcement learning, large language models, holds significant, external environments

Comments:

Abstract:Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at this https URL.
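
To make the idea of propagating a sparse terminal reward over a state graph concrete, here is a toy sketch. It is a stand-in, not RewardFlow itself: the abstract does not specify the propagation rule, so this example assumes that states are hashable, that trajectories are non-empty lists of states, and that a state's reward is `decay` raised to its shortest hop distance from any successful terminal state. All names and the decay rule are hypothetical.

```python
from collections import defaultdict, deque


def build_predecessors(trajectories):
    """Merge trajectories into one state graph, stored as reverse adjacency."""
    predecessors = defaultdict(set)
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            predecessors[dst].add(src)
    return predecessors


def propagate_rewards(trajectories, successes, decay=0.9):
    """Spread terminal success backwards through the merged state graph."""
    predecessors = build_predecessors(trajectories)
    rewards = {}
    queue = deque()
    # Seed the search with the final state of every successful trajectory.
    for traj, ok in zip(trajectories, successes):
        if ok and traj[-1] not in rewards:
            rewards[traj[-1]] = 1.0
            queue.append(traj[-1])
    # Breadth-first search visits states in order of hop distance to success.
    while queue:
        state = queue.popleft()
        for prev in predecessors[state]:
            if prev not in rewards:
                rewards[prev] = rewards[state] * decay
                queue.append(prev)
    # States from which no success was ever reached receive zero.
    for traj in trajectories:
        for state in traj:
            rewards.setdefault(state, 0.0)
    return rewards
```

Because states shared by several trajectories are merged into one node, a state visited by a failed rollout can still receive credit if another rollout reached success through it.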

29. 【2603.18822】Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Link: https://arxiv.org/abs/2603.18822

Authors: Maria Milkova, Maksim Rudnev

Categories: Computation and Language (cs.CL)

Keywords: noisy Russian language, multi-stage classification framework, million public text, language social media, public text posts

Comments:

Abstract:This study presents a multi-stage classification framework for detecting human values in noisy Russian-language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer-based models capable of predicting the probability of each of the ten basic values. The best-performing model, XLM-RoBERTa-large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.

30. 【2603.18788】Mi:dm K 2.5 Pro

Link: https://arxiv.org/abs/2603.18788

Authors: KT Tech innovation Group

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: simple text generation, evolving LLM landscape, LLM landscape requires, prioritizing multi-step reasoning, landscape requires capabilities

Comments:

Abstract:The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

31. 【2603.18765】Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Link: https://arxiv.org/abs/2603.18765

Authors: Rudra Jadhav, Janhavi Danve, Sonalika Shaw

Categories: Computation and Language (cs.CL)

Keywords: educational settings, concerns about fairness, increasingly deployed, deployed as automated, automated graders

Comments: 7 pages, 5 figures, 2 tables, 11 references

Abstract:As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.

32. 【2603.18756】Are complicated loss functions necessary for teaching LLMs to reason?

Link: https://arxiv.org/abs/2603.18756

Authors: Gabriele Carrino, Andrea Sassella, Nicolo Brunello, Federico Toschi, Mark James Carman

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, group relative advantage, Recent advances, Group Relative, Relative Policy Optimization

Comments:

Abstract:Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group-relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential: training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group-relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
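
The difference between the two objectives named in this abstract can be sketched numerically. The code below is an illustration, not the authors' implementation. It assumes one scalar log-probability per sampled response, advantages normalized by the group mean and standard deviation, and no KL term (the abstract does not say whether RGRA keeps one). Plain floats are used, so it shows loss values only; a real trainer would compute the same expressions on autodiff tensors. The function names are hypothetical.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rgra_loss(logps, rewards):
    """REINFORCE-style surrogate: minus the mean of advantage * log-probability.

    Responses below the group mean get negative advantages, so the loss also
    pushes probability away from them (the negative feedback in finding 1).
    """
    advantages = group_relative_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advantages, logps)) / len(logps)


def grpo_clipped_loss(logps, old_logps, rewards, clip=0.2):
    """PPO-style clipped surrogate on the same advantages, shown for contrast."""
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for a, lp, old_lp in zip(advantages, logps, old_logps):
        ratio = math.exp(lp - old_lp)  # policy ratio that RGRA drops
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        total += min(ratio * a, clipped * a)
    return -total / len(logps)
```

The simplification is visible in the signatures: `rgra_loss` needs neither the old policy's log-probabilities nor a clipping range.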

33. 【2603.18750】Automatic detection of Gen-AI texts: A comparative framework of neural models

Link: https://arxiv.org/abs/2603.18750

Authors: Cristian Buttaro, Irene Amerini

Categories: Computation and Language (cs.CL)

Keywords: raising critical issues, Large Language Models, proliferation of Large, generated text detection, raising critical

Comments:

Abstract:The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI-generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI-generated text detection through the design, implementation, and comparative evaluation of multiple machine-learning-based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, this http URL, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

34. 【2603.18743】Memento-Skills: Let Agents Design Agents

Link: https://arxiv.org/abs/2603.18743

Authors: Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: continually-learnable LLM agent, autonomously constructs, LLM agent system, continually-learnable LLM

Comments: Memento-Skills Technical Report

Abstract:We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 [wang2025memento2]. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at this https URL.

35. 【2603.18736】CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Link: https://arxiv.org/abs/2603.18736

Authors: Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li, Yuan Lu, Xinggao Liu, Haoxuan Li, Zhouchen Lin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: aligning language models, modeling heavily relies, collected from human, human annotators, current reward modeling

Comments:

Abstract:Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which causes it to deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
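
The propensity reweighting mentioned for challenge (2) can be illustrated with the standard inverse propensity scoring (IPS) estimator. This is a generic sketch and not CausalRM's actual loss, which the abstract does not spell out. It assumes binary feedback labels, a scalar reward score per response, and propensities that are already known or estimated; the noise-aware surrogate term for challenge (1) is not modelled. The function names and the `min_propensity` floor are hypothetical.

```python
import math


def bce(score, label):
    """Numerically stable binary cross-entropy of sigmoid(score) against a 0/1 label."""
    return max(score, 0.0) - score * label + math.log1p(math.exp(-abs(score)))


def ips_weighted_loss(scores, labels, observed, propensities, min_propensity=0.05):
    """Inverse-propensity-weighted loss over all responses.

    Only responses that actually received feedback contribute, each weighted
    by 1 / propensity, and the sum is divided by the total number of
    responses. Flooring the propensity bounds the weights at the price of
    some bias.
    """
    total = 0.0
    for score, label, seen, prop in zip(scores, labels, observed, propensities):
        if not seen:
            continue  # no feedback was given, so there is no label to fit
        total += bce(score, label) / max(prop, min_propensity)
    return total / len(scores)
```

A response that users rarely rate (low propensity) gets a large weight when it is rated, which compensates for the many similar responses that went unrated.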

36. 【2603.18688】STEP: Scientific Time-Series Encoder Pretraining via Cross-Domain Distillation

Link: https://arxiv.org/abs/2603.18688

Authors: Chen Zhang,Liwei Liu,Jun Tao,Xiaoyu Yang,Xuenan Xu,Kai Chen,Bowen Zhou,Wen Wu,Chao Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Scientific time series, time series, Scientific time, relevant time series, time series domains

Comments:

Click to view abstract

Abstract:Scientific time series are central to scientific AI but are typically sparse, highly heterogeneous, and limited in scale, making unified representation learning particularly challenging. Meanwhile, foundation models pretrained on relevant time series domains such as audio, general time series, and brain signals contain rich knowledge, but their applicability to scientific signals remains underexplored. In this paper, we investigate the transferability and complementarity of foundation models from relevant time series domains, and study how to effectively leverage them to build a unified encoder for scientific time series. We first systematically evaluate relevant foundation models, showing the effectiveness of knowledge transfer to scientific tasks and their complementary strengths. Based on this observation, we propose STEP, a Scientific Time Series Encoder Pretraining framework via cross domain distillation. STEP introduces adaptive patching to handle extreme-length sequences and a statistics compensation scheme to accommodate diverse numerical scales. It further leverages cross-domain distillation to integrate knowledge from multiple foundation models into a unified encoder. By combining complementary representations across different domains, STEP learns general-purpose and transferable features tailored for scientific signals. Experiments on seven scientific time series tasks demonstrate that STEP provides both an effective structure and an effective pretraining paradigm, taking a STEP toward scientific time series representation learning.

37. 【2603.18683】HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2603.18683

Authors: Zhicong Lu,Zichuan Lin,Wei Jia,Changyuan Tian,Deheng Ye,Peiguang Li,Li Jin,Nayu Liu,Guangluan Xu,Wei Feng

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: complex long-horizon agentic, long-horizon agentic decision-making, tasks remains limited, agentic decision-making tasks, decision-making tasks remains

Comments: Submitted to ACL 2026 on Jan 5, 2026

Click to view abstract

Abstract:While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods concentrate on designing effective reward models (RMs) to advance performance via multi-turn reinforcement learning. However, they suffer from delayed propagation in sparse outcome rewards and unreliable credit assignment with potentially overly fine-grained and unfocused turn-level process rewards. In this paper, we propose HISR, which exploits Hindsight Information to modulate Segmental process Rewards, closely aligning rewards with sub-goals and emphasizing significant segments to enhance the reliability of credit assignment. Specifically, a segment-level process RM is presented to assign rewards for each sub-goal in the task, avoiding excessively granular allocation to turns. To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference of performing a certain action after knowing the trajectory outcome. With this characteristic, we design the ratios of sequence likelihoods between hindsight and policy model to measure action importance. The ratios are subsequently employed to aggregate segment importance scores, which in turn modulate segmental process rewards, enhancing credit assignment reliability. Extensive experimental results on three public benchmarks demonstrate the validity of our method.

38. 【2603.18678】Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models

Link: https://arxiv.org/abs/2603.18678

Authors: Yuchen Su,Shaoxin Zhong,Yonghua Zhu,Ruofan Wang,Zijian Huang,Qiqi Wang,Na Zhao,Diana Benavides-Prado,Michael Witbrock

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: typical linguistic phenomenon, posing unique challenges, natural language understanding, generate humour, posing unique

Comments: The paper is currently under review

Click to view abstract

Abstract:Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Alongside text and images, audio plays a central role in human communication, yet datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored in pun research. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.

39. 【2603.18641】A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Link: https://arxiv.org/abs/2603.18641

Authors: Aram Abrahamyan,Sachin Kumar

Categories: Computation and Language (cs.CL)

Keywords: previously acquired knowledge, Neural language models, language models deployed, forgetting previously acquired, Artificial Neural Network

Comments:

Click to view abstract

Abstract:Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU. In several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
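
For readers unfamiliar with the metrics named in the abstract, the sketch below computes average accuracy and backward transfer using the commonly used definitions from the continual-learning literature. The paper's exact formulas may differ; the accuracy matrix `acc` (where `acc[t][i]` is the accuracy on task `i` after training on task `t`) and the function names are assumptions made for illustration.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    num_tasks = len(acc)
    return sum(acc[num_tasks - 1]) / num_tasks


def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks between the moment each was
    learned (acc[i][i]) and the end of the sequence (acc[T-1][i]).
    Negative values indicate forgetting."""
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    deltas = [acc[num_tasks - 1][i] - acc[i][i] for i in range(num_tasks - 1)]
    return sum(deltas) / (num_tasks - 1)


# Hypothetical 3-task run: rows are training stages, columns are evaluated tasks.
acc = [
    [0.9, 0.0, 0.0],
    [0.7, 0.8, 0.0],
    [0.6, 0.7, 0.9],
]
print(average_accuracy(acc))
print(backward_transfer(acc))
```

A backward transfer near zero means earlier tasks were retained; a mildly positive value means later training actually helped earlier tasks.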

40. 【2603.18637】MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment

Link: https://arxiv.org/abs/2603.18637

Authors: Yipu Dou,Wang Yang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: multi-turn safety alignment, benign boundary queries, Slice-Aware Iterative Curation, multi-turn safety, boundary queries

Comments: 9 pages, 5 figures. Code available at [this https URL](https://github.com/douyipu/mosaic)

Click to view abstract

Abstract:We study how to allocate a fixed supervised fine-tuning budget when three objectives must be balanced at once: multi-turn safety alignment, low over-refusal on benign boundary queries, and instruction following under verifiable constraints. We propose MOSAIC (Multi-Objective Slice-Aware Iterative Curation for Alignment), a multi-objective framework for closed-loop data mixture search built on a unified L1-L3 evaluation interface. MOSAIC turns slice-level failure profiles into executable data actions, including dataset-level mixture ratios, bucket-level weights, and focus criteria. Under a fixed 1M-token budget and five rounds of independent fine-tuning from the same base model, MOSAIC improves internal XGuard from 2.76 to 4.67 while keeping OrBench at 4.41 and IFEval at 3.65. The final Pareto solution also generalizes better than a random static LoRA baseline on independent attack, over-refusal, and capability tests, suggesting that structured failure diagnosis can serve as a practical control signal for budgeted data construction. Code is available at this https URL.

41. 【2603.18620】Learning to Self-Evolve

Link: https://arxiv.org/abs/2603.18620

Authors: Xiaoyin Chen,Canwen Xu,Yite Wang,Boyi Liu,Zhewei Yao,Yuxiong He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reinforcement learning framework, trains large language, introduce Learning, reinforcement learning, learning framework

Comments:

Click to view abstract

Abstract:We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

42. 【2603.18612】DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Link: https://arxiv.org/abs/2603.18612

Authors: Maxime Poli,Manel Khentout,Angelo Ortiz Tandazo,Ewan Dunbar,Emmanuel Chemla,Emmanuel Dupoux

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: evaluating unsupervised phoneme, unsupervised phoneme discovery, benchmark for evaluating, evaluating unsupervised, introduce DiscoPhon

Comments: 6 pages, 2 figures. Submitted to Interspeech 2026

Click to view abstract

Abstract:We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.

43. 【2603.18611】Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Link: https://arxiv.org/abs/2603.18611

Authors: Thi Huyen Nguyen,Koustav Rudra,Wolfgang Nejdl

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: social media data, media data dissemination, data dissemination enable, Advances in social, social media

Comments: Accepted at WWW 2026

Click to view abstract

Abstract:Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

44. 【2603.18597】myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Link: https://arxiv.org/abs/2603.18597

Authors: Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Convolutional Neural Network, Gated Recurrent Unit, Burmese handwritten digit, publicly available Burmese, Burmese handwritten

Comments: 7 pages, 2 figures, 3 tables, Accepted to ICNLP 2026, Xi'an, China

Click to view abstract

Abstract:We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN's strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.

45. 【2603.18593】Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Link: https://arxiv.org/abs/2603.18593

Authors: Yusuke Takase,Momose Oyama,Hidetoshi Shimodaira

Categories: Computation and Language (cs.CL)

Keywords: conditional distributions, represents language models, prompt-response pairs, pairs and constructs, constructs model maps

Comments:

Click to view abstract

Abstract:We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.

46. 【2603.18579】ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Link: https://arxiv.org/abs/2603.18579

Authors: Abhinaba Basu,Pavan Chakraborty

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: model reasoning remains, explanations faithfully reflect, open problem, faithfully reflect, reflect a model

Comments:

Click to view abstract

Abstract:Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
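
The idea of comparing an explanation against matched random baselines can be sketched as follows. This is a generic randomization-test illustration, not the ICE implementation: the faithfulness scores are assumed to be precomputed by some intervention operator, and the function names are hypothetical.

```python
def win_rate(explanation_score, baseline_scores):
    """Fraction of random baselines that the explanation strictly beats."""
    if not baseline_scores:
        raise ValueError("need at least one baseline score")
    wins = sum(1 for b in baseline_scores if explanation_score > b)
    return wins / len(baseline_scores)


def randomization_p_value(explanation_score, baseline_scores):
    """One-sided Monte Carlo p-value with the usual +1 correction:
    how often a random baseline scores at least as high as the explanation."""
    if not baseline_scores:
        raise ValueError("need at least one baseline score")
    at_least_as_high = sum(1 for b in baseline_scores if b >= explanation_score)
    return (1 + at_least_as_high) / (1 + len(baseline_scores))


# Hypothetical scores: one explanation versus four matched random baselines.
baselines = [0.1, 0.5, 0.9, 0.3]
print(win_rate(0.8, baselines))
print(randomization_p_value(0.8, baselines))
```

Under this reading, a win rate well below 0.5 would correspond to the anti-faithful case, where random token selections do better than the explanation.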

47. 【2603.18567】SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding

Link: https://arxiv.org/abs/2603.18567

Authors: Shenggui Li,Chao Wang,Yikai Zhu,Yubo Wang,Fan Yin,Shuai Shi,Yefei Chen,Xiaomin Dong,Qiaoling Chen,Jin Pan,Ji Li,Laixin Xie,Yineng Zhang,Lei Yu,Yonggang Wen,Ivor Tsang,Tianwei Zhang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, sequential autoregressive decoding, language models incur, models incur high

Comments:

Click to view abstract

Abstract:Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.
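
The draft-then-verify loop that the abstract describes can be illustrated with a toy greedy variant of speculative decoding. This sketch is unrelated to the SpecForge or EAGLE-3 code: the models are stand-in Python callables (`draft_next`, `target_next`) that map a token list to the next token, and verification is written as a loop even though a real system would score all proposed positions in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative-decoding step.

    The draft model proposes k tokens; the target model checks them in order.
    Matching tokens are accepted, the first mismatch is replaced by the
    target's own token, and if everything matches the target adds one more.
    """
    # 1) Draft phase: propose k tokens autoregressively with the cheap model.
    proposal = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2) Verify phase: compare each proposal with the target's greedy choice.
    accepted = []
    context = list(prefix)
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(token)
        context.append(token)

    accepted.append(target_next(context))  # all accepted: one bonus token
    return accepted


# Toy models: the target counts upward mod 10; the draft errs after token 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
print(speculative_step([0], draft, target, k=4))  # [1, 2, 3, 4]
```

In this greedy form the output always equals what the target model alone would have produced, so the draft model affects only speed, not the generated text.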

48. 【2603.18557】Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Link: https://arxiv.org/abs/2603.18557

Authors: Ivaxi Sheth,Zeno Jonke,Amin Mantrach,Saab Mansour

Categories: Computation and Language (cs.CL)

Keywords: diverse real-world applications, extending automated evaluation, real-world applications, extending automated, critical challenge

Comments: 19 pages

Click to view abstract

Abstract:As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

49. 【2603.18533】Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

Link: https://arxiv.org/abs/2603.18533

Authors: Yinan Xia,Haotian Zhang,Huiming Wang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Reasoning Models, Large Reasoning, generating excessively long, shown exceptional reasoning

Comments: 13 pages

Click to view abstract

Abstract:Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at this https URL.

Information Retrieval

1. 【2603.19225】FinTradeBench: A Financial Reasoning Benchmark for LLMs

Link: https://arxiv.org/abs/2603.19225

Authors: Yogesh Agrawal,Aniruddha Dutta,Md Mahadi Hasan,Santu Karmaker,Aritra Dutta(University of Central Florida)

Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)

Keywords: Real-world financial decision-making, Large Language Models, Real-world financial, price dynamics, trading signals computed

Comments: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication

Click to view abstract

2. 【2603.18898】Comparative Analysis of Large Language Models in Generating Telugu Responses for Maternal Health Queries

Link: https://arxiv.org/abs/2603.18898

Authors: Anagani Bhanusree,Sai Divya Vissamsetty,K VenkataKrishna Rao,Rimjhim

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, Language Models, progressively exhibiting, exhibiting their capabilities

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have been progressively exhibiting their capabilities in various areas of research. The performance of LLMs in acute maternal healthcare, particularly in low-resource languages such as Telugu, Hindi, Tamil, and Urdu, is still unstudied. This study examines how ChatGPT-4o, GeminiAI, and Perplexity AI respond to pregnancy-related questions asked in different languages. A bilingual dataset is used to obtain results by applying a semantic similarity metric (BERTScore) and assessments from expert gynecologists. The gynecologists rate the responses generated by the LLMs on multiple parameters: accuracy, fluency, relevance, coherence, and completeness. Gemini outperforms the other LLMs in producing accurate and coherent pregnancy-related responses in Telugu, while Perplexity performed well when the prompts were in Telugu. ChatGPT's performance can be improved. The results indicate that both the choice of LLM and the prompting language play a crucial role in retrieving information. Altogether, we emphasize the need to improve LLM assistance in regional languages for healthcare purposes.

3. 【2603.18652】Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Link: https://arxiv.org/abs/2603.18652

Authors: Pius Horn,Janis Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Reliably extracting tables, knowledge base construction, Reliably extracting, capture semantic equivalence, existing evaluation approaches

Comments: Submitted to ICDAR 2026

Click to view abstract

Abstract:Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: this https URL Metric study and human evaluation: this https URL

4. 【2603.18573】Interplay: Training Independent Simulators for Reference-Free Conversational Recommendation

Link: https://arxiv.org/abs/2603.18573

Authors: Jerome Ramos,Feng Xia,Xi Wang,Shubham Chatterjee,Xiao Fu,Hossein A. Rahmani,Aldo Lipani

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: requires extensive dialogue, Training conversational recommender, Training conversational, requires extensive, collect at scale

Comments: Accepted at ECIR 2026

Click to view abstract

Abstract:Training conversational recommender systems (CRS) requires extensive dialogue data, which is challenging to collect at scale. To address this, researchers have used simulated user-recommender conversations. Traditional simulation approaches often utilize a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. We propose a reference-free simulation framework that trains two independent LLMs, one as the user and one as the conversational recommender. These models interact in real time without access to predetermined target items; instead, they are given preference summaries and target attributes, enabling the recommender to genuinely infer user preferences through dialogue. This approach produces more realistic and diverse conversations that closely mirror authentic human-AI interactions. Our reference-free simulators match or exceed existing methods in quality, while offering a scalable solution for generating high-quality conversational recommendation data without constraining conversations to pre-defined target items. We conduct both quantitative and human evaluations to confirm the effectiveness of our reference-free approach.

5. 【2603.18556】Latent Factor Modeling with Expert Network for Multi-Behavior Recommendation

Link: https://arxiv.org/abs/2603.18556

Authors: Mingshi Yan,Zhiyong Cheng,Yahong Han,Meng Wang

Categories: Information Retrieval (cs.IR)

Keywords: Traditional recommendation methods, Traditional recommendation, severe data sparsity, face severe data, single user behavior

Comments:

Click to view abstract

Abstract:Traditional recommendation methods, which typically focus on modeling a single user behavior (e.g., purchase), often face severe data sparsity issues. Multi-behavior recommendation methods offer a promising solution by leveraging user data from diverse behaviors. However, most existing approaches entangle multiple behavioral factors, learning holistic but imprecise representations that fail to capture specific user intents. To address this issue, we propose a multi-behavior method by modeling latent factors with an expert network (MBLFE). In our approach, we design a gating expert network, where the expert network models all latent factors within the entire recommendation scenario, with each expert specializing in a specific latent factor. The gating network dynamically selects the optimal combination of experts for each user, enabling a more accurate representation of user preferences. To ensure independence among experts and factor consistency of a particular expert, we incorporate self-supervised learning during the training process. Furthermore, we enrich embeddings with multi-behavior data to provide the expert network with more comprehensive collaborative information for factor extraction. Extensive experiments on three real-world datasets demonstrate that our method significantly outperforms state-of-the-art baselines, validating its effectiveness.
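The gating idea in this abstract can be illustrated with a minimal sketch. It is not the authors' MBLFE implementation: the linear experts, the linear gate, the top-k selection rule, and the names `gated_expert_representation`, `experts`, and `gate` are all assumptions chosen for illustration, and the self-supervised objectives are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def gated_expert_representation(user_emb, experts, gate, top_k=2):
    """Combine the top_k experts selected by a gate for one user.

    experts: list of matrices, one hypothetical linear expert per latent factor.
    gate:    matrix with one row per expert, producing one logit per expert.
    """
    logits = matvec(gate, user_emb)
    # Keep only the top_k highest-scoring experts for this user.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])
    out = [0.0] * len(experts[0])
    for w, i in zip(weights, chosen):
        expert_out = matvec(experts[i], user_emb)
        out = [o + w * e for o, e in zip(out, expert_out)]
    return out, dict(zip(chosen, weights))


# Toy usage: three 2x2 experts and a gate that prefers expert 1 for this user.
user = [1.0, 0.0]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
gate = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
representation, expert_weights = gated_expert_representation(user, experts, gate)
```

Restricting the softmax to the selected experts keeps the mixture weights summing to one while letting different users activate different latent factors.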

6. 【2603.18516】Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents

Link: https://arxiv.org/abs/2603.18516

Authors: Mahta Rafiee,Heydar Soudani,Zahra Abbasiantaeb,Mohammad Aliannejadi,Faegheh Hasibi,Hamed Zamani

Categories: Information Retrieval (cs.IR)

Keywords: perform multi-step information, multi-step information seeking, Deep research agents, LLM-based systems designed, answer complex questions

Comments: 7 pages, 4 figures

Click to view abstract

Abstract:Deep research agents have emerged as LLM-based systems designed to perform multi-step information seeking and reasoning over large, open-domain sources to answer complex questions by synthesizing information from multiple information sources. Given the complexity of the task and despite various recent efforts, evaluation of deep research agents remains fundamentally challenging. This paper identifies a list of requirements and optional properties for evaluating deep research agents. We observe that existing benchmarks do not satisfy all identified requirements. Inspired by prior research on TREC Total Recall Tracks, we introduce the task of Total Recall Question Answering and develop a framework for deep research agents evaluation that satisfies the identified criteria. Our framework constructs single-answer, total recall queries with precise evaluation and relevance judgments derived from a structured knowledge base paired with a text corpus, enabling large-scale data construction. Using this framework, we build TRQA, a deep research benchmark constructed from Wikidata-Wikipedia as a real-world source and a synthetically generated e-commerce knowledge base and corpus to mitigate the effects of data contamination. We benchmark the collection with representative retriever and deep research models and establish baseline retrieval and end-to-end results for future comparative evaluation.

7. 【2603.18459】HypeMed: Enhancing Medication Recommendations with Hypergraph-Based Patient Relationships

Link: https://arxiv.org/abs/2603.18459

Authors: Xiangxu Zhang,Xiao Zhou,Hongteng Xu,Jianxun Lian

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: effective medication sets, health records, Medication recommendations aim, aim to generate, generate safe

Comments: Accepted by TOIS

Click to view abstract

Abstract:Medication recommendations aim to generate safe and effective medication sets from health records. However, accurately recommending medications hinges on inferring a patient's latent clinical condition from sparse and noisy observations, which requires both (i) preserving the visit-level combinatorial semantics of co-occurring entities and (ii) leveraging informative historical references through effective, visit-conditioned retrieval. Most existing methods fall short in one or both aspects: graph-based modeling often fragments higher-order intra-visit patterns into pairwise relations, while inter-visit augmentation methods commonly exhibit an imbalance between learning a globally stable representation space and performing dynamic retrieval within it. To address these limitations, this paper proposes HypeMed, a two-stage hypergraph-based framework unifying intra-visit coherence modeling and inter-visit augmentation. HypeMed consists of two core modules: MedRep for representation pre-training, and SimMR for similarity-enhanced recommendation. In the first stage, MedRep encodes clinical visits as hyperedges via knowledge-aware contrastive pre-training, creating a globally consistent, retrieval-friendly embedding space. In the second stage, SimMR performs dynamic retrieval within this space, fusing retrieved references with the patient's longitudinal data to refine medication prediction. Evaluation on real-world benchmarks shows that HypeMed outperforms state-of-the-art baselines in both recommendation precision and DDI reduction, simultaneously enhancing the effectiveness and safety of clinical decision support.

8. 【2603.18447】SODIUM: From Open Web Data to Queryable Databases

Link: https://arxiv.org/abs/2603.18447

Authors: Chuxuan Hu,Philip Li,Maxwell Yang,Daniel Kang

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: analytical questions, questions whose answers, wide range, answers require integrating, require integrating data

Comments:

Click to view abstract


9. 【2603.18420】From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

Link: https://arxiv.org/abs/2603.18420

Authors: Jason Dury

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: semantic content, Project Gutenberg texts, Project Gutenberg, clusters, Abstract

Comments: 22 pages, 5 figures. Code and demo: [this https URL](https://github.com/EridosAI/PAM-Concept-Discovery)

Click to view abstract

10. 【2603.18300】Auditing Preferences for Brands and Cultures in LLMs

Link: https://arxiv.org/abs/2603.18300

Authors: Jasmine Rienecker,Katarina Mpofu,Naman Goel,Siddhartha Datta,Jun Zhao,Oscar Danielsson,Fredrik Thorsen

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: systems increasingly mediate, Large language models, choose and buy, based AI systems, Large language

Comments: 20 pages, 2 figures

Click to view abstract

Abstract:AI systems based on large language models (LLMs) increasingly mediate what billions of people see, choose and buy. This creates an urgent need to quantify the systemic risks of LLM-driven market intermediation, including its implications for market fairness, competition, and the diversity of information exposure. This paper introduces ChoiceEval, a reproducible framework for auditing preferences for brands and cultures in LLMs under realistic usage conditions. ChoiceEval addresses two core technical challenges: (i) generating realistic, persona-diverse evaluation queries and (ii) converting free-form outputs into comparable choice sets and quantitative preference metrics. For a given topic (e.g., running shoes, hotel chains, travel destinations), the framework segments users into psychographic profiles (e.g., budget-conscious, wellness-focused, convenience), and then derives diverse prompts that reflect real-world advice-seeking and decision-making behaviour. LLM responses are converted into normalised top-k choice sets. Preference and geographic bias are then quantified using comparable metrics across topics and personas. Thus, ChoiceEval provides a scalable audit pipeline for researchers, platforms, and regulators, linking model behaviour to real-world economic outcomes. Applied to Gemini, GPT, and DeepSeek across 10 topics spanning commerce and culture and more than 2,000 questions, ChoiceEval reveals consistent preferences: U.S.-developed models Gemini and GPT show marked favouritism toward American entities, while China-developed DeepSeek exhibits more balanced yet still detectable geographic preferences. These patterns persist across user personas, suggesting systematic rather than incidental effects.

ACM classes: I.2.7; I.2.6; I.2.8; H.3.3; K.4.1; K.4.4
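The ChoiceEval abstract describes converting free-form LLM outputs into normalised top-k choice sets and then quantifying geographic preference. A rough sketch of that kind of pipeline follows. The alias table, the origin table, and the share metric are hypothetical stand-ins, since the abstract does not specify the actual normalisation rules or metrics.

```python
from collections import Counter

# Hypothetical lookup tables; a real audit would need curated entity data.
ALIASES = {
    "nike": "nike",
    "nike inc": "nike",
    "adidas": "adidas",
    "adidas ag": "adidas",
    "li-ning": "li-ning",
    "li ning": "li-ning",
}
ORIGIN = {"nike": "US", "adidas": "DE", "li-ning": "CN"}


def top_k_choice_set(ranked_mentions, k=3):
    """Map raw mentions to canonical names, drop unknowns and repeats, keep k."""
    seen, choices = set(), []
    for mention in ranked_mentions:
        name = ALIASES.get(mention.strip().lower())
        if name is None or name in seen:
            continue
        seen.add(name)
        choices.append(name)
        if len(choices) == k:
            break
    return choices


def origin_share(choice_sets):
    """Fraction of all chosen entities that come from each country of origin."""
    counts = Counter(ORIGIN[c] for choices in choice_sets for c in choices)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {origin: n / total for origin, n in counts.items()}


# Toy usage: two responses, each already parsed into an ordered list of mentions.
responses = [["Nike Inc", "nike", "Adidas AG", "Unknown Brand"], ["Nike", "Li Ning"]]
choice_sets = [top_k_choice_set(r, k=2) for r in responses]
shares = origin_share(choice_sets)
```

Comparing such shares across personas and topics is one simple way to check whether a preference is systematic rather than incidental.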

11. 【2603.18074】Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

Link: https://arxiv.org/abs/2603.18074

Authors: Yi Yu,Junzhuo Ma,Chenghuang Shen,Xingyan Liu,Jing Gu,Hangyi Sun,Guangquan Hu,Jianfeng Liu,Weiting Liu,Mingyue Pu,Yu Wang,Zhengdong Xiao,Rui Xie,Longjiu Luo,Qianrong Wang,Gurong Cui,Honglin Qiao,Wenlian Lu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP)

Keywords: Adapting Large Language, Large Language Models, Adapting Large, Large Language, Language Models

Comments:

Click to view abstract

Abstract:Adapting Large Language Models in complex technical service domains is constrained by the absence of explicit cognitive chains in human demonstrations and the inherent ambiguity arising from the diversity of valid responses. These limitations severely hinder agents from internalizing latent decision dynamics and generalizing effectively. Moreover, practical adaptation is often impeded by the prohibitive resource and time costs associated with standard training paradigms. To overcome these challenges and guarantee computational efficiency, we propose a lightweight adaptation framework comprising three key contributions. (1) Latent Logic Augmentation: We introduce Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation to bridge the gap between surface-level supervision and latent decision logic. These approaches strengthen the stability of Supervised Fine-Tuning alignment. (2) Robust Noise Reduction: We construct a Multiple Ground Truths dataset through a dual-filtering method to reduce the noise by validating diverse responses, thereby capturing the semantic diversity. (3) Lightweight Adaptation: We design a Hybrid Reward mechanism that fuses an LLM-based judge with a lightweight relevance-based Reranker to distill high-fidelity reward signals while reducing the computational cost compared to standard LLM-as-a-Judge reinforcement learning. Empirical evaluations on real-world Cloud service tasks, conducted across semantically diverse settings, demonstrate that our framework achieves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction. Concurrently, our Hybrid Reward mechanism achieves alignment comparable to standard LLM-as-a-judge methods with reduced training time, underscoring the practical value for deploying technical service agents.

12. 【2603.18012】DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2603.18012

Authors: Penghao Liang,Mengwei Yuan,Jianan Liu,Jing Yang,Xianyou Li,Weiran Yan,Yichao Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: dynamic knowledge integration, retrieval-augmented generation, framework designed, knowledge integration, designed to handle

Comments:

Click to view abstract

13. 【2603.18011】Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating

Link: https://arxiv.org/abs/2603.18011

Authors: Victor P. Unda

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: question-answering systems convert, modern AI question-answering, vectors and retrieve, retrieve the closest, closest matches

Comments: 21 pages, 1 figure, 4 tables

Click to view abstract

MSC classes: 68T50

ACM classes: I.2.7; H.3.3

14. 【2603.18005】Negative Sampling Techniques in Information Retrieval: A Survey

Link: https://arxiv.org/abs/2603.18005

Authors: Laurin Wischounig,Abdelrahman Abdallah,Adam Jatowt

Categories: Information Retrieval (cs.IR)

Keywords: Information Retrieval, modern NLP applications, modern NLP, NLP applications, Large Language Model

Comments: Accepted at Findings of EACL 2026

Click to view abstract

Abstract:Information Retrieval (IR) is fundamental to many modern NLP applications. The rise of dense retrieval (DR), using neural networks to learn semantic vector representations, has significantly advanced IR performance. Central to training effective dense retrievers through contrastive learning is the selection of informative negative samples. Synthesizing 35 seminal papers, this survey provides a comprehensive and up-to-date overview of negative sampling techniques in dense IR. Our unique contribution is the focus on modern NLP applications and the inclusion of recent Large Language Model (LLM)-driven methods, an area absent in prior reviews. We propose a taxonomy that categorizes techniques including random, static/dynamically mined, and synthetic datasets. We then analyze these approaches with respect to trade-offs between effectiveness, computational cost, and implementation difficulty. The survey concludes by outlining current challenges and promising future directions for the use of LLM-generated synthetic data.
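To make the role of negative samples in contrastive training concrete, here is a small sketch contrasting random negatives with statically mined hard negatives under an InfoNCE-style loss. It is a generic illustration rather than code from the survey: the dot-product scorer, the toy vectors, and all function names are assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_negatives(doc_ids, positive_id, n, rng):
    """Uniformly sample n documents other than the positive."""
    pool = [d for d in doc_ids if d != positive_id]
    return rng.sample(pool, n)


def hard_negatives(query_vec, doc_vecs, positive_id, n):
    """Pick the n highest-scoring non-positive documents for this query."""
    pool = [d for d in doc_vecs if d != positive_id]
    return sorted(pool, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)[:n]


def info_nce(query_vec, pos_vec, neg_vecs, temperature=1.0):
    """Negative log-softmax of the positive against the sampled negatives."""
    logits = [dot(query_vec, pos_vec) / temperature]
    logits += [dot(query_vec, v) / temperature for v in neg_vecs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


# Toy corpus: d2 is a near-duplicate of the positive d1, d4 points the other way.
docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [-1.0, 0.0]}
query = [1.0, 0.0]
hard = hard_negatives(query, docs, "d1", 1)
loss_hard = info_nce(query, docs["d1"], [docs[d] for d in hard])
loss_easy = info_nce(query, docs["d1"], [docs["d4"]])
sampled = random_negatives(list(docs), "d1", 2, random.Random(0))
```

The hard negative yields a larger loss than the easy one, which is why mined negatives tend to give a stronger training signal, at the cost of an extra retrieval pass and a higher risk of picking unlabeled positives.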

15. 【2602.11322】Predictive Associative Memory: Retrieval Beyond Similarity Through Temporal Co-occurrence

Link: https://arxiv.org/abs/2602.11322

Authors: Jason Dury

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)

Keywords: Current approaches, neural systems rely, representationally similar stored, Predictive Associative Memory, rely on similarity-based

Comments: 20 pages, 6 figures, for associated Git: [this https URL](https://github.com/EridosAI/PAM-Benchmark)

Click to view abstract

Abstract:Current approaches to memory in neural systems rely on similarity-based retrieval: given a query, find the most representationally similar stored state. This assumption -- that useful memories are similar memories -- fails to capture a fundamental property of biological memory: association through temporal co-occurrence. We propose Predictive Associative Memory (PAM), an architecture in which a JEPA-style predictor, trained on temporal co-occurrence within a continuous experience stream, learns to navigate the associative structure of an embedding space. We introduce an Inward JEPA that operates over stored experience (predicting associatively reachable past states) as the complement to the standard Outward JEPA that operates over incoming sensory data (predicting future states). We evaluate PAM as an associative recall system -- testing faithfulness of recall for experienced associations -- rather than as a retrieval system evaluated on generalisation to unseen associations. On a synthetic benchmark, the predictor's top retrieval is a true temporal associate 97% of the time (Association Precision@1 = 0.970); it achieves cross-boundary Recall@20 = 0.421 where cosine similarity scores zero; and it separates experienced-together from never-experienced-together states with a discrimination AUC of 0.916 (cosine: 0.789). Even restricted to cross-room pairs where embedding similarity is uninformative, the predictor achieves AUC = 0.849 (cosine: 0.503, chance). A temporal shuffle control confirms the signal is genuine temporal co-occurrence structure, not embedding geometry: shuffling collapses cross-boundary recall by 90%, replicated across training seeds. All results are stable across seeds (SD 0.006) and query selections (SD $\leq$ 0.012).
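For readers unfamiliar with the metrics quoted above, the following sketch shows the generic definitions of Precision@1 and Recall@k over a set of queries. The paper's exact protocol (for example, how cross-boundary pairs are selected) is not given in the abstract, so the data layout and function names here are assumptions.

```python
def precision_at_1(rankings, associates):
    """Share of queries whose top-ranked item is a true associate."""
    hits = sum(
        1 for query, ranked in rankings.items()
        if ranked and ranked[0] in associates[query]
    )
    return hits / len(rankings)


def recall_at_k(rankings, associates, k):
    """Mean fraction of each query's true associates found in its top k."""
    per_query = [
        len(set(ranked[:k]) & associates[query]) / len(associates[query])
        for query, ranked in rankings.items()
        if associates[query]
    ]
    return sum(per_query) / len(per_query)


# Toy usage: rankings map each query state to retrieved states, best first.
rankings = {"a": ["b", "c", "d"], "x": ["z", "y"]}
associates = {"a": {"b", "d"}, "x": {"y"}}
p1 = precision_at_1(rankings, associates)
r2 = recall_at_k(rankings, associates, k=2)
```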

Computer Vision

1. 【2603.19235】Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Link: https://arxiv.org/abs/2603.19235

Authors: Xianjin Wu,Dingkang Liang,Tianrui Feng,Kui Xia,Yumeng Zhang,Xiaofan Li,Xiao Tan,Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, impressive semantic capabilities

Comments: 31 pages, 12 figures

Click to view abstract

Abstract:While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at this https URL.

2. 【2603.19234】Matryoshka Gaussian Splatting

Link: https://arxiv.org/abs/2603.19234

Authors: Zhilin Guo,Boqiao Zhang,Hakan Aktas,Kyle Fogarty,Jeffrey Hu,Nursena Koprucu Aslan,Wenzhao Li,Canberk Baykal,Albert Miao,Josef Bengtson,Chenliang Zhou,Weihao Xia,Cristina Nader Vasconcelos,Cengiz Oztireli

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Matryoshka Gaussian Splatting, Gaussian Splatting, level of detail, ability to render, render scenes

Comments: project page: [this https URL](https://zhilinguo.github.io/MGS)

Click to view abstract

Abstract:The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
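The stochastic budget training step described in this abstract can be sketched on a toy problem. This is only an illustration of the sampling-and-two-pass structure: the additive one-dimensional "renderer", the MSE loss, the equal weighting of the two terms, and all names are assumptions, and real 3DGS rendering uses alpha compositing of projected Gaussians.

```python
import random


def render(splats, k):
    """Toy renderer: additively blend the first k splats into a 1D 'image'."""
    width = len(splats[0])
    return [sum(s[p] for s in splats[:k]) for p in range(width)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def stochastic_budget_loss(splats, target, rng, min_budget=1):
    """One training-step loss: a random prefix plus the full ordered set."""
    k = rng.randint(min_budget, len(splats))  # sampled splat budget
    prefix_loss = mse(render(splats, k), target)          # forward pass 1
    full_loss = mse(render(splats, len(splats)), target)  # forward pass 2
    return k, prefix_loss + full_loss


# Toy usage: three ordered splats over a two-pixel image.
splats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
target = [1.0, 1.0]
budget, loss = stochastic_budget_loss(splats, target, random.Random(0))
```

Because every prefix length is sampled over training, earlier splats in the ordering are pushed to carry the most important content, which is what allows any prefix to serve as a lower level of detail.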

3. 【2603.19232】Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Link: https://arxiv.org/abs/2603.19232

Authors: Yuqing Wang,Chuofan Ma,Zhijie Lin,Yao Teng,Lijun Yu,Shuai Wang,Jiaming Han,Jiashi Feng,Yi Jiang,Xihui Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant attention, prediction paradigm shared, token prediction paradigm, promising seamless multimodal, Visual generation

Comments: Accepted by CVPR 2026 main track; Code: [this https URL](https://github.com/YuqingWang1029/CubiD)

Click to view abstract

Abstract:Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: this https URL.
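The fine-grained masking described here, where any dimension at any spatial position can be masked, can be sketched as sampling mask coordinates over the full $h \times w \times d$ grid instead of over whole spatial positions. The uniform sampling, the `MASK` id, and the function name are assumptions for illustration; the abstract does not describe CubiD's actual masking schedule.

```python
import random

MASK = -1  # hypothetical mask token id, distinct from any real code index


def mask_cubic(tokens, ratio, rng):
    """Mask individual (row, col, dim) entries of an h x w x d token grid."""
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, c) for i in range(h) for j in range(w) for c in range(d)]
    n_mask = round(ratio * len(coords))
    masked_coords = set(rng.sample(coords, n_mask))
    masked = [
        [
            [MASK if (i, j, c) in masked_coords else tokens[i][j][c] for c in range(d)]
            for j in range(w)
        ]
        for i in range(h)
    ]
    return masked, masked_coords


# Toy usage: a 2 x 2 grid of positions, each holding 4 discrete codes.
tokens = [[[i * 8 + j * 4 + c for c in range(4)] for j in range(2)] for i in range(2)]
masked, masked_coords = mask_cubic(tokens, ratio=0.5, rng=random.Random(0))
```

A model trained to fill in such masks sees partially observed positions, so it can learn dependencies both across dimensions within a position and across positions.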

4. 【2603.19231】MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Link: https://arxiv.org/abs/2603.19231

Authors: Haitian Li,Haozhe Xie,Junxiang Xu,Beichen Wen,Fangzhou Hong,Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires jointly inferring, jointly inferring object, limited visual evidence, image requires jointly, inferring object geometry

Comments: Project page: [this https URL](https://lihaitian.com/MonoArt)

Click to view abstract

Abstract:Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

5. 【2603.19229】NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Link: https://arxiv.org/abs/2603.19229

Authors: Huaide Jiang,Yash Chaudhary,Yuping Wang,Zehao Wang,Raghav Sharma,Manan Mehta,Yang Zhou,Lichao Sun,Zhiwen Fan,Zhengzhong Tu,Jiachen Li

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)

Keywords: natural language instructions, agents navigate, target object, major categories, natural language

Comments: Project Website: [this https URL](https://navtrust.github.io)

Click to view abstract

Abstract:There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: this https URL.

6. 【2603.19228】SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Link: https://arxiv.org/abs/2603.19228

Authors: Xinyao Zhang,Wenkai Dong,Yuxin Song,Bo Fang,Qi Zhang,Jing Wang,Fan Chen,Hui Zhang,Haocheng Feng,Yu Lu,Hang Zhou,Chun Yuan,Jingdong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current instruction-guided video, simultaneously balance precise, faithful motion preservation, Current instruction-guided, balance precise semantic

Comments: 24 pages, 12 figures

Click to view abstract

Abstract:Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

7. 【2603.19227】Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Link: https://arxiv.org/abs/2603.19227

Authors: Chenyang Gu,Mingyuan Zhang,Haozhe Xie,Zhongang Cai,Lei Yang,Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: continuous diffusion models, discrete token-based generators, models that excel, token-based generators, motion generation largely

Comments: Project Page: [this https URL](https://rheallyc.github.io/projects/motok) GitHub: [this https URL](https://github.com/rheallyc/MoTok)

Click to view abstract

Abstract:Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.

8. 【2603.19226】Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Link: https://arxiv.org/abs/2603.19226

Authors: Nobuo Yoshii,Xinran Nicole Han,Ryo Kawahara,Todd Zickler,Ko Nishino

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-Object Generative Perception, generative inverse rendering, introduce Multi-Object Generative, Generative Perception, inverse rendering method

Comments:

Click to view abstract

Abstract:We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- reflectance, texture, and illumination -- underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

9. 【2603.19224】EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Link: https://arxiv.org/abs/2603.19224

Authors: Yang Fu,Yike Zheng,Ziyun Dai,Henghui Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video object removal, restoring seamless backgrounds, object removal aims, object removal, Video object

Comments: CVPR 2026, Project Page: [this https URL](https://henghuiding.com/EffectErase/)

Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. It also includes an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

10. 【2603.19222】Spectrally-Guided Diffusion Noise Schedules

Link: https://arxiv.org/abs/2603.19222

Authors: Carlos Esteves,Ameesh Makadia

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Denoising diffusion models, Denoising diffusion, noise schedules, video generation, noise

Comments:

Abstract: Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.

11. 【2603.19219】DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Link: https://arxiv.org/abs/2603.19219

Authors: Dong Zhuo,Wenzhao Zheng,Sicheng Zuo,Siming Yan,Lu Hou,Jie Zhou,Jiwen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: autonomous driving systems, scene tokens, growing adoption, scene, driving systems

Comments: Project Page: [this https URL](https://paryi555.github.io/DriveTok/) Code: [this https URL](https://github.com/paryi555/DriveTok)

Abstract:With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

12. 【2603.19218】Rethinking Vector Field Learning for Generative Segmentation

Link: https://arxiv.org/abs/2603.19218

Authors: Chaoyang Wang,Yaobo Liang,Boci Peng,Fan Duan,Jingdong Wang,Yunhai Tong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted increasing attention, Taming diffusion models, increasing attention, Taming diffusion, attracted increasing

Comments:

Abstract:Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
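The abstract above mentions a quasi-random category encoding inspired by Kronecker sequences but does not spell out the construction. As a rough illustration only, a Kronecker (additive recurrence) sequence maps index n to the fractional parts of n·α for irrational α. The function name `kronecker_codes`, the choice of square roots of primes for α, and the 3-dimensional code size are all assumptions made for this sketch, not details from the paper.

```python
import math


def kronecker_codes(num_classes: int, dim: int = 3) -> list[list[float]]:
    """Assign each class a point in [0, 1)^dim from a Kronecker sequence.

    Point n has coordinates frac(n * alpha_k). The alphas below (fractional
    parts of square roots of small primes) are an illustrative assumption.
    """
    primes = [2, 3, 5, 7, 11, 13, 17, 19]
    if dim > len(primes):
        raise ValueError("this sketch supports at most 8 dimensions")
    alphas = [math.sqrt(p) % 1.0 for p in primes[:dim]]
    # Start at n = 1 so that no class is mapped to the origin.
    return [[(n * a) % 1.0 for a in alphas] for n in range(1, num_classes + 1)]


if __name__ == "__main__":
    for class_id, code in enumerate(kronecker_codes(4)):
        print(class_id, [round(v, 3) for v in code])
```

Because each code is computed directly from the class index, such an encoding needs no lookup table and can be extended to more classes without recomputing earlier codes, which is one plausible reading of "computationally efficient" here.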

13. 【2603.19217】LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Link: https://arxiv.org/abs/2603.19217

Authors: Keda Tao,Yuhua Zheng,Jia Xu,Wenjie Du,Kele Shao,Hesong Wang,Xueyi Chen,Xin Jin,Junhan Zhu,Bohan Yu,Weiqiang Wang,Jian Liu,Can Qin,Yulun Zhang,Ming-Hsuan Yang,Huan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: omnimodal large language, Recent advancements, large language models, advancements in omnimodal, omnimodal large

Comments: Project page: [this https URL](https://kd-tao.github.io/LVOmniBench/)

Abstract:Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

14. 【2603.19216】DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Link: https://arxiv.org/abs/2603.19216

Authors: Tianjiao Yu,Xinzhuo Li,Muntasir Wahed,Jerry Xiong,Yifan Shen,Ying Shen,Ismini Lourentzou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Understanding and generating, objects as compositions, perception and reasoning, compositions of meaningful, fundamental to human

Comments:

Abstract:Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

15. 【2603.19209】Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Link: https://arxiv.org/abs/2603.19209

Authors: Shang-Jui Ray Kuo,Paola Cascante-Bonilla

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: large language model, large language, Large vision, SSM, lightweight connector

Comments: Project page: [this https URL](https://lab-spell.github.io/vlm-ssm-vision-encoders/) ; Code: [this https URL](https://github.com/raykuo18/vlm-ssm-vision-encoders)

Abstract: Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

16. 【2603.19206】RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Link: https://arxiv.org/abs/2603.19206

Authors: Yue Gong,Hongyu Li,Shanyuan Liu,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Manyuan Zhang,Dawei Leng,Yuhui Yin,Lijun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models shifting denoising, diffusion models shifting, efficiency and scalability, dominant paradigm, shifting denoising

Comments:

Abstract: Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

17. 【2603.19203】Tinted Frames: Question Framing Blinds Vision-Language Models

Link: https://arxiv.org/abs/2603.19203

Authors: Wan-Cyuan Fan,Jiayun Luo,Declan Kutscher,Leonid Sigal,Ritwik Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-Language Models, require visual reasoning, tasks that require, Models, visual reasoning

Comments: Preprint. Project page: [this https URL](https://davidhalladay.github.io/tinted_frames_demo/)

Abstract: Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.

18. 【2603.19199】FASTER: Rethinking Real-Time Flow VLAs

Link: https://arxiv.org/abs/2603.19199

Authors: Yuxiang Lu,Zhe Liu,Xianzhe Fan,Zhenya Yang,Jinghua Hou,Junyi Li,Kaixin Ding,Hengshuang Zhao

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial for deploying, physical world, reaction, reaction time, Fast Action Sampling

Comments: Project page: [this https URL](https://innovator-zero.github.io/FASTER)

Abstract:Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $\pi_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.

19. 【2603.19193】Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.19193

Authors: Yiren Lu,Xin Ye,Burhaneddin Yaman,Jingru Luo,Zhexiao Xiong,Liu Ren,Yu Yin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fuses surrounding-view images, unified spatial representation, BEV, object detection, BEV perception

Comments: Project page at [this https URL](https://vulab-ai.github.io/Splat2BEV/)

Abstract: Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

20. 【2603.19176】Few-shot Acoustic Synthesis with Multimodal Flow Matching

Link: https://arxiv.org/abs/2603.19176

Authors: Amandine Brunetto

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Keywords: immersive virtual environments, Generating audio, acoustically consistent, essential for immersive, immersive virtual

Comments: To appear at CVPR 2026. 23 pages, 16 figures. Project Page: [this https URL](https://amandinebtto.github.io/FLAC/)

Abstract:Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.

21. 【2603.19169】ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Link: https://arxiv.org/abs/2603.19169

Authors: Zhan Jin,Yu Luo,Yizhou Zhang,Ziyang Cui,Yuqing Wei,Xianchao Liu,Xueying Zeng,Qing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Conventional pixel-wise loss, producing fragmented vascular, high pixel-level accuracy, loss functions fail, fragmented vascular trees

Comments: 28 pages, 5 figures. arXiv:submit/7385738 [cs.AI]

Abstract: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
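To make the idea of using Betti numbers as a preference signal more concrete, here is a minimal sketch. It assumes, purely for illustration, that only the zeroth Betti number (the count of connected components) is used and that the mask with fewer components is preferred; the abstract does not state how the preference pairs are actually built. The helper names `betti0` and `preference_pair` are hypothetical.

```python
from collections import deque


def betti0(mask: list[list[int]]) -> int:
    """Count 8-connected foreground components in a binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New component found: flood-fill it with breadth-first search.
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return components


def preference_pair(mask_a, mask_b):
    """Return (chosen, rejected), preferring fewer components; None on a tie.

    Assumption for this sketch: a less fragmented vessel mask is "better".
    """
    a, b = betti0(mask_a), betti0(mask_b)
    if a == b:
        return None
    return (mask_a, mask_b) if a < b else (mask_b, mask_a)
```

A pair produced this way could serve as the chosen/rejected inputs that DPO-style training expects, although a real pipeline would presumably also guard against trivially empty or over-merged masks.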

22. 【2603.19166】Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Link: https://arxiv.org/abs/2603.19166

Authors: Swagat Padhan,Lakshya Jain,Bhavya Minesh Shah,Omkar Patil,Thao Nguyen,Nakul Gopalan

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: convert natural language, collaborating with humans, humans must convert, convert natural, grounding

23. 【2603.19158】Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Link: https://arxiv.org/abs/2603.19158

Authors: Kwanyoung Lee,SeungJu Cha,Yebin Ahn,Hyunwoo Oh,Sungho Koh,Dong-Jin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made remarkable progress, semantically rich images, made remarkable, remarkable progress, progress in generating

Comments: Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)

Abstract:Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

24. 【2603.19157】ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Link: https://arxiv.org/abs/2603.19157

Authors: Kwanyoung Lee,Hyunwoo Oh,SeungJu Cha,Sungho Koh,Dong-Jin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating rare compositional, Generating rare, synthesis remains, diffusion models, Generating

Comments: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)

Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by using an LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose ADAPT, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.

25. 【2603.19137】GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Link: https://arxiv.org/abs/2603.19137

Authors: Yiren Lu,Yi Du,Disheng Liu,Yunlai Zhou,Chen Wang,Yu Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: retain spatial knowledge, Effective embodied exploration, knowledge over time, Effective embodied, accumulate and retain

Comments: Project page at [this https URL](https://vulab-ai.github.io/GSMem/)

Abstract: Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

26. 【2603.19122】Revisiting Autoregressive Models for Generative Image Classification

Link: https://arxiv.org/abs/2603.19122

Authors: Ilia Sudakov,Artem Babenko,Dmitry Baranchuk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrating clear advantages, Class-conditional generative models, visual generative paradigms, Class-conditional generative, including autoregressive

Comments: Tech report

Abstract:Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
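The order-marginalization idea in the abstract above can be sketched roughly as follows: score each class under several random token orders and average the scores before taking the argmax. Everything concrete here is an assumption for illustration: the `log_likelihood(tokens, order, cls)` callback stands in for an any-order AR model, the plain average of per-order log-likelihoods is one simple choice of aggregation, and `toy_log_likelihood` is a made-up scorer so that the example runs.

```python
import random
from typing import Callable, Sequence


def order_marginalized_predict(
    tokens: Sequence[int],
    classes: Sequence[int],
    log_likelihood: Callable[[Sequence[int], Sequence[int], int], float],
    num_orders: int = 8,
    seed: int = 0,
) -> int:
    """Pick the class with the highest log-likelihood averaged over token orders."""
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    scores = {c: 0.0 for c in classes}
    for _ in range(num_orders):
        order = positions[:]
        rng.shuffle(order)  # one random generation order over token positions
        for c in classes:
            scores[c] += log_likelihood(tokens, order, c) / num_orders
    return max(scores, key=lambda c: scores[c])


def toy_log_likelihood(tokens: Sequence[int], order: Sequence[int], cls: int) -> float:
    """Hypothetical stand-in scorer: penalize tokens far from the class id,
    weighting later positions in the order slightly more."""
    return -sum(
        abs(tokens[pos] - cls) * (1.0 + 0.1 * rank)
        for rank, pos in enumerate(order)
    )


if __name__ == "__main__":
    print(order_marginalized_predict([1, 1, 2], [0, 1, 2], toy_log_likelihood))
```

With `num_orders=1` this reduces to a single-order classifier, which makes it easy to compare fixed-order and order-averaged predictions on the same scorer.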

27. 【2603.19121】CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

链接：https://arxiv.org/abs/2603.19121

作者：Weilin Chen,Jiahao Rao,Wenhao Wang,Xinyang Li,Xuan Cheng,Liujuan Cao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：significant challenge, indoor scene textures, remains a significant, scene textures remains, reference images

备注： Accepted to CVPR 2026. This version integrates the main paper and supplementary material

Abstract:The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

28. [2603.19098] TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Link: https://arxiv.org/abs/2603.19098

Authors: Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui, Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Markus Zarbock, Florain Stanek, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Intelligent Transportation Systems, Transportation Systems, Intelligent Transportation, safety in Intelligent, Traffic Anomaly Understanding

Abstract:Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: this https URL

29. [2603.19092] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Link: https://arxiv.org/abs/2603.19092

Authors: Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Vision-language models, increasingly deployed, deployed in real-world, real-world and embodied, embodied settings

30. [2603.19077] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

Link: https://arxiv.org/abs/2603.19077

Authors: Ye Wang, Wei Lu, Zhihui You, Keyan Chen, Tongfei Liu, Kaiyu Li, Hongruixuan Chen, Qingling Shu, Sibao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optical remote sensing, remote sensing imagery, surface land-cover materials, illumination fluctuations, optical remote

Comments: 15 pages, 12 figures

Abstract:Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: this https URL

31. [2603.19076] DROID-SLAM in the Wild

Link: https://arxiv.org/abs/2603.19076

Authors: Moyang Li, Zihan Zhu, Marc Pollefeys, Daniel Barath

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Uncertainty-aware Bundle Adjustment, real-time RGB SLAM, Bundle Adjustment, differentiable Uncertainty-aware Bundle, RGB SLAM system

Comments: CVPR 2026, Project Page: [this https URL](https://moyangli00.github.io/droid-w/)

Abstract:We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at this https URL.

32. [2603.19059] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Link: https://arxiv.org/abs/2603.19059

Authors: Oliver Cory, Ozge Mercanoglu Sincan, Richard Bowden

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, utilises Large Language, linguistically-grounded Sign Language, Language Models, Large Language

Abstract:This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

33. [2603.19054] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Link: https://arxiv.org/abs/2603.19054

Authors: Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu, Qianxi Zhang, Weijun Wang, Ting Cao, Yunxin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, models respond proactively, interaction paradigm, respond proactively, Recent

Abstract:Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
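
The streaming half of the propose-match idea in this abstract (embedding-based matching that triggers a response) can be sketched roughly as follows. The cosine-similarity score, the fixed threshold, and the names `match_stream` and `proposals` are assumptions for illustration; the abstract does not specify how the Lightweight Proposal Matching Module scores or thresholds matches.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_stream(
    frame_embeddings: Sequence[Sequence[float]],
    proposals: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> List[Tuple[int, str]]:
    """Return (frame_index, proposal_name) for every frame whose best-matching
    proposal reaches the similarity threshold."""
    triggers: List[Tuple[int, str]] = []
    for index, frame in enumerate(frame_embeddings):
        best_name: Optional[str] = None
        best_score = threshold
        for name, embedding in proposals.items():
            score = cosine(frame, embedding)
            if score >= best_score:
                best_name, best_score = name, score
        if best_name is not None:
            triggers.append((index, best_name))
    return triggers
```

In this sketch the proposal embeddings are computed once per query, so the per-frame cost is only a handful of dot products, which is the kind of decoupling the abstract describes.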

34. [2603.19053] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Link: https://arxiv.org/abs/2603.19053

Authors: Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: digital fashion, remains a longstanding, longstanding challenge, challenge in computer, computer vision

Comments: CVPR 2026

Abstract:Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

35. [2603.19048] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Link: https://arxiv.org/abs/2603.19048

Authors: Weijia Dou, Wenzhao Zheng, Weiliang Chen, Yu Zheng, Jie Zhou, Jiwen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent generative models, spatial geometric inconsistencies, produce high-fidelity videos, Recent generative, models can produce

Comments: Code available at [this https URL](https://github.com/tj12323/SGC)

Abstract:Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each subregion, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
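
The final step described in this abstract, computing the divergence among per-sub-region camera poses, can be illustrated with a small sketch. It assumes that each sub-region's pose is reduced to a 3x3 rotation matrix and that divergence is the mean pairwise geodesic angle; translation is ignored and the actual divergence measure used by SGC is not stated in the abstract. The helper `rot_z` exists only to build example inputs.

```python
import math
from itertools import combinations
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def rotation_angle(r_a: Matrix, r_b: Matrix) -> float:
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of element-wise products.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def pose_divergence(rotations: Sequence[Matrix]) -> float:
    """Mean pairwise rotation angle across sub-regions (0.0 means all
    sub-regions agree on the camera rotation)."""
    pairs = list(combinations(rotations, 2))
    if not pairs:
        return 0.0
    return sum(rotation_angle(a, b) for a, b in pairs) / len(pairs)


def rot_z(theta: float) -> List[List[float]]:
    """Rotation by theta radians about the z axis (example input only)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

A geometrically consistent static background should yield nearly identical poses from every sub-region and therefore a divergence near zero, while warped or drifting geometry makes the local estimates disagree.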

36. [2603.19039] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Link: https://arxiv.org/abs/2603.19039

Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: require grounding complex, grounding complex spatial, Vision-language models, complex spatial reasoning, pixel-grounded geospatial reasoning

Comments: Accepted by CVPR 2026 (Main Track)

Abstract:Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

37. [2603.19036] FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

Link: https://arxiv.org/abs/2603.19036

Authors: Telang Xu, Chaoyang Zhang, Guangtao Zhai, Xiaohong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strength varies spatially, reflection strength varies, Single image reflection, real scenes, transmission structures

Abstract:Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image: an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at this https URL.
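
The first-stage gating described in this abstract, where two priors are combined to gate a conditional residual injection, can be sketched as below. The element-wise product of the two priors, their clamping to [0, 1], and the function name `gated_injection` are assumptions for illustration; the abstract does not say how the cues are combined.

```python
from typing import List, Sequence


def _clamp01(value: float) -> float:
    """Clamp a prior value into the [0, 1] range."""
    return max(0.0, min(1.0, value))


def gated_injection(
    features: Sequence[float],
    residual: Sequence[float],
    intensity_prior: Sequence[float],
    highfreq_prior: Sequence[float],
) -> List[float]:
    """Add a conditional residual to the features, scaled per location by the
    product of the two priors (assumed combination rule)."""
    output: List[float] = []
    for f, r, a, h in zip(features, residual, intensity_prior, highfreq_prior):
        gate = _clamp01(a) * _clamp01(h)
        output.append(f + gate * r)
    return output
```

With a product gate, the residual is injected fully only where both priors are high, which matches the stated goal of focusing on regions that are both reflection-dominant and structure-sensitive.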

38. [2603.19028] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Link: https://arxiv.org/abs/2603.19028

Authors: Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: uncurated training data, training data introduce, data introduce severe, introduce severe social, vision and language

Comments: CVPR Findings 2026. Project website: [this https URL](https://sparse-embedding-modulation.github.io/)

Abstract:Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
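
The idea of modulating bias-relevant features in a sparse autoencoder latent space, as described in this abstract, can be sketched as follows. This is a toy illustration under stated assumptions: a ReLU encoder, a linear decoder, a known set of bias-relevant latent indices, and a single scaling factor. How SEM identifies those latents and chooses the modulation is not described in the abstract, and `modulate_embedding` is a hypothetical name.

```python
from typing import List, Sequence, Set


def matvec(matrix: Sequence[Sequence[float]], vector: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def modulate_embedding(
    embedding: Sequence[float],
    w_enc: Sequence[Sequence[float]],
    b_enc: Sequence[float],
    w_dec: Sequence[Sequence[float]],
    bias_latents: Set[int],
    scale: float = 0.0,
) -> List[float]:
    """Scale selected sparse latents and apply the decoded change to the
    original embedding, leaving all other latents untouched."""
    pre_activation = matvec(w_enc, embedding)
    latents = [max(0.0, h + b) for h, b in zip(pre_activation, b_enc)]  # ReLU
    modulated = [z * scale if i in bias_latents else z for i, z in enumerate(latents)]
    delta = [zm - z for zm, z in zip(modulated, latents)]
    # Only the decoded difference is added, so SAE reconstruction error
    # does not leak into the output embedding.
    correction = matvec(w_dec, delta)
    return [x + c for x, c in zip(embedding, correction)]
```

Because the edit happens after a non-linearity and touches only chosen latents, the resulting change to the embedding is not a single fixed linear projection, which is one way to read the abstract's "non-linear interventions".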

39. [2603.19026] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Link: https://arxiv.org/abs/2603.19026

Authors: Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multi-modal Large Language, leveraging Multi-modal Large, Language Models, Multi-modal Large

Comments: Paper is accepted by CVPR 2026

Abstract:Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: this https URL.

40. [2603.19013] Generalized Hand-Object Pose Estimation with Occlusion Awareness

Link: https://arxiv.org/abs/2603.19013

Authors: Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, hand-object pose estimation, single RGB image, RGB image remains, pose estimation

Comments: 25 pages, 7 figures

Abstract:Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

41. [2603.19004] Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

Link: https://arxiv.org/abs/2603.19004

Authors: Raffaele Cappelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fingerprint recognition systems, verification applications, unique characteristics, characteristics of human, essential in modern

Abstract:Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.

42. [2603.18991] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Link: https://arxiv.org/abs/2603.18991

Authors: Zening Sun, Zhengpeng Xie, Lichen Bai, Shitong Shao, Shuo Yang, Zeke Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Aligning Diffusion models, Aligning Diffusion, achieved remarkable breakthroughs, human preference-aligned images, Diffusion models

Comments: CVPR 2026

Abstract:Aligning Diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220x faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
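
The data-selection step named in this abstract (Composite Reward Filtering) can be illustrated with a rough sketch. The weighted sum of reward functions, the top-k cutoff, and the name `composite_reward_filter` are assumptions; the abstract does not describe how the rewards are combined or how samples are selected.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, float]


def composite_reward_filter(
    samples: Sequence[Sample],
    reward_fns: Sequence[Callable[[Sample], float]],
    weights: Sequence[float],
    keep: int,
) -> List[Sample]:
    """Score each sample with a weighted sum of reward functions and keep the
    `keep` highest-scoring samples (ties keep their input order)."""
    scored = []
    for sample in samples:
        score = sum(w * fn(sample) for fn, w in zip(reward_fns, weights))
        scored.append((score, sample))
    # sorted() is stable, so equal scores preserve the original ordering.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in ranked[:keep]]
```

The selected subset would then serve as the training set for the SFT stage; the "enhanced variant of SFT" itself is not sketched here because the abstract gives no details about it.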

43. [2603.18943] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Link: https://arxiv.org/abs/2603.18943

Authors: Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: geometry-consistent panoramic depth, paper presents, panoramic depth estimation, depth estimation, geometry-consistent panoramic

Abstract:This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

44. [2603.18924] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

Link: https://arxiv.org/abs/2603.18924

Authors: Feifan Luo, Hongyang Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating correspondences, functional map, functional map solvers, non-rigid deformable, vision and graphics

Abstract:Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.

45. [2603.18912] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Link: https://arxiv.org/abs/2603.18912

Authors: Ahmed Tawfik Aboukhadra, Marcel Rogge, Nadia Robertini, Abdalla Arafa, Jameel Malik, Ahmed Elhayek, Didier Stricker

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding realistic hand-object, Gaussian Hand-Object Splatting, monocular RGB videos, Understanding realistic, monocular RGB

Abstract:Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at this https URL.

46. 【2603.18896】Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

Link: https://arxiv.org/abs/2603.18896

Authors: Yitong Li, Igor Yakushev, Dennis M. Hedderich, Christian Wachinger

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Positron emission tomography, widely recognized technique, Positron emission, diagnosing neurodegenerative diseases, offering critical functional

Comments: Accepted by Medical Image Analysis

Abstract:Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA's ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer's diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at this https URL.

47. 【2603.18892】MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Link: https://arxiv.org/abs/2603.18892

Authors: Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-Language Models, agents in physical, physical environments, foundational for Vision-Language, Spatial reasoning

Comments: Project page: [this https URL](https://youngwanlee.github.io/multihopspatial)

Abstract:Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
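The Acc@50IoU metric described above counts a sample as correct only when the selected answer matches and the predicted box overlaps the reference well enough. The sketch below is one plausible reading of that description: the 0.5 IoU threshold is inferred from the metric's name, and the record layout (`answer`, `box` as `(x1, y1, x2, y2)`) is a hypothetical format chosen for illustration, not the benchmark's actual schema.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_50_iou(predictions, references, threshold=0.5):
    """Fraction of samples with a correct answer AND box IoU >= threshold.
    (Assumed definition; threshold inferred from the metric name.)"""
    hits = 0
    for pred, ref in zip(predictions, references):
        answer_ok = pred["answer"] == ref["answer"]
        box_ok = box_iou(pred["box"], ref["box"]) >= threshold
        hits += answer_ok and box_ok
    return hits / len(references)

references = [
    {"answer": "A", "box": (0, 0, 10, 10)},
    {"answer": "B", "box": (0, 0, 10, 10)},
    {"answer": "C", "box": (0, 0, 10, 10)},
]
predictions = [
    {"answer": "A", "box": (0, 0, 10, 10)},   # right answer, exact box
    {"answer": "B", "box": (5, 5, 15, 15)},   # right answer, poorly placed box
    {"answer": "D", "box": (0, 0, 10, 10)},   # exact box, wrong answer
]
score = acc_at_50_iou(predictions, references)
```

Only the first prediction satisfies both conditions, so a model that answers correctly without grounding the answer, or grounds correctly while answering wrongly, gets no credit under this reading.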

48. 【2603.18891】PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Link: https://arxiv.org/abs/2603.18891

Authors: Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Visual In-Context Learning, imitating pixel demonstrations, Visual In-Context, In-Context Learning, aims to complete

Comments: Accepted to ICLR 2026. 17 pages, 11 figures, and 9 tables

Abstract:Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at this https URL.

49. 【2603.18856】Motion-o: Trajectory-Grounded Video Reasoning

Link: https://arxiv.org/abs/2603.18856

Authors: Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: made substantial progress, Recent research, models leveraging spatio-temporal, leveraging spatio-temporal evidence, inference capabilities

Comments:

Abstract:Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories explicit through discrete motion/ tags summarizing per-object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at this https URL.

50. 【2603.18850】HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Link: https://arxiv.org/abs/2603.18850

Authors: Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: downstream answering quality, depends critically, Video question answering, downstream answering, systems rely

Comments:

Abstract:Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at this https URL.
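GRPO scores each sampled action relative to the other samples drawn for the same input rather than against a learned value function. As a rough illustration only, the sketch below computes group-relative advantages for four hypothetical frame subsets sampled for one question. The reward values and the use of the population standard deviation are assumptions; the abstract does not describe the paper's reward or normalization details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / (group std + eps).
    Samples that beat the group average get a positive advantage."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards: 1.0 if the frozen VLM answered correctly from that frame subset.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, they sum to approximately zero, and the policy update pushes probability toward the frame subsets that led to correct answers and away from the rest.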

51. 【2603.18846】Towards Interpretable Foundation Models for Retinal Fundus Images

Link: https://arxiv.org/abs/2603.18846

Authors: Samuel Ofosu Mensah, Maria Camila Roa Carvajal, Kerol Djoumessi, Philipp Berens

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO)

Keywords: extract transferable representations, typically via self-supervised, self-supervised learning, extract transferable, large amounts

Comments: 11 pages, 3 figures, 2 tables, submitted to MICCAI 2026

Abstract:Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model's representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to 16× the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

52. 【2603.18834】Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Link: https://arxiv.org/abs/2603.18834

Authors: Hesong Li, Ziqi Wu, Ruiwen Shao, Ying Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transmission Electron Microscopy, High-Resolution Transmission Electron, Electron Microscopy, Transmission Electron, advanced solid materials

Comments: Accepted to CVPR 2026

Abstract:High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which advances the study of advanced solid materials. Nonetheless, because nucleation changes on a millisecond scale, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristics. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise, which can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at this https URL.

53. 【2603.18797】VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

Link: https://arxiv.org/abs/2603.18797

Authors: Chinmay Prabhakar, Bastian Wittmann, Tamaz Amiranashvili, Paul Büschl, Ezequiel de la Rosa, Julian McGinnis, Benedikt Wiestler, Bjoern Menze, Suprosanna Shit

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spatial graphs provide, provide a lightweight, lightweight and elegant, neuronal networks, curvilinear anatomical structures

Comments:

Abstract:Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok's performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok's learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.

54. 【2603.18795】Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Link: https://arxiv.org/abs/2603.18795

Authors: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, implicitly infer complex

Comments:

Abstract:Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks, improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
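Tokenizing dense depth with a VQ-VAE codebook comes down to replacing each continuous feature vector with the index of its nearest codebook entry. The sketch below shows only that generic lookup step on toy data. The codebook values, the two-dimensional "patch" vectors, and the function names are illustrative assumptions and do not reflect Perceptio's actual codebook or encoder.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2 distance)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in vectors
    ]

def dequantize(tokens, codebook):
    """Recover the (lossy) vectors that the discrete tokens stand for."""
    return [codebook[t] for t in tokens]

# Toy codebook with three entries and three encoded patches (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
patches = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]
tokens = quantize(patches, codebook)
reconstructed = dequantize(tokens, codebook)
```

The resulting integer indices are what a language model can emit as discrete tokens inside its output sequence, and the codebook lookup turns them back into an approximate dense signal.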

55. 【2603.18792】Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Link: https://arxiv.org/abs/2603.18792

Authors: Jakob Lønborg Christensen, Vedrana Andersen Dahl, Morten Rieger Hannemose, Anders Bjorholm Dahl, Christian F. Baumgartner

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image segmentation, image segmentation, crucial in safety-critical, safety-critical applications, medical image

Comments:

Abstract:Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
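For readers unfamiliar with the AU/EU split, a common information-theoretic version for ensembles treats the entropy of the averaged prediction as total uncertainty, the average per-member entropy as aleatoric, and their difference as epistemic. The sketch below implements that standard decomposition for a single pixel or sample. It is not the paper's proposed entanglement metric, which the abstract does not define, and the toy probabilities are made up for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(member_probs):
    """Return (total, aleatoric, epistemic) uncertainty for one sample.
    member_probs holds one class-probability vector per ensemble member."""
    n = len(member_probs)
    mean_p = [sum(col) / n for col in zip(*member_probs)]
    total = entropy(mean_p)                                   # entropy of the mean prediction
    aleatoric = sum(entropy(p) for p in member_probs) / n     # mean of member entropies
    epistemic = total - aleatoric                             # mutual information
    return total, aleatoric, epistemic

# Members agree that the sample is ambiguous: uncertainty is purely aleatoric.
agree = decompose([[0.5, 0.5], [0.5, 0.5]])
# Members are individually confident but contradict each other: purely epistemic.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
```

Both toy cases have the same total uncertainty (ln 2), yet the split attributes it to different sources, which is the distinction the decomposition is meant to capture and the one that entanglement blurs in practice.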

56. 【2603.18782】Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Link: https://arxiv.org/abs/2603.18782

Authors: Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent progress, driven largely, point cloud priors, point cloud, Recent

Comments: Accepted by CVPR 2026

Abstract:Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

57. 【2603.18774】SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Link: https://arxiv.org/abs/2603.18774

Authors: Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, Malcolm Mielle

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Foundational feed-forward visual, Foundational feed-forward, strong scene priors, learning strong scene, massive RGB datasets

Comments:

Abstract:Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements over all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at this https URL.

58. 【2603.18764】ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

Link: https://arxiv.org/abs/2603.18764

Authors: Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Source-Free Domain Adaptation, adapts pre-trained models, adapts pre-trained, requiring access, Domain Adaptation

Comments:

Abstract:Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model's initial predictions with the current model's online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: this https URL.

59. 【2603.18758】Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning

Link: https://arxiv.org/abs/2603.18758

Authors: Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: asynchronous video-based learning, learning-enabled speaker-centric Emotion, speaker-centric Emotion, Emotion AI approach, Massive Open Online

Comments: Preprint. Accepted for publication in IEEE Transactions on Computational Social Systems

Abstract:This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R² = 0.85 for affective engagement and R² = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.
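The R² values quoted above are coefficients of determination, that is, the share of variance in the target that a regression model explains. As a quick reminder of how the number is computed, here is a minimal sketch on made-up values; the data are illustrative only and have no connection to the paper's MOOC corpus.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical engagement ratings and model predictions.
score = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A score of 1.0 means perfect prediction, while 0.0 means the model does no better than always predicting the mean of the targets.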

60. 【2603.18757】DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Link: https://arxiv.org/abs/2603.18757

Authors: Haochen Li, Rui Zhang, Hantao Yao, Xin Zhang, Yifan Hao, Shaohui Peng, Yongwei Zhao, Ling Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adaptive Object Detection, Domain Adaptive Object, labeled source domain, unlabeled target domain, Domain Adaptive

Comments: Accepted by CVPR 2026

Abstract:Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

61. 【2603.18752】WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

Link: https://arxiv.org/abs/2603.18752

Authors: Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Natural language explanations, language explanations provide, explanations, closely reflecting, textual reports

Comments:

Abstract:Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model's reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model's feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as few as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.

62. 【2603.18742】6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Link: https://arxiv.org/abs/2603.18742

Authors: Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable capabilities, demonstrated remarkable, remarkable capabilities, capabilities in generating, Quantization

Comments:

Abstract:Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92× end-to-end acceleration and 3.32× memory reduction, setting a new baseline for efficient inference in Video DiTs.
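To make the idea of difference-driven precision allocation concrete, the sketch below picks a format per block from a relative input-output change. Everything in it is a hypothetical simplification: the relative L1 measure, the fixed 0.1 threshold, and the function names are assumptions for illustration, whereas the paper describes a learned lightweight predictor whose details are not given in the abstract.

```python
def relative_change(block_input, block_output):
    """Relative L1 difference between a block's output and its input (assumed measure)."""
    num = sum(abs(o - i) for i, o in zip(block_input, block_output))
    den = sum(abs(i) for i in block_input) or 1.0  # avoid division by zero
    return num / den

def choose_precision(block_input, block_output, threshold=0.1):
    """Hypothetical rule: small change -> lower-precision NVFP4, large change -> INT8."""
    if relative_change(block_input, block_output) < threshold:
        return "NVFP4"
    return "INT8"

# Toy activations: the first block barely changes its input, the second changes it a lot.
stable = choose_precision([1.0, 2.0, 3.0], [1.01, 2.0, 2.98])
volatile = choose_precision([1.0, 2.0, 3.0], [2.0, 0.5, 4.5])
```

The design intuition, as stated in the abstract, is that blocks whose outputs track their inputs closely tolerate more aggressive quantization, so the cheaper format is reserved for them while volatile blocks keep the more robust one.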

63. 【2603.18739】EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Link: https://arxiv.org/abs/2603.18739

Authors: Longfei Liu, Yongjie Hou, Yang Li, Qirui Wang, Youyang Sha, Yongjun Yu, Yinzhi Wang, Peizhe Ru, Xuanlong Yu, Xi Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deploying high-performance dense, devices remains challenging, Deploying high-performance, remains challenging due, resource-constrained edge devices

Comments: Code is available at: [this https URL](https://intellindust-ai-lab.github.io/projects/EdgeCrafter/)

Abstract:Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency trade-off, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/

64. 【2603.18719】Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Link: https://arxiv.org/abs/2603.18719

Authors: Mohamed Youssef, Mayar Elfares, Anna-Maria Meer, Matteo Bortoletto, Andreas Bulling

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: labelled real-world data, gap remains challenging, data is scarce, remains challenging, challenging as labelled

Abstract:Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.

65. 【2603.18707】From ex(p) to poly: Gaussian Splatting with Polynomial Kernels

Link: https://arxiv.org/abs/2603.18707

Authors: Joerg H. Mueller, Martin Winter, Markus Steinberger

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Recent advancements, Gaussian Splatting, resulting in significant, original Gaussian kernel, Recent

Abstract:Recent advancements in Gaussian Splatting (3DGS) have introduced various modifications to the original kernel, resulting in significant performance improvements. However, many of these kernel changes are incompatible with existing datasets optimized for the original Gaussian kernel, presenting a challenge for widespread adoption. In this work, we address this challenge by proposing an alternative kernel that maintains compatibility with existing datasets while improving computational efficiency. Specifically, we replace the original exponential kernel with a polynomial approximation combined with a ReLU function. This modification allows for more aggressive culling of Gaussians, leading to enhanced performance across different 3DGS implementations. Our results show a notable performance improvement of 4 to 15% with negligible impact on image quality. We also provide a detailed mathematical analysis of the new kernel and discuss its potential benefits for 3DGS implementations on NPU hardware.
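To make the kernel swap described above concrete, here is a minimal sketch. The abstract does not give the paper's polynomial, so the form used here, `max(0, 1 - t/n)^n` with `t = d2/2`, is an assumption chosen only because it is a standard polynomial that approaches `exp(-t)` as `n` grows. The function names and the default `n=4` are likewise hypothetical. The point it illustrates is that clamping with a ReLU gives the kernel finite support, which is what allows primitives to be culled outside a fixed radius.

```python
import math


def gaussian_kernel(d2):
    """Original exponential falloff for squared (Mahalanobis) distance d2."""
    return math.exp(-0.5 * d2)


def poly_relu_kernel(d2, n=4):
    """Assumed stand-in kernel: ReLU(1 - d2 / (2n)) ** n.

    Not the paper's polynomial; it is exactly zero once d2 >= 2n.
    """
    base = max(0.0, 1.0 - 0.5 * d2 / n)  # the ReLU clamp
    return base ** n


def support_radius_sq(n=4):
    """Squared distance beyond which poly_relu_kernel is exactly zero."""
    return 2.0 * n


for d2 in (0.0, 1.0, 4.0, 8.0):
    print(d2, gaussian_kernel(d2), poly_relu_kernel(d2))
```

Because the exponential never reaches zero, an implementation has to pick a cutoff heuristically, whereas the clamped polynomial defines its own cutoff at `support_radius_sq(n)`.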

66. 【2603.18671】Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

Link: https://arxiv.org/abs/2603.18671

Authors: Juan Miguel Valverde, Dim P. Papadopoulos, Rasmus Larsen, Anders Bjorholm Dahl

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Standard deep learning, Standard deep, deep learning models, guarantee topology accuracy, failing to preserve

Comments: Accepted to CVPR 2026

Abstract:Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at this https URL.

67. 【2603.18660】Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Link: https://arxiv.org/abs/2603.18660

Authors: Peihang Wu, Zehong Chen, Lijian Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gigapixel histopathology images, transformed digital pathology, enabling computational analysis, slide imaging, transformed digital

Abstract:Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

68. 【2603.18655】Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

Link: https://arxiv.org/abs/2603.18655

Authors: Jingguo Qu, Xinyang Han, Yao Pu, Man-Lik Chui, Simon Takadiyi Gunda, Ziman Chen, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: image segmentation faces, segmentation faces significant, faces significant challenges, significant challenges due, artifacts including speckle

Comments: This is the author-submitted LaTeX version with original typesetting. The final published version (with IEEE production formatting and layout changes) is available at [this http URL](http://doi.org/10.1109/TNNLS.2026.3669814) under CC BY 4.0 license

Abstract:Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) a Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) a Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At a 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at this https URL
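"Amplitude switching in Fourier space" can be illustrated on a 1D signal. The sketch below is a generic amplitude swap, not the paper's FDS module: it assumes the operation means keeping the phase spectrum of one signal and taking the amplitude spectrum of another, and it uses a naive DFT on plain lists rather than 2D images. All function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a list of numbers."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT returning complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def switch_amplitude(content, style):
    """Combine the phase of `content` with the amplitude of `style`."""
    if len(content) != len(style):
        raise ValueError("signals must have the same length")
    content_spec = dft(content)
    style_spec = dft(style)
    # Per frequency bin: magnitude from `style`, angle from `content`.
    mixed = [
        cmath.rect(abs(s), cmath.phase(c))
        for c, s in zip(content_spec, style_spec)
    ]
    # For real inputs the mixed spectrum stays conjugate-symmetric,
    # so the imaginary parts are only rounding noise.
    return [value.real for value in idft(mixed)]


a = [1.0, 2.0, 3.0, 4.0]
b = [0.0, 1.0, 0.0, -1.0]
mixed_signal = switch_amplitude(a, b)
```

The intuition usually given for this kind of augmentation is that phase carries most of the structural content while amplitude carries appearance statistics, so swapping amplitudes perturbs appearance while leaving structure largely intact.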

69. 【2603.18652】Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Link: https://arxiv.org/abs/2603.18652

Authors: Pius Horn, Janis Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Reliably extracting tables, knowledge base construction, Reliably extracting, capture semantic equivalence, existing evaluation approaches

Comments: Submitted to ICDAR 2026

70. 【2603.18649】Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

Link: https://arxiv.org/abs/2603.18649

Authors: Ruizhi Yu, Keyang Zhong, Peng Liu, Qi Wu, Haoran Zhang, Yanhao Zhang, Chen Chen, Haonan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Live streaming commerce, streaming commerce, modern era, prominent form, form of broadcasting

Comments: 4 pages, 2 figures, Accepted at WWW2026 Demos

Abstract:Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: this https URL.

71. 【2603.18645】MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Link: https://arxiv.org/abs/2603.18645

Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leverage high-quality reference, restored outputs, high-quality reference images, preserve an individual, evolved from reference-free

Abstract:To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or a few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy that uses an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: this https URL

72. 【2603.18639】PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

Link: https://arxiv.org/abs/2603.18639

Authors: Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Fei-Yue Wang, Zhibo Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ensuring physically consistent, Recent progress, physically consistent motion, consistent motion remains, visual fidelity

Abstract:Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: this https URL.

73. 【2603.18636】Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Link: https://arxiv.org/abs/2603.18636

Authors: Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, high inference cost, inference cost due, strong video generation, video generation quality

Abstract:Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.

74. 【2603.18634】SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

Link: https://arxiv.org/abs/2603.18634

Authors: Rong Fu, Jiekai Wu, Haiyun Wei, Xiaowen Ma, Shiyin Lin, Kangan Qian, Chuang Liu, Jianyuan Ni, Simon James Fong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multi-date satellite imagery, remains difficult due, urban planning, environmental monitoring, multi-date satellite

Comments: 24 pages, 6 figures

Abstract:Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.

75. 【2603.18626】GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Link: https://arxiv.org/abs/2603.18626

Authors: Zelin Liu, Bocheng Li, Yuling Zhou, Xuanting Li, Yixuan Yang, Jing Wang, Weishu Zhao, Xiaofeng Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mariana Trench, Qinghai-Tibet Plateau exhibit, microbial metabolic functions, Qinghai-Tibet Plateau

Abstract:The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present the Geography-knowledge Enhanced Analog Recognition (GEAR) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton-guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics-aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph-based Fine Recognition: We design a Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. In addition, MSG-Net achieves an F1 score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.

76. 【2603.18625】GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Link: https://arxiv.org/abs/2603.18625

Authors: Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent years, realistic and sophisticated, increasingly realistic, Large Vision-Language Models, LVLMs

Comments: ECCV 2026 submission. 14 pages, 6 figures, 4 tables. Supplementary material included

Abstract:In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

77. 【2603.18624】REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation

Link: https://arxiv.org/abs/2603.18624

Authors: Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Zero-shot object-goal navigation, requires navigating unknown, navigating unknown environments, Zero-shot object-goal, object-goal navigation

Abstract:Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of the option, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.

78. 【2603.18623】OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Link: https://arxiv.org/abs/2603.18623

Authors: Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: create realistic human, realistic human movements, animation and robotics, aims to create, create realistic

Abstract:Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technical tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biological parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

79. 【2603.18616】Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Link: https://arxiv.org/abs/2603.18616

Authors: Lukas Bayer, Sheethal Bhat, Andreas Maier

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate multi-organ segmentation, Accurate multi-organ, diagnosis and treatment, essential for computer-aided, computer-aided diagnosis

Abstract:Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
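For readers unfamiliar with the primary metric named above, the Dice Similarity Coefficient for one class is twice the overlap between prediction and ground truth divided by the sum of their sizes. The sketch below computes it on flattened label lists. It is not the paper's evaluation code: the function names are hypothetical, and returning 1.0 when a class is absent from both inputs is one common convention, assumed here.

```python
def dice_coefficient(pred, target, label):
    """Dice score for one class over two equally long label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        # Class absent from both: treated as a perfect match (assumed convention).
        return 1.0
    return 2.0 * overlap / (pred_count + target_count)


def mean_dice(pred, target, labels):
    """Unweighted average of per-class Dice scores."""
    return sum(dice_coefficient(pred, target, label) for label in labels) / len(labels)


# Flattened voxel labels: 0 = background, 1 and 2 = two organs.
prediction = [1, 1, 2, 0]
ground_truth = [1, 2, 2, 0]
score = mean_dice(prediction, ground_truth, labels=[1, 2])
```

Averaging per class rather than pooling all voxels keeps small organs from being swamped by large ones, which matters when a benchmark reports results per organ.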

80. 【2603.18611】Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Link: https://arxiv.org/abs/2603.18611

Authors: Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: social media data, media data dissemination, data dissemination enable, Advances in social, social media

Comments: Accepted at WWW 2026

81. 【2603.18600】Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Link: https://arxiv.org/abs/2603.18600

Authors: Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, Yu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: architecture-based joint audio-video, transformer architecture-based joint, dual-stream transformer architecture-based, joint audio-video generation, current research

Abstract:The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model's convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.

82. 【2603.18599】SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Link: https://arxiv.org/abs/2603.18599

Authors: Jialiang Kang,Han Shu,Wenshuo Li,Yingjie Zhai,Xinghao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Speculative Jacobi Decoding, Speculative Jacobi, Jacobi Decoding, approach to accelerate, accelerate autoregressive

Comments: CVPR 2026

Abstract:Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality.

83. 【2603.18598】Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Link: https://arxiv.org/abs/2603.18598

Authors: Lu Yu,Haiyang Zhang,Changsheng Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted widespread attention, Attention Constraint Module, Global Attention Constraint, impressive zero-shot capabilities, attention

Comments: Accepted to TPAMI 2026. arXiv admin note: substantial text overlap with [arXiv:2410.21802](https://arxiv.org/abs/2410.21802)

Abstract:Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

84. 【2603.18597】myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Link: https://arxiv.org/abs/2603.18597

Authors: Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Convolutional Neural Network, Gated Recurrent Unit, Burmese handwritten digit, publicly available Burmese, Burmese handwritten

Comments: 7 pages, 2 figures, 3 tables, Accepted to ICNLP 2026, Xi'an, China

85. 【2603.18596】Elastic Weight Consolidation Done Right for Continual Learning

Link: https://arxiv.org/abs/2603.18596

Authors: Xuan Liu,Xiaobin Chang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: alleviate catastrophic forgetting, important model weights, Elastic Weight Consolidation, continual learning, alleviate catastrophic

Comments: Accepted to CVPR 2026

Abstract:Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance. In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC's reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios. Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection. Consequently, both EWC and its variants exhibit fundamental misalignments in estimating weight importance, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies EWC's importance estimation. Specifically, reversing the logit values during the calculation of FIM can effectively prevent both gradient vanishing and redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).
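
For readers less familiar with EWC, the following is a minimal sketch of the standard diagonal-Fisher penalty on a toy linear softmax classifier. It is not the paper's Logits Reversal operation, whose details the abstract does not give. The toy model, the helper names (`diagonal_fisher`, `ewc_penalty`), and the scaling factors are all assumptions made for illustration. The sketch also shows why a Fisher-based importance estimate can shrink toward zero when the model is very confident, which is one plausible reading of the "gradient vanishing" the abstract mentions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diagonal_fisher(W, X, y):
    """Empirical diagonal Fisher for a toy model with logits = X @ W."""
    probs = softmax(X @ W)                                  # (n, classes)
    onehot = np.eye(W.shape[1])[y]                          # (n, classes)
    # Per-sample gradient of log p(y | x) with respect to W.
    grads = X[:, :, None] * (onehot - probs)[:, None, :]    # (n, dim, classes)
    return (grads ** 2).mean(axis=0)                        # (dim, classes)

def ewc_penalty(W, W_star, fisher, lam=1.0):
    """Quadratic penalty that anchors important weights near W_star."""
    return 0.5 * lam * float(np.sum(fisher * (W - W_star) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
W = rng.standard_normal((5, 3))
y = (X @ W).argmax(axis=1)          # labels the toy model already predicts

# Same decision boundary, different confidence (hypothetical scalings).
fisher_soft = diagonal_fisher(0.01 * W, X, y)       # near-uniform predictions
fisher_confident = diagonal_fisher(50.0 * W, X, y)  # near one-hot predictions
```

In this toy setting the confident model yields a much smaller Fisher estimate even though both models classify identically, so the penalty would protect its weights only weakly.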

86. 【2603.18588】AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Link: https://arxiv.org/abs/2603.18588

Authors: Jiahe Wang,Cong Liang,Xuandong Huang,Yuxin Wang,Xin Yun,Yi Wu,Yanan Chang,Shangfei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: underexplored challenge, remains a critical, critical yet underexplored, Facial, AUs

Comments:

Abstract:Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs--defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

87. 【2603.18586】Color image restoration based on nonlocal saturation-value similarity

Link: https://arxiv.org/abs/2603.18586

Authors: Wei Wang,Yakun Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: color image, color image patches, color image restoration, similarity based nonlocal, image patches

Comments:

Abstract:In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted directly from the red, green and blue channels of a color image, and the color information cannot be described finely because the patch similarity is mainly based on the grayscale values of each independent channel. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in the saturation-value channels of a color image. In particular, we first establish a saturation-value similarity based nonlocal total variation by incorporating the saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on this saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing the Bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of the other tested methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.

88. 【2603.18585】HAViT: Historical Attention Vision Transformer

Link: https://arxiv.org/abs/2603.18585

Authors: Swarnendu Banik,Manish Das,Shiv Ram Dubey,Satish Kumar Singh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mechanisms operate independently, limiting information flow, Vision Transformers, attention mechanisms operate, computer vision

Comments:

Abstract:Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at this https URL.
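
A rough sketch of what "attention matrix storage and blending" could look like is given below. The abstract does not state the exact blending rule, so the convex combination `(1 - alpha) * current + alpha * history`, the choice to carry the blended matrix forward as the new history, and the helper names are assumptions. Only `alpha = 0.45` and the random (rather than zero) initial history come from the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blend_attention(current, history, alpha=0.45):
    """Assumed rule: convex blend of this layer's attention with the history."""
    return (1.0 - alpha) * current + alpha * history

def run_layers(q_list, k_list, alpha=0.45, seed=0):
    """Propagate a blended attention matrix through a stack of layers."""
    rng = np.random.default_rng(seed)
    n = q_list[0].shape[0]
    # Random, row-normalised initial history (the abstract reports that
    # random initialization outperforms zero initialization).
    history = softmax(rng.standard_normal((n, n)))
    blended_per_layer = []
    for q, k in zip(q_list, k_list):
        current = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        history = blend_attention(current, history, alpha)
        blended_per_layer.append(history)
    return blended_per_layer
```

Because both inputs are row-stochastic, the blended matrix stays a valid attention distribution, which is why a convex combination is a natural guess for the rule.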

89. 【2603.18561】CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Link: https://arxiv.org/abs/2603.18561

Authors: Jiacheng Tang,Zhiyuan Zhou,Zhuolin He,Jia Zhang,Kai Zhang,Jian Pu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: fundamentally learn statistical, learn statistical correlations, true causal relationships, show great promise, great promise

Comments: Accepted to CVPR 2026

Abstract:Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
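
For reference, the backdoor adjustment that the abstract refers to is usually written as below. Here $X$ is the input, $Y$ the output, and $z$ ranges over values of a confounder set satisfying the backdoor criterion. Reading the prototype dictionary as a discrete stand-in for $z$ is an interpretation of the abstract; how SCIS actually approximates the sum is not specified there.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X, z\bigr)\, P(z)
```

The contrast is with ordinary conditioning, $P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X)$, where the confounder distribution depends on $X$ and can therefore carry dataset bias into the prediction.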

90. 【2603.18558】HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Link: https://arxiv.org/abs/2603.18558

Authors: Dan Ben-Ami,Gabriele Serussi,Kobi Cohen,Chaim Baskin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Long-form video question, finite context windows, large vision-language models, video question answering, question answering requires

Comments:

Abstract:Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
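
The bottom-up composition of per-frame expert scores can be sketched as follows. The abstract does not define the fuzzy operators, so the specific choices here (min for AND, max for OR, and a running-max "then" for temporal sequencing) and all function names are assumptions, as are the toy score vectors.

```python
import numpy as np

def fuzzy_and(a, b):
    """Both predicates hold at the same frame (assumed: pointwise minimum)."""
    return np.minimum(a, b)

def fuzzy_or(a, b):
    """Either predicate holds at the frame (assumed: pointwise maximum)."""
    return np.maximum(a, b)

def fuzzy_then(a, b):
    """Predicate a held at some strictly earlier frame and b holds now."""
    prior = np.concatenate(([0.0], np.maximum.accumulate(a)[:-1]))
    return np.minimum(prior, b)

def select_frames(curve, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    top = np.argsort(-curve, kind="stable")[:k]
    return np.sort(top)

# Toy per-frame scores in [0, 1] from two hypothetical experts.
a = np.array([0.0, 0.9, 0.1, 0.0, 0.0])   # e.g. "person opens door"
b = np.array([0.8, 0.0, 0.0, 0.7, 0.2])   # e.g. "dog barks"
curve = fuzzy_then(a, b)                  # satisfaction curve for "a, then b"
```

In this toy case the early peak of `b` at frame 0 is suppressed because `a` has not yet occurred, which is the kind of sub-event ordering a single dense similarity score would lose.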

91. 【2603.18545】CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Link: https://arxiv.org/abs/2603.18545

Authors: Xiang Chen,Fangfang Yang,Chunlei Meng,Chengyin Hu,Ang Li,Yiwei Wei,Jiahuan Long,Jiujiang Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: workflows remains underexplored, real clinical workflows, clinical workflows remains, visual front end, remains underexplored

Comments:

Abstract:Medical vision--language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

92. 【2603.18541】Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Link: https://arxiv.org/abs/2603.18541

Authors: Yongwei Jiang,Yixiong Zou,Yuhua Li,Ruixuan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cross-domain few-shot object, adapt pretrained detectors, severe domain shifts, Cross-domain few-shot, data scarcity problems

Comments: Accepted to CVPR 2026

Abstract:Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, a setting that suffers from severe domain shift and data scarcity. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, much like a person who cannot bring visual objects into focus. Therefore, we call it the target-domain Astigmatism problem. Analysis of attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to improve in this paper. Inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

93. 【2603.18524】3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Link: https://arxiv.org/abs/2603.18524

Authors: Hyun-kyu Ko,Jihyeon Park,Younghyun Kim,Dongheok Park,Eunbyung Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual production, emerging applications, including immersive, next-generation e-commerce, highly sought

Comments: Project page: [this https URL](https://ko-lani.github.io/3DreamBooth) Code: [this https URL](https://github.com/Ko-Lani/3DreamBooth)

Abstract:Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth

94. 【2603.18523】Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Link: https://arxiv.org/abs/2603.18523

Authors: Liwei Che,Zhiyu Xue,Yihao Quan,Benlin Liu,Zeru Shi,Michelle Hurst,Jacob Feldman,Ruixiang Tang,Ranjay Krishna,Vladimir Pavlovic

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision-Language Model, Large Vision-Language, Vision-Language Model, powerful test, identify each individual

Comments:

Abstract:Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

95. 【2603.18513】CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

Link: https://arxiv.org/abs/2603.18513

Authors: Elad Yoshai,Ariel D. Yoshai,Natan T. Shaked

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: making computationally intensive, intensive generative super-resolution, computationally intensive generative, exceed gigapixel resolution, routinely exceed gigapixel

Comments:

Abstract:In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.
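
The "pixel-unshuffled rearranged space" can be illustrated with a small NumPy sketch. The rearrangement factor `r = 4` is an assumption, chosen because 4 x 4 = 16 matches the stated 16x reduction in spatial positions; the abstract does not give the factor explicitly. The channel ordering follows the common pixel-unshuffle convention, and the function names are illustrative.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Fold each r x r spatial block into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)            # (c, r, r, h/r, w/r)
    return x.reshape(c * r * r, h // r, w // r)

def pixel_shuffle(x, r):
    """Inverse rearrangement: (C*r*r, H, W) -> (C, H*r, W*r)."""
    cr2, h, w = x.shape
    c = cr2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)            # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

tile = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
packed = pixel_unshuffle(tile, 4)             # fewer positions, more channels
restored = pixel_shuffle(packed, 4)           # lossless round trip
```

No information is discarded by the rearrangement; spatial operations such as convolution or window attention simply run over 16 times fewer positions, each carrying 16 times more channels.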

96. 【2603.18510】OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.18510

Authors: Hongjia Zhai,Qi Zhang,Xiaokun Pan,Xiyu Zhang,Yitong Dong,Huaqi Zhang,Dan Xu,Guofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interact with environments, essential for embodied, embodied applications, applications to perceive, perceive and interact

Comments: CVPR 2026

Abstract:Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a locally consistent map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance than existing online approaches, while maintaining real-time efficiency.

97. 【2603.18508】Foundations and Architectures of Artificial Intelligence for Motor Insurance

Link: https://arxiv.org/abs/2603.18508

Authors: Teerapong Panboonyuen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large-scale real-world deployment, grounded in large-scale, real-world deployment, presents a systematic, systematic treatment

Comments: 173 pages

Abstract:This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

98. 【2603.18505】From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions

Link: https://arxiv.org/abs/2603.18505

Authors: Jingzhi Chen,Lijian Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: protein folding problem, static structure prediction, complex biomolecular interactions, dynamic conformational ensembles, protein protein complexes

Comments: 17 pages, 4 figures

Abstract:The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI-driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA-free architectures and all-atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein-ligand, protein-nucleic acid, and protein-protein complexes; and functional inference of fitness landscapes, mutational effects, and text-guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.

99. 【2603.18502】HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection

Link: https://arxiv.org/abs/2603.18502

Authors: Teerapong Panboonyuen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Heuristic Object Masking, Automated property risk, real estate, Automated property, high-impact yet underexplored

Comments: 21 pages

Abstract:Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.

100. 【2603.18501】Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Link: https://arxiv.org/abs/2603.18501

Authors: Mingde Zhou,Zheng Chen,Yulun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Frame Type Embedder, Video compression aims, aims to maximize, perceptual quality, Efficient Video Diffusion

Notes:

Click to view abstract

Abstract:Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at this https URL.

101. 【2603.18496】NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data

Link: https://arxiv.org/abs/2603.18496

Authors: Daniel DeTone,Federica Bogo,Eric-Tuan Le,Duncan Frost,Julian Straub,Yawar Siddiqui,Yuting Ye,Jakob Engel,Richard Newcombe,Lingni Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human activities captured, multiple egocentric wearable, egocentric wearable devices, temporally synchronized, large-scale collection

Notes:

Click to view abstract

Abstract:The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.

102. 【2603.18493】FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

Link: https://arxiv.org/abs/2603.18493

Authors: Seonghyun Jin,Jong Chul Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: enabling constant-memory inference, enabling constant-memory, constant-memory inference, persistent latent state, Streaming

Notes:

Click to view abstract

Abstract:Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise, which governs how much the latent state is expected to change between frames, is estimated online from EMA-normalized temporal drift of candidate tokens. Through extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction compared to existing methods. Code will be released at this https URL.
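
The Kalman-style gain described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `kalman_token_update`, the scalar per-token variance, and the fixed `obs_noise` and `process_noise` arguments are all assumptions made for illustration (the paper estimates process noise online, which is not reproduced here).

```python
def kalman_token_update(state, var, candidate, obs_noise, process_noise):
    """One scalar-gain Kalman-style update for a single token (hypothetical sketch).

    state, candidate: lists of floats (old token and newly proposed token)
    var: float, current uncertainty of the stored token
    obs_noise: float > 0, assumed noise of the candidate token
    process_noise: float >= 0, expected change of the state between frames
    """
    # Predict step: uncertainty grows by the process noise.
    var_pred = var + process_noise
    # Gain in (0, 1]: large when the stored state is uncertain.
    gain = var_pred / (var_pred + obs_noise)
    # Blend the stored state toward the candidate by the gain.
    new_state = [s + gain * (c - s) for s, c in zip(state, candidate)]
    # Correct step: uncertainty contracts after absorbing the observation.
    new_var = (1.0 - gain) * var_pred
    return new_state, new_var, gain
```

Under these assumptions, repeated updates with zero process noise shrink the gain (the state is retained more strongly), while a large process noise pushes the gain toward 1, which approaches a plain overwrite of the state by the candidate.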

103. 【2603.18488】TexEditor: Structure-Preserving Text-Driven Texture Editing

Link: https://arxiv.org/abs/2603.18488

Authors: Bo Zhao,Yihang Liu,Chenfeng Zhang,Huan Yang,Kun Gai,Wei Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modify object appearance, underlying geometric structure, Text-guided texture editing, texture editing aims, texture editing

Notes: 19 pages

Click to view abstract

Abstract:Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, an RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at this https URL.

104. 【2603.18481】T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Link: https://arxiv.org/abs/2603.18481

Authors: Aditi Naiknaware,Salimeh Sekeh

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: evolving data distributions, OOD detection, multimodal OOD detection, OOD, remains a critical

Notes:

Click to view abstract

Abstract:Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.

105. 【2603.18480】Do Vision Language Models Understand Human Engagement in Games?

Link: https://arxiv.org/abs/2603.18480

Authors: Ziyi Wang,Qizan Guo,Rishitosh Singh,Xiyang Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: latent psychological states, Inferring human engagement, language models, Inferring human, player-experience research

Notes:

Click to view abstract

Abstract:Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception-understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.

106. 【2603.18472】Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Link: https://arxiv.org/abs/2603.18472

Authors: Yinghui Li,Jiayi Kuang,Peng Xing,Daixian Liu,Junnan Dong,Shu-Yu Guo,Yangning Li,Qingyu Zhou,Wenhao Jiang,Hai-Tao Zheng,Ying Shen,Liang Lin,Philip S. Yu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, interpreting natural scenes, critical open question, Multimodal Large, achieved remarkable success

Notes:

Click to view abstract

Abstract:While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols, the fundamental building blocks of human cognition, remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.

107. 【2603.18466】Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

Link: https://arxiv.org/abs/2603.18466

Authors: Yuqi Yang,Dongliang Chang,Yijia Ling,Ruoyi Du,Zhanyu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: perceptually salient, controllable attributes, Colour, perceptual Lab-space Loss

Notes: 18 pages, 12 figures

Click to view abstract

Abstract:Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space Loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at this https URL.

108. 【2603.18465】MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

Link: https://arxiv.org/abs/2603.18465

Authors: Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing medical image, Existing medical, heterogeneous degradations encountered, methods are typically, modality-specific or degradation-specific

Notes:

Click to view abstract

Abstract:Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.

109. 【2603.18461】Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

Link: https://arxiv.org/abs/2603.18461

Authors: Kazuya Nishimura,Ryoma Bise,Shinnosuke Matsuo,Haruka Hirose,Yasuhiro Kojima

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: broad clinical impact, low-cost molecular analysis, Estimating slide, pathology images enables, images enables rapid

Notes: Accepted by CVPR 2026

Click to view abstract

Abstract:Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes, i.e., mean expression profiles that reflect stable gene-gene co-variation. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at this https URL.

110. 【2603.18460】Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images

Link: https://arxiv.org/abs/2603.18460

Authors: Vahid Monfared,Mohammad Hadi Gharib,Ali Sabri,Maryam Shahali,Farid Rashidi,Amit Mehta,Reza Rawassizadeh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: prostate MRI remains, remains challenging due, MRI remains challenging, prostate MRI, Prostate cancer

Notes: 26 pages, 5 figures, 7 tables

Click to view abstract

Abstract:Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.

111. 【2603.18453】Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

Link: https://arxiv.org/abs/2603.18453

Authors: Arushi Rai,Adriana Kovashka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring precise temporal, tasks requiring precise, precise temporal grounding, requiring precise, irrelevant frames

Notes:

Click to view abstract

Abstract:Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.

112. 【2603.18447】SODIUM: From Open Web Data to Queryable Databases

Link: https://arxiv.org/abs/2603.18447

Authors: Chuxuan Hu,Philip Li,Maxwell Yang,Daniel Kang

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: analytical questions, questions whose answers, wide range, answers require integrating, require integrating data

Notes:

Click to view abstract

113. 【2603.18443】SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

Link: https://arxiv.org/abs/2603.18443

Authors: Leyuan Fang,Zan Mao,Zijing Wang,Yinlong Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Zero-shot object-goal navigation, Zero-shot object-goal, object-goal navigation aims, aims to find, unseen environments

Notes:

Click to view abstract

Abstract:Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at this https URL

114. 【2603.18429】AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Link: https://arxiv.org/abs/2603.18429

Authors: Yibo Shi,Jungang Li,Linghao Zhang,Zihao Dongfang,Biao Wu,Sicheng Tao,Yibo Yan,Chenxi Qin,Weiting Liu,Zhixin Lin,Hanqian Li,Yu Huang,Song Dai,Yonghua Hei,Yue Ding,Xiang Li,Shikang Wang,Chengdong Xu,Jingqi Liu,Xueying Ma,Zhiwen Zheng,Xiaofei Zhang,Bincheng Wang,Nichen Yang,Jie Wu,Lihua Tian,Chen Li,Xuming Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paradigms remains under-explored, prevailing paradigms remains, long-horizon Android GUI, GUI agents, real-world deployment

Notes:

Click to view abstract

Abstract:Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at this https URL.

115. 【2603.18427】RD: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Link: https://arxiv.org/abs/2603.18427

Authors: Huy Che,Dinh-Duy Phan,Duc-Khai Lam

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Collecting and annotating, highly labor-intensive, Collecting, Data

Notes:

Click to view abstract

Abstract:Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances the diversity and reliability of synthetic data, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating on benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at this https URL.

116. 【2603.18423】SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning

Link: https://arxiv.org/abs/2603.18423

Authors: Minjun Kim,Jongjin Kim,U Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing ZSQ methods, accurately quantize, Zero-shot Quantization, ZSQ methods, existing ZSQ

Notes: ICLR 2025

Click to view abstract

Abstract:How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model's error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ achieves state-of-the-art accuracy over existing ZSQ methods.

117. 【2603.18418】Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

Link: https://arxiv.org/abs/2603.18418

Authors: Yang Liu,Jiyao Yang,Hongjin Zhao,Xiaoyong Li,Yanzhe Ji,Xingjian Li,Runmin Jiang,Tianyang Wang,Saeed Anwar,Dongwoo Kim,Yue Yao,Zhenyue Qin,Min Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large vision-language models, remains largely unexplored, rare conditions remains, conditions remains largely, Large vision-language

Notes:

Click to view abstract

Abstract:Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.

118. 【2603.18402】Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

Link: https://arxiv.org/abs/2603.18402

Authors: Yonghan Lee,Dinesh Manocha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, long-horizon per-Gaussian trajectories, per-Gaussian trajectories, Gaussian, Splatting

Notes:

Click to view abstract

Abstract:We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
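
The abstract mentions a differentiable Sinkhorn layer for matching instance labels across videos. As a rough illustration of the underlying normalization only, and not the authors' layer, the sketch below turns a square score matrix into an approximately doubly-stochastic "soft permutation" by alternating row and column normalization. The function name `sinkhorn` and the iteration count are assumptions.

```python
import math

def sinkhorn(scores, n_iters=200):
    """Alternate row/column normalization of exp(scores) (hypothetical sketch).

    scores: square list of lists of floats; a higher score means a more
    likely match between label i in one video and label j in another.
    Returns a matrix whose rows and columns each sum to roughly 1.
    """
    p = [[math.exp(v) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row step: each source label distributes a total mass of 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Column step: each target label receives a total mass of 1.
        col_sums = [sum(col) for col in zip(*p)]
        p = [[v / col_sums[j] for j, v in enumerate(row)] for row in p]
    return p
```

Taking the largest entry in each row of the result gives a hard label assignment; in a learned setting the scores would be trainable parameters, which this sketch does not cover.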

119. 【2603.18401】Pixel-Accurate Epipolar Guided Matching

Link: https://arxiv.org/abs/2603.18401

Authors: Oleksii Nasypanyi, Francois Rameau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: wide-baseline views, slow and unreliable, unreliable in challenging, challenging conditions, repetitive textures

Comments:

Click to view abstract

Abstract:Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
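
The angular formulation can be sketched as follows. A tolerance circle of radius r whose centre lies at distance d > r from the epipole subtends a half-angle of asin(r / d), so each keypoint maps to an interval around its bearing. This is a minimal sketch under stated assumptions: the function names and example coordinates are hypothetical, directions are treated as rays from the epipole, and a linear scan stands in for the segment tree described in the abstract.

```python
import math


def angular_interval(keypoint, epipole, radius):
    """Return (centre_angle, half_width) of a tolerance circle seen from the epipole."""
    dx, dy = keypoint[0] - epipole[0], keypoint[1] - epipole[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        # The epipole lies inside the circle, so every direction intersects it.
        return 0.0, math.pi
    return math.atan2(dy, dx), math.asin(radius / dist)


def candidates(query_angle, intervals):
    """Indices of intervals that contain query_angle.

    A linear scan is used here for clarity; it is not the logarithmic-time
    segment-tree query from the abstract.
    """
    hits = []
    for idx, (centre, half_width) in enumerate(intervals):
        # Wrap the angular difference into [-pi, pi] to handle the +/-pi seam.
        diff = math.atan2(math.sin(query_angle - centre), math.cos(query_angle - centre))
        if abs(diff) <= half_width:
            hits.append(idx)
    return hits


# Hypothetical epipole and keypoints, each with a 2-pixel tolerance circle.
epipole = (0.0, 0.0)
keypoints = [(100.0, 0.0), (0.0, 50.0), (-80.0, 1.0)]
intervals = [angular_interval(kp, epipole, radius=2.0) for kp in keypoints]
hits = candidates(math.radians(0.5), intervals)
```

Note that the same pixel tolerance produces a wider interval for keypoints near the epipole and a narrower one far away, which is why a fixed angular bin size cannot give a uniform pixel-level guarantee.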

120. 【2603.18373】To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Link: https://arxiv.org/abs/2603.18373

Authors: Rui Hong, Shuxue Quan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Latent Anomaly Detection, VLMs answer correctly, exploit language shortcuts, answer correctly, Visual Necessity Score

Comments: 14 pages, 1 figure

Click to view abstract

Abstract:When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
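
To make the KL-based idea concrete, the sketch below compares a model's answer distribution with the real image against its distribution under a counterfactual (blind) input. A large divergence suggests the answer depends on the image, while a value near zero suggests a language shortcut. The direction of the divergence, the function name, and all probabilities are assumptions for illustration; the abstract only states that visual dependency is measured via KL divergence.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same answer options."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical answer distributions over three options.
p_with_image = [0.85, 0.10, 0.05]   # model sees the real image
p_blind = [0.40, 0.35, 0.25]        # image replaced by a blank input, answer shifts
p_shortcut = [0.84, 0.11, 0.05]     # blank input, answer barely changes

score_dependent = kl_divergence(p_with_image, p_blind)
score_shortcut = kl_divergence(p_with_image, p_shortcut)
```

In this toy case `score_dependent` is roughly 0.43 nats while `score_shortcut` is close to zero, so only the first model would count as visually grounded under this reading.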

121. 【2603.18348】Epistemic Generative Adversarial Networks

Link: https://arxiv.org/abs/2603.18348

Authors: Muhammad Mubashar, Fabio Cuzzolin

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative Adversarial Networks, Adversarial Networks, frequently generating similar, generating similar samples, Generative Adversarial

Comments: 14 pages, 6 figures

Click to view abstract

Abstract:Generative models, particularly Generative Adversarial Networks (GANs), often suffer from a lack of output diversity, frequently generating similar samples rather than a wide range of variations. This paper introduces a novel generalization of the GAN loss function based on Dempster-Shafer theory of evidence, applied to both the generator and discriminator. Additionally, we propose an architectural enhancement to the generator that enables it to predict a mass function for each image pixel. This modification allows the model to quantify uncertainty in its outputs and leverage this uncertainty to produce more diverse and representative generations. Experimental evidence shows that our approach not only improves generation variability but also provides a principled framework for modeling and interpreting uncertainty in generative processes.

122. 【2603.18343】VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

Link: https://arxiv.org/abs/2603.18343

Authors: Bo-Cheng Qiu, Yu-Fan Lin, Yu-Zhe Pien, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Capsule endoscopy event, noisy video streams, diagnostically relevant findings, Capsule endoscopy, endoscopy event detection

Comments:

Click to view abstract

Abstract:Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
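
The decoding stage described above (temporal smoothing, thresholding, per-label event generation) can be pictured with a very small sketch. It assumes a centred moving average and a single fixed threshold for one label; the function names, window size, threshold, and probabilities are hypothetical, and the anatomical constraints and threshold refinement from the abstract are not modelled.

```python
def smooth(probs, window=3):
    """Centred moving average; the window is truncated at the sequence borders."""
    half = window // 2
    out = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out


def decode_events(probs, threshold=0.5, window=3):
    """Turn per-frame probabilities for one label into (start, end) frame events."""
    events, start = [], None
    for i, p in enumerate(smooth(probs, window)):
        if p >= threshold and start is None:
            start = i                      # an event opens
        elif p < threshold and start is not None:
            events.append((start, i - 1))  # the event closes on the previous frame
            start = None
    if start is not None:
        events.append((start, len(probs) - 1))
    return events


# Hypothetical frame-level probabilities with a one-frame dropout at index 3.
frame_probs = [0.1, 0.2, 0.9, 0.1, 0.9, 0.8, 0.9, 0.2, 0.1, 0.1]
events = decode_events(frame_probs)
```

Smoothing bridges the one-frame dropout and suppresses the isolated spike at frame 2, yielding a single contiguous event instead of several fragments, which is the kind of behaviour an event-level metric rewards.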

123. 【2603.18315】DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving

Link: https://arxiv.org/abs/2603.18315

Authors: Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Junwei You, Sicong Jiang, Sikai Chen

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ensuring safe decision-making, remains a fundamental, fundamental challenge, challenge despite rapid, rapid advances

Comments: 32 pages, 15 figures. Code and demo available online

Click to view abstract

Abstract:Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: this https URL

124. 【2603.18309】Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI

Link: https://arxiv.org/abs/2603.18309

Authors: Md Hasibul Husain Hisham, Shireen Elhabian, Ganesh Adluru, Jason Mendes, Andrew Arai, Eugene Kholmovski, Ravi Ranjan, Edward DiBella

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: MRI requires robust, recover thin atrial, late gadolinium enhancement, thin atrial structures, requires robust reconstruction

Comments:

Click to view abstract

Abstract:Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.

125. 【2603.18306】Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction

Link: https://arxiv.org/abs/2603.18306

Authors: Devjyoti Chakraborty, Zaki Sukma, Rakandhiya D. Rachmanto, Kriti Ghosh, In Kee Kim, Suchendra M. Bhandarkar, Lakshmish Ramaswamy, Nancy K. O'Hare, Deepak Mishra

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Neural Radiance Fields, Radiance Fields, Neural Radiance, approach for photorealistic, Neural Architecture Search

Comments:

Click to view abstract

Abstract:Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in 30 seconds with 1 dB prediction error, achieving 1000$\times$ speedup over NAS. We further demonstrate PreSCAN's deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.

126. 【2603.18298】Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision

Link: https://arxiv.org/abs/2603.18298

Authors: Nikhil Gosala, B. Ravi Kiran, Senthil Yogamani, Abhinav Valada

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling autonomous agents, estimate temporally consistent, object tracking aims, temporally consistent, enabling autonomous

Comments: 22 pages, 8 figures

Click to view abstract

Abstract:Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Leveraging these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, effectively transforming sparse supervision into dense 3D track annotations. This enables existing fully-supervised trackers to effectively operate under extreme label sparsity. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method significantly improves tracking performance, achieving an improvement of up to 15.50 p.p. while using at most four ground truth annotations per track.

127. 【2603.18282】CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Link: https://arxiv.org/abs/2603.18282

Authors: Marios Krestenitis, Christos Tzelepis, Konstantinos Ioannidis, Steafanos Vrochidis, Ioannis Kompatsiaris, Georgios Tzimiropoulos, Shaogang Gong, Ioannis Patras

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual question answering, achieved remarkable progress, visual reasoning, visual question, question answering

Comments:

Click to view abstract

Abstract:Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this either via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
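
The reward-and-advantage step can be sketched as follows: several captions are sampled for one image, each caption is turned back into an image, and each reconstruction is scored by its similarity to the original. GRPO then standardises the rewards within the group. This is a minimal sketch under assumptions: cosine similarity over toy embedding vectors stands in for whatever image-similarity measure the paper uses, and all names and numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one sampled group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy embedding of the original image and of four reconstructions,
# one per sampled caption (hypothetical values).
original = [1.0, 0.0, 0.0]
reconstructions = [
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
]
rewards = [cosine_similarity(original, rec) for rec in reconstructions]
advantages = group_relative_advantages(rewards)
```

Captions whose reconstructions sit closest to the original receive positive advantages and the rest negative ones, so no human-written reference caption is needed to rank the samples.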

128. 【2603.18261】LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Link: https://arxiv.org/abs/2603.18261

Authors: Tamer Shanableh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: conventional video codecs, encode entire video, entire video sequences, Neural Representations, encode entire

Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
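
To see why a low-rank separable layer is cheaper than a dense 3x3 convolution, the arithmetic below counts weights for one layer. It assumes one common factorization, a 3x1 convolution into `rank` channels followed by a 1x3 convolution out of them; the abstract does not spell out the exact structure, and the channel sizes and rank are hypothetical, so the numbers are not the paper's and are not meant to reproduce its 68% figure. Biases are ignored.

```python
def dense_conv_params(c_in, c_out, k=3):
    """Weights in a dense k x k convolution."""
    return k * k * c_in * c_out


def low_rank_separable_params(c_in, c_out, rank, k=3):
    """Weights in a (k x 1) conv c_in -> rank followed by a (1 x k) conv rank -> c_out."""
    return k * c_in * rank + k * rank * c_out


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, rank = 96, 384, 32
dense = dense_conv_params(c_in, c_out)                    # 9 * 96 * 384
low_rank = low_rank_separable_params(c_in, c_out, rank)   # 3 * 96 * 32 + 3 * 32 * 384
reduction = 1 - low_rank / dense
```

With these sizes the dense layer has 331,776 weights and the factorized one 46,080, a reduction of about 86%. For a fixed spatial resolution the multiply-accumulate count scales with the same ratio, which is why factorizing the highest-resolution stage has the largest effect on GFLOPs.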

129. 【2603.18235】Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Link: https://arxiv.org/abs/2603.18235

Authors: Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, promise transformative advances, evolve into autonomous, dangerous explosions, language models

Comments:

Click to view abstract

Abstract:As large language models (LLMs) evolve into autonomous "AI scientists," they promise transformative advances but introduce novel vulnerabilities, from potential "biosafety risks" to "dangerous explosions." Ensuring trustworthy deployment in science requires a new paradigm centered on reliability (ensuring factual accuracy and reproducibility), safety (preventing unintentional physical or biological harm), and security (preventing malicious misuse). Existing general-purpose safety benchmarks are poorly suited for this purpose, suffering from a fundamental domain mismatch, limited threat coverage of science-specific vectors, and benchmark overfitting, which create a critical gap in vulnerability evaluation for scientific applications. This paper examines the unique security and safety landscape of LLM agents in science. We begin by synthesizing a detailed taxonomy of LLM threats contextualized for scientific research, to better understand the unique risks associated with LLMs in science. Next, we conceptualize a mechanism to address the evaluation gap by utilizing dedicated multi-agent systems for the automated generation of domain-specific adversarial security benchmarks. Based on our analysis, we outline how existing safety methods can be brought together and integrated into a conceptual multilayered defense framework designed to combine a red-teaming exercise and external boundary controls with a proactive internal Safety LLM Agent. Together, these conceptual elements provide a necessary structure for defining, evaluating, and creating comprehensive defense strategies for trustworthy LLM agent deployment in scientific disciplines.

130. 【2603.18218】Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.18218

Authors: Guillem Casadesus Vila, Adam Dai, Grace Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: including poorly textured, poorly textured environments, limited computational resources, require robust perception, surface require robust

Comments:

Click to view abstract

Abstract:Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.

131. 【2603.18192】MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Link: https://arxiv.org/abs/2603.18192

Authors: Alexander Rasch, Rahul Rajendra Pai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vulnerable road users, parked micromobility vehicles, support traffic safety, traffic safety, road users

Comments:

Click to view abstract

Abstract:Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images -- a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as "person", or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at this https URL.

132. 【2603.18178】VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Link: https://arxiv.org/abs/2603.18178

Authors: Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods, John Kenyon, Tsung-Yi Lin, Xiaodong Yang, Ming-Yu Liu, Kevin Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: detecting safety-critical events, ego-centric dashcam footage, dashcam footage presents, generic vision models, rapid growth

Comments: 16 pages, 9 figures, submitted to arXiv

Click to view abstract

Abstract:The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

133. 【2603.18118】Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.18118

Authors: Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Multi-modal Large Language, Large Language, achieved remarkable reliability, Language Models

Comments: arXiv admin note: text overlap with [arXiv:2411.14432](https://arxiv.org/abs/2411.14432)

Click to view abstract

Abstract:Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

134. 【2603.18108】From Concepts to Judgments: Interpretable Image Aesthetic Assessment

Link: https://arxiv.org/abs/2603.18108

Authors: Xiao-Chang Liu, Johan Wagemans

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to predict, Image aesthetic assessment, IAA, aesthetic assessment, aesthetic

Comments: 12 pages, 8 figures

Click to view abstract

Abstract:Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.

135. 【2603.18101】Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Link: https://arxiv.org/abs/2603.18101

Authors: Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Recent adapter-based CLIP, adapter-based CLIP tuning, CLIP tuning, Recent adapter-based, adapter-based CLIP

Comments: Accepted at The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

Click to view abstract

Abstract:Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at this https URL.
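
For context, the inference path that stays unchanged can be sketched as a Tip-Adapter-style cache lookup: the test feature is compared against cached support features (keys), the resulting affinities are multiplied by one-hot labels (values), and the result is added to the zero-shot text logits. This is a simplified sketch under assumptions: features are tiny hand-written unit vectors, the hyperparameters `alpha` and `beta` are placeholder values, and the logit scaling used by CLIP is omitted. How the graph teacher refines the cache during training is not shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tip_adapter_logits(feature, cache_keys, cache_values, text_weights,
                       alpha=1.0, beta=5.5):
    """Cache-augmented logits for one L2-normalised image feature.

    cache_keys:   support-image features (one per cached sample)
    cache_values: one-hot class labels for those samples
    text_weights: class text embeddings (zero-shot classifier)
    """
    n_classes = len(text_weights)
    # Zero-shot term: similarity between the image feature and each class text.
    clip_logits = [dot(feature, w) for w in text_weights]
    # Cache term: affinity to each cached key, sharpened by beta.
    affinities = [math.exp(-beta * (1.0 - dot(feature, key))) for key in cache_keys]
    cache_logits = [
        sum(a * value[c] for a, value in zip(affinities, cache_values))
        for c in range(n_classes)
    ]
    return [cl + alpha * ca for cl, ca in zip(clip_logits, cache_logits)]


# Toy 2-D example with one cached sample per class (hypothetical values).
feature = [1.0, 0.0]
cache_keys = [[1.0, 0.0], [0.0, 1.0]]
cache_values = [[1.0, 0.0], [0.0, 1.0]]
text_weights = [[0.6, 0.8], [0.8, 0.6]]
logits = tip_adapter_logits(feature, cache_keys, cache_values, text_weights)
predicted = max(range(len(logits)), key=lambda c: logits[c])
```

In this toy case the text classifier alone would pick class 1, but the cache term flips the prediction to class 0. Better cached keys and values therefore translate directly into better predictions with no change to this lookup, which is the lever a training-only teacher can pull.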

136. 【2603.18095】Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling

Link: https://arxiv.org/abs/2603.18095

Authors: Sooyoung Ryu,Mathieu Salzmann,Saqib Javed

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: degrade generation quality, deploy large diffusion, Post-training quantization, generation quality, large diffusion models

Comments: 29 pages, 6 figures

Abstract:Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.
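
To make the calibration step concrete, the sketch below estimates a per-timestep variance of the quantization error from paired full-precision and quantized runs. The abstract does not specify the statistic or the drift adjustment, so this shows only one plausible variance estimate. The array shapes, the synthetic data, and the function name are assumptions.

```python
import numpy as np

def timestep_error_variance(fp_outputs, q_outputs):
    """Per-timestep variance of the quantization error.

    fp_outputs, q_outputs: arrays of shape (runs, timesteps, dim) holding the
    model outputs of paired full-precision and quantized sampling runs.
    Returns an array of shape (timesteps,).
    """
    err = q_outputs - fp_outputs
    # Pool over calibration runs and feature dimensions, keep the timestep axis.
    return err.var(axis=(0, 2))

# Synthetic stand-in for 5 paired calibration runs (assumption): the
# quantization noise grows along the 10 timesteps.
rng = np.random.default_rng(0)
runs, timesteps, dim = 5, 10, 16
fp = rng.normal(size=(runs, timesteps, dim))
sigma = np.linspace(0.01, 0.1, timesteps)
q = fp + rng.normal(size=fp.shape) * sigma[None, :, None]

var_t = timestep_error_variance(fp, q)
print(var_t.shape)
```

A sampler-side correction would then consume `var_t` at each step. How Q-Drift turns it into a drift adjustment is described in the paper, not here.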

137. 【2603.18093】One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Link: https://arxiv.org/abs/2603.18093

Authors: Haoxiang Rao,Zhao Wang,Chenyang Si,Yan Lyu,Yuanyi Duan,Fang Zhao,Caifeng Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Industrial anomaly detection, Industrial anomaly, abundance of normal, anomaly, downstream anomaly detection

Comments: Accepted by CVPR2026

Abstract:Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

138. 【2603.18091】Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Link: https://arxiv.org/abs/2603.18091

Authors: Chen Zhao,Zhuoran Wang,Haoyang Li,Shifeng Bao,Guanlin Li,Youhe Feng,Yang Li,Jie Tang,Jing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: recently demonstrated strong, demonstrated strong performance, models have recently, embodied tasks, recently demonstrated

Abstract:Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with a single-pass VLM reranking overhead.
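
The draft-and-verify selection can be sketched as a small reranking routine. The abstract only says "perplexity-style metric", so the exact score is an assumption: this sketch uses standard perplexity over per-token log-probabilities and picks the lowest. The candidate names and log-probabilities are made up, and the single batched forward pass that would produce them is not modeled.

```python
import math

def perplexity(token_logprobs):
    # Standard perplexity: exp of the negative mean token log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_action_chunk(candidates, candidate_logprobs):
    """Return the candidate with the lowest perplexity, plus all scores.

    candidates:         list of drafted action chunks (any objects)
    candidate_logprobs: one list of per-token log-probabilities per candidate,
                        assumed to come from the verifier model
    """
    scores = [perplexity(lp) for lp in candidate_logprobs]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores

# Hypothetical drafts and verifier log-probabilities.
candidates = ["chunk_a", "chunk_b", "chunk_c"]
candidate_logprobs = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -1.8, -2.2],
]

chosen, scores = select_action_chunk(candidates, candidate_logprobs)
print(chosen)  # chunk_b, the draft the verifier finds most likely
```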

139. 【2603.18089】CytoSyn: a Foundation Diffusion Model for Histopathology -- Tech Report

Link: https://arxiv.org/abs/2603.18089

Authors: Thomas Duboudin,Xavier Fontaine,Etienne Andrier,Lionel Guillou,Alexandre Filiot,Thalyssa Baiocco-Rodrigues,Antoine Olivier,Alberto Romagnoni,John Klein,Jean-Baptiste Schiratti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: clinically ready tools, made significant progress, Computational pathology, fundamental disease understanding, recent years

Comments: 21 pages, 5 figures, tech report, model page: https://huggingface.co/Owkin-Bioptimus/CytoSyn

Abstract:Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse H&E-stained histopathology images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work in depth to PixCell, a state-of-the-art model. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance when generating inflammatory bowel disease images. To support the research community, we publicly release CytoSyn's weights, its training and validation datasets, and a sample of synthetic images in this repository: this https URL.

140. 【2603.18086】SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

Link: https://arxiv.org/abs/2603.18086

Authors: Wei Tang,Xuejing Liu,Yanpeng Sun,Zechao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Referring Expression Segmentation, Segment Anything Model, Referring Expression, general image segmentation, understand natural language

Abstract:The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: this https URL.

141. 【2603.18082】EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

Link: https://arxiv.org/abs/2603.18082

Authors: Xinyuan Qian,Xinjia Zhu,Alessio Brutti,Dong Liang

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: human social interactions, understanding human social, social interactions, aiming to determine, pivotal component

Abstract:The TTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to missing visual data, neglect of the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response. Evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.

142. 【2603.18067】DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment

Link: https://arxiv.org/abs/2603.18067

Authors: Wuqi Wang,Haochen Yang,Baolu Li,Jiaqi Sun,Xiangmo Zhao,Zhigang Xu,Qing Guo,Haigen Min,Tianyun Zhang,Hongkai Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Keywords: low-light enhancement, autonomous driving, low-light, vision-centric perception systems, driving

Comments: 8 pages, 8 figures. Accepted to ICRA 2026

Abstract:Low-light conditions challenge vision-centric perception systems for autonomous driving in dark environments. In this paper, we propose a new benchmark dataset, DarkDriving, to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets are collected by controlling exposure, which is feasible only in small-range, static scenes. The dark images in current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day and night aligned dataset in dynamic driving scenes has significantly limited research in this area. Using a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial content, with an alignment error of only several centimeters. For each pair, we also manually label 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and for 3D detection in autonomous driving in the dark environment. The experimental results show that DarkDriving provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and that it also generalizes to enhancing dark images and improving detection in other low-light driving environments, such as nuScenes.

143. 【2603.18062】S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Link: https://arxiv.org/abs/2603.18062

Authors: Naichuan Zheng,Hailun Xia,Zepeng Sun,Weiyi Li,Yujia Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Artificial Neural Networks, power-hungry Artificial Neural, Spiking Neural Networks, resource-constrained edge devices, Neural Networks

Abstract:Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.

144. 【2603.18045】RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

Link: https://arxiv.org/abs/2603.18045

Authors: X. Gao,C. Chien,G. Liu,A. Manullang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gastro Competition, capsule endoscopic videos, Competition for multi-label, Google Vision Transformer, multi-label classification

Abstract:This work corresponds to the Gastro Competition for multi-label classification from capsule endoscopic videos (CEV). Transformer-based deep learning networks are fine-tuned for this task. The baseline model is the Google Vision Transformer (ViT) with patch size 16 at 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On the test dataset of three videos, the overall mAP @0.5 is 0.0205, whereas the overall mAP @0.95 is 0.0196.

145. 【2603.18572】UEPS: Robust and Efficient MRI Reconstruction

Link: https://arxiv.org/abs/2603.18572

Authors: Xiang Zhou,Hong Shang,Zijian Zhan,Tianyu He,Jintao Meng,Dong Liang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep unrolled models, accelerated MRI reconstruction, Deep unrolled, domain shift remains, MRI reconstruction

Comments: The document contains the main paper and additional experimental details in the supplementary material. Open-source code can be found at: https://github.com/HongShangGroup/UEPS

Abstract:Deep unrolled models (DUMs) have become the state of the art for accelerated MRI reconstruction, yet their robustness under domain shift remains a critical barrier to clinical adoption. In this work, we identify coil sensitivity map (CSM) estimation as the primary bottleneck limiting generalization. To address this, we propose UEPS, a novel DUM architecture featuring three key innovations: (i) an Unrolled Expanded (UE) design that eliminates CSM dependency by reconstructing each coil independently; (ii) progressive resolution, which leverages k-space-to-image mapping for efficient coarse-to-fine refinement; and (iii) sparse attention tailored to MRI's 1D undersampling nature. These physics-grounded designs enable simultaneous gains in robustness and computational efficiency. We construct a large-scale zero-shot transfer benchmark comprising 10 out-of-distribution test sets spanning diverse clinical shifts -- anatomy, view, contrast, vendor, field strength, and coil configurations. Extensive experiments demonstrate that UEPS consistently and substantially outperforms existing DUM, end-to-end, diffusion, and untrained methods across all OOD tests, achieving state-of-the-art robustness with low-latency inference suitable for real-time deployment.

146. 【2603.18554】End-to-End QGAN-Based Image Synthesis via Neural Noise Encoding and Intensity Calibration

Link: https://arxiv.org/abs/2603.18554

Authors: Xue Yang,Rigui Zhou,Shizheng Jia,Dax Enshan Koh,Siong Thye Goh,Yaochong Li,Hongyu Chen,Fuhui Xiong

Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative Adversarial Networks, Quantum Generative Adversarial, Adversarial Networks, Generative Adversarial, learning data distributions

Abstract:Quantum Generative Adversarial Networks (QGANs) offer a promising path for learning data distributions on near-term quantum devices. However, existing QGANs for image synthesis avoid direct full-image generation, relying on classical post-processing or patch-based methods. These approaches dilute the quantum generator's role and struggle to capture global image semantics. To address this, we propose ReQGAN, an end-to-end framework that synthesizes an entire N=2^D-pixel image using a single D-qubit quantum circuit. ReQGAN overcomes two fundamental bottlenecks hindering direct pixel generation: (1) the rigid classical-to-quantum noise interface and (2) the output mismatch between normalized quantum statistics and the desired pixel-intensity space. We introduce a learnable Neural Noise Encoder for adaptive state preparation and a differentiable Intensity Calibration module to map measurements to a stable, visually meaningful pixel domain. Experiments on MNIST and Fashion-MNIST demonstrate that ReQGAN achieves stable training and effective image synthesis under stringent qubit budgets, with ablation studies verifying the contribution of each component.
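
The "output mismatch" mentioned in the abstract is easy to see numerically: the measurement probabilities of a D-qubit state give N = 2^D values that sum to 1, so raw values are far too small to act as pixel intensities. The sketch below uses a random state vector in place of a trained circuit and a simple divide-by-max rescaling in place of the paper's learned Intensity Calibration module. Both substitutions are assumptions made only for illustration.

```python
import numpy as np

D = 6                      # number of qubits (illustrative)
N = 2 ** D                 # 64 basis states -> 64 pixels, i.e. an 8x8 image
side = int(np.sqrt(N))

# Random normalized state vector standing in for the generator circuit output.
rng = np.random.default_rng(0)
amps = rng.normal(size=N) + 1j * rng.normal(size=N)
amps /= np.linalg.norm(amps)

# Born rule: measurement probabilities are squared amplitude magnitudes.
probs = np.abs(amps) ** 2

# Because the probabilities sum to 1, their mean is exactly 1/N.
print(probs.sum(), probs.mean())

# Stand-in calibration: rescale so the brightest pixel has intensity 1.
image = (probs / probs.max()).reshape(side, side)
print(image.shape, image.max())
```

With D = 6 the mean raw value is 1/64, which shows why some mapping from normalized statistics to the pixel-intensity range is needed before the output looks like an image.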

147. 【2603.18544】SCISSR: Scribble-Conditioned Interactive Surgical Segmentation and Refinement

Link: https://arxiv.org/abs/2603.18544

Authors: Haonan Ping,Jian Jiang,Cheng Yuan,Qizhen Sun,Lv Wu,Yutong Ban

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: thin structures, Accurate segmentation, irregular shapes, frequent occlusions, tissues and instruments

Abstract:Accurate segmentation of tissues and instruments in surgical scenes is annotation-intensive due to irregular shapes, thin structures, specularities, and frequent occlusions. While SAM models support point, box, and mask prompts, points are often too sparse and boxes too coarse to localize such challenging targets. We present SCISSR, a scribble-promptable framework for interactive surgical scene segmentation. It introduces a lightweight Scribble Encoder that converts freehand scribbles into dense prompt embeddings compatible with the mask decoder, enabling iterative refinement for a target object by drawing corrective strokes on error regions. Because all added modules (the Scribble Encoder, Spatial Gated Fusion, and LoRA adapters) interact with the backbone only through its standard embedding interfaces, the framework is not tied to a single model: we build on SAM 2 in this work, yet the same components transfer to other prompt-driven segmentation architectures such as SAM 3 without structural modification. To preserve pre-trained capabilities, we train only these lightweight additions while keeping the remaining backbone frozen. Experiments on EndoVis 2018 demonstrate strong in-domain performance, while evaluation on the out-of-distribution CholecSeg8k further confirms robustness across surgical domains. SCISSR achieves 95.41% Dice on EndoVis 2018 with five interaction rounds and 96.30% Dice on CholecSeg8k with three interaction rounds, outperforming iterative point prompting on both benchmarks.
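
The Dice scores quoted above follow the usual overlap definition, 2|A ∩ B| / (|A| + |B|), for a predicted mask A and a ground-truth mask B. A minimal sketch for binary masks follows. The toy masks and the `eps` smoothing term are assumptions, and the paper's exact evaluation protocol (per-class averaging, interaction rounds) is not reproduced.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 in common.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])

dice = dice_score(pred, target)
print(round(dice, 3))  # 2*2 / (3+3), about 0.667
```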